From: David Robinson (David.Robinson@sun.com)
Date: 02/06/03-11:21:20 AM Z
Message-ID: <3E429990.2050705@Sun.COM>
Date: Thu, 06 Feb 2003 11:21:20 -0600
From: David Robinson <David.Robinson@sun.com>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]

Dan Oscarsson wrote:

>>What I was getting at is that if the server does not enforce a
>>normalization form for filenames then the clients had better use one
>>normalization form to avoid interop problems, and if the client does the
>>normalization then that task can be moved to user-level, as in libc.
>>
>
> Yes, and because of that I think the protocol must mandate
> normalisation form C. Then the server (kernel) code can assume the
> format to be that and does not need to do any normalisation. If a client
> sends text in another form and thereby violates the protocol, strange
> things may happen and bad answers. That is ok, that should happen if
> you do not follow the protocol. The whole meaning of defining a protocol
> is to agree on what "language" to speak and the meaning of the "words".
> If the protocol says UCS form C encoded using UTF-8 and somebody
> transmits UCS-2, you violate the protocol.
> If we define the normalisation form to be "undefined", we get a protocol
> where the "words" are loosely defined. I see no reason to allow more than
> one form of a "word". Why complicate the world? By selecting the most
> favoured normalisation form many systems can directly send text over
> the protocol without change. Both the Unix community and the World Wide Web
> have selected form C to be used. I am sure there are many more.

I am far from an expert on normalization, but in following this thread
it seems to boil down to the question of "who makes it right" and what
"right" is. For the latter, I am going to defer to the normalization
experts to determine what the best "form" is.

Traditionally, NFS has had a philosophy of dumb servers and smart
clients, which has allowed the development of high-performance,
low-impact servers.
The client has been the party responsible for mapping its semantics
onto the protocol definition. Based on this and my understanding of
Unicode, the server should at most validate that the UTF-8 string it
receives is actually a valid encoding; it should not perform the
complex process of normalization.

If we follow this approach, two clients that use different
normalization forms may send different UTF-8 strings to access the same
file, and one or both may fail depending on the form in which the server
stored the name. Again, for simplicity, the server is just doing a
bitwise comparison. Some files may be inaccessible to certain clients,
but as Nico says above, we already have this today and it doesn't seem
to be a problem.

To mitigate this problem, we can either mandate a normalization form on
the wire ("MUST use form XYZ") or recommend that clients use a common
normalization form ("SHOULD use form XYZ"). I would tend to favor the
latter, as it allows clients in a homogeneous environment to avoid
paying the price of normalizing to a form they do not prefer.

It is also useful to note that if a filename uses a form that is not
the preferred one, the client can still access it by performing a
READDIR and passing the returned bytes back unchanged in a subsequent
LOOKUP. What is presented to the application need not be what the
server returns, as long as the client performs the mapping function.

So the interesting options are:

1) Server performs normalization
2) Protocol specifies wire standard normalization
3) Client uses recommended common normalization

Given where we are in the IETF process, I suggest #3, with the WG
publishing the recommended normalization form. Over time (in a minor
version?), we could migrate to option #2.

-David
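
A minimal sketch of the behavior discussed in the message above, written
in Python for illustration only. ToyServer, to_wire, and
lookup_via_readdir are hypothetical names and are not part of NFSv4 or
any real implementation. The sketch assumes a server that only checks
for well-formed UTF-8 and compares names byte for byte, and a client
that prefers form C (NFC).

```python
import unicodedata


def is_valid_utf8(name: bytes) -> bool:
    """Server-side check: well-formed UTF-8 only, no normalization."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


class ToyServer:
    """Hypothetical stand-in for one directory; names are opaque bytes."""

    def __init__(self):
        self._entries = {}

    def create(self, name: bytes, handle: str) -> None:
        if not is_valid_utf8(name):
            raise ValueError("name is not valid UTF-8")
        self._entries[name] = handle

    def lookup(self, name: bytes):
        # Byte-for-byte comparison; returns None when nothing matches.
        return self._entries.get(name)

    def readdir(self):
        return list(self._entries)


def to_wire(name: str, form: str = "NFC") -> bytes:
    """Client-side: normalize to the chosen form, then encode as UTF-8."""
    return unicodedata.normalize(form, name).encode("utf-8")


def lookup_via_readdir(server: ToyServer, name: str):
    """Client-side fallback: find the stored bytes with READDIR, then
    pass exactly those bytes back in LOOKUP."""
    wanted = unicodedata.normalize("NFC", name)
    for raw in server.readdir():
        if unicodedata.normalize("NFC", raw.decode("utf-8")) == wanted:
            return server.lookup(raw)
    return None


server = ToyServer()
# A client using the decomposed form (NFD) creates the file.
server.create(to_wire("caf\u00e9", "NFD"), "fh-1")

nfc_name = to_wire("caf\u00e9", "NFC")  # precomposed e-acute
nfd_name = to_wire("caf\u00e9", "NFD")  # "e" followed by a combining accent

print(nfc_name != nfd_name)                       # the two forms differ on the wire
print(server.lookup(nfc_name))                    # direct lookup with the other form
print(lookup_via_readdir(server, "caf\u00e9"))    # fallback through READDIR
```

If every client followed the recommended form (option 3), the direct
lookup would match and the READDIR fallback would be needed only for
names created by clients that did not follow the recommendation.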
This archive was generated by hypermail 2.1.2 : 03/04/05-02:12:05 AM Z CST