From: Nicolas Williams (Nicolas.Williams@sun.com)
Date: 01/28/03-01:33:12 PM Z
Date: Tue, 28 Jan 2003 13:33:12 -0600
From: Nicolas Williams <Nicolas.Williams@sun.com>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Message-ID: <20030128133311.J16765@binky.central.sun.com>

On Tue, Jan 28, 2003 at 05:54:37PM +0100, Dan Oscarsson wrote:
> >Normalization form C ALWAYS starts with decomposition, that is,
> >normalization form D.
> >
> >Any software which can perform normalization form C can also perform
> >normalization form D.
>
> That depends on what you normalise from. If you start with any of the
> ISO 8859-1 character sets then converting and normalising into
> UCS form C is easily done without any knowledge of form D.

That's codeset conversion. In the NFSv4 case we're talking about the
client sending unnormalized UTF-8 (therefore Unicode) filename strings.
The server then has to normalize to a canonical form (to prevent equal
name / unequal encoding conflicts). This process ALWAYS starts with
normalization to form D; end of story.

It may well be that the majority of clients are doing early
normalization (possibly by converting from other codesets straight to
UTF-8, Unicode form C). If so, one form may be better than another
simply because it is the most commonly used. But this is not relevant
to the above statement that you took issue with :)

> >I believe that this is what the draft specifies. An on-the-wire
> >normalization form specification would be an optimization, but is not
> >absolutely necessary.
>
> OK. What will happen if we do not require it?

The server has to normalize the client's filenames to avoid equal name /
unequal encoding conflicts. If an on-the-wire form were required, the
server would only have to check that clients send normalized filenames
and return some error if they don't.

In other words: nothing. I.e., NFSv4 is NOT broken wrt i18n by not
specifying an on-the-wire normalization form for filenames.

I've said this more than once now - do you take issue with this
statement?
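
[Illustration, not part of the draft or of any NFSv4 implementation: a
minimal Python sketch of the "equal name / unequal encoding" conflict and
of the two server policies described above (normalize, or check and
return an error). The helper names canonical_name and
accept_only_normalized are hypothetical, and the choice of form C as the
default is only an assumption made for the example.]

```python
import unicodedata


def canonical_name(name: str, form: str = "NFC") -> str:
    """Hypothetical helper: bring a filename into one fixed normalization form."""
    return unicodedata.normalize(form, name)


def accept_only_normalized(name: str, form: str = "NFC") -> str:
    """Hypothetical alternative policy: reject names that are not already normalized."""
    if unicodedata.normalize(form, name) != name:
        raise ValueError("filename is not in normalization form " + form)
    return name


composed = "caf\u00e9"     # e-acute as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

# The same name to a user, but different byte strings on the wire.
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# After normalizing to either form, the two spellings compare equal.
assert canonical_name(composed, "NFC") == canonical_name(decomposed, "NFC")
assert canonical_name(composed, "NFD") == canonical_name(decomposed, "NFD")
```

[Either form removes the conflict; which one to pick is the question
being argued in this thread.]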

> I assume the same as that which happens now with NFSv3.

Oh no, not at all. NFSv4 uses UTF-8, and therefore Unicode, on the wire
for filenames - not so for NFSv3. Big difference.

> I have a user working on a Mac that has mounted a file system
> through NFSv3 from our Unix machine. When a file name is created
> on the Mac it is written as UTF-8 encoded names of some form of Unicode.
> When I look at the same file from my Unix machine the name is a mess.
> All my software on Unix uses ISO 8859-1.

See above. NFSv4 clients are responsible for any conversions to/from
the users' locales' codesets.

It seems clear to me that you have not read the NFSv4 draft and thought
this through carefully.

> >Normalization form C is limited to composed characters defined in
> >Unicode 3.0 (we're past 3.0); as such it really means "composed for
> >these characters, decomposed for everything else" - so why not just
> >decomposed then?
>
> Normalisation form C is for the current version of Unicode (3.2).
> Form C has most characters that have been "precomposed" before in
> that form (in legacy character sets). People are used to working with
> precomposed characters.

No, normalization form C is limited to using composed characters defined
in Unicode 3.0. I'll search for a reference tonight and post it
tomorrow, but I'm quite sure of this.

> >I don't think encoding length is that big a deal, but cycles spent in
> >normalization, space dedicated to normalization data structures, *that*
> >is a big deal.
> >
> >This is why I'm for form D (on the wire as an optimization).
>
> Why would form D result in smaller tables?
> Most text today is already in precomposed form.

See way above. We're talking about the server applying a normalization
form to UTF-8 encoded filenames sent by the client; we're NOT talking
about codeset conversion.

Normalization to form D unquestionably requires less data than
normalization to form C.

[...]
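
[Illustration, not taken from the draft: a minimal Python sketch of the
client-side codeset conversion mentioned above, for a client whose
locale uses ISO 8859-1. The function names local_to_wire and
wire_to_local are hypothetical, and composing to form C before
converting back to the local codeset is an assumption made here so that
decomposed names can still be represented in ISO 8859-1.]

```python
import unicodedata


def local_to_wire(raw: bytes, codeset: str = "iso-8859-1") -> bytes:
    """Hypothetical client helper: local-codeset filename bytes -> UTF-8 bytes."""
    return raw.decode(codeset).encode("utf-8")


def wire_to_local(wire: bytes, codeset: str = "iso-8859-1") -> bytes:
    """Hypothetical client helper: UTF-8 filename bytes -> local-codeset bytes.

    Composes to form C first (an assumption of this sketch); characters the
    local codeset cannot represent are replaced with '?'.
    """
    text = unicodedata.normalize("NFC", wire.decode("utf-8"))
    return text.encode(codeset, errors="replace")


# A name typed in an ISO 8859-1 locale becomes UTF-8 on the wire.
assert local_to_wire(b"caf\xe9") == b"caf\xc3\xa9"

# A decomposed name from another client can still be shown locally.
assert wire_to_local(b"cafe\xcc\x81") == b"caf\xe9"

# Without any conversion, UTF-8 bytes read as ISO 8859-1 are "a mess".
assert b"caf\xc3\xa9".decode("iso-8859-1") == "caf\u00c3\u00a9"
```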

> But if we do like is intended in web and DNS to have the normalisation
> done by sender, the server can assume that incoming data on the wire
> is in the correct format and need not do any normalisation or
> checking on it. You can assume it is correct, if not, server routines
> to compare filenames (case sensitive or case insensitive) will fail
> to match.

This is true, we could do this, but I would rather not trust that all
clients will be well behaved, particularly considering that the current
NFSv4 draft does not specify a normalization that clients MUST use for
case-sensitive filenames.

> But you always still have to handle the encoding used in the file system
> on the server. If the file server uses EBCDIC in the file system,
> the NFS server has to convert between EBCDIC and UCS form C.

This is an implementation-specific issue and has little or nothing to do
with the NFSv4 spec.

Legacy filesystems (as has been pointed out in this thread) will be
particularly problematic because a mixture of codesets can be, and is,
used on legacy filesystems to begin with. In fact, the legacy issue is
present even outside the context of NFSv4.

> I can see no way to avoid that (except doing the bad solution
> in DNS where everything is encoded using ASCII characters forcing every
> application handling file names on the system to be rewritten).
> And as you have to do that, the simplest is to have only one
> standard character set and form to convert to/from.

The solution is to use only UTF-8 on disk for filenames for all new
filesystems and give users conversion tools for legacy systems.

Nico
--
This archive was generated by hypermail 2.1.2 : 03/04/05-01:50:51 AM Z CST