From: Nicolas Williams (Nicolas.Williams@sun.com)
Date: 01/24/03-01:31:14 PM Z
Date: Fri, 24 Jan 2003 13:31:14 -0600
From: Nicolas Williams <Nicolas.Williams@sun.com>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Message-ID: <20030124133114.Y16765@binky.central.sun.com>

On Fri, Jan 24, 2003 at 10:41:25AM -0800, Noveck, Dave wrote:
> > Er, no, utf8str_cs requires that a normalization form be used, just not
> > on the wire.  So the problem of legacy filesystems remains even if we do
> > not act to recommend or require a specific normalization form on the
> > wire.
>
> In that case, there doesn't seem to be any reason not to pick one
> form, assuming we can agree on what that is.

Right.

> So it appears that we are supposed to map utf8str_cs "equivalent" strings
> to the same filename, but what is the form of equivalence for which this
> is required, canonical equivalence or compatibility equivalence?  This
> affects the normalization forms that we may choose among.

The draft leaves it to the server implementation.  This means that
servers must be able to normalize, not merely check for normalized
input.

> > As for the legacy filesystems, the problem is far worse than you
> > suggest, since some filesystems allow arbitrary 8-bit data in filenames,
> > including arbitrary non-UTF-8 encodings, without any codeset tagging
> > whatsoever.  But this is not relevant to the matter of whether
> > utf8str_cs should mandate a specific normalization form for filenames on
> > the wire.
>
> OK.  Sigh.

I don't think you are the first to get that blow-to-the-stomach,
sinking feeling.

> > "Normalization form" is a concept specific to Unicode (well, to any
> > codeset which has multiple equivalent encodings for the same character,
> > which mostly means Unicode).
>
> It hurts when you do that, so my advice would have been "don't do that".

We're not going to get anywhere arguing that Unicode should or should
not have had more than one normalization form :)

> > But I insist: the legacy issue has nothing to do with selecting a single
> > required normalization form for filenames on the wire.  The legacy
> > filesystem issues should be addressed in a separate thread.

See above.

>
> At this point I'm thinking a suicide counseling hotline would be the
> appropriate forum, but it'll pass.

It'll pass.  There will be more than one vendor / implementor who will
have to deal with this, and in the context of more than one IETF
protocol (Kerberos i18n comes to mind - look at the KRB WG archives for
a good example of the legacy issues).

> > Think of a group of users using a shared directory: one user may know
> > how to enter a ligature, many may not even be able to tell that some
> > filename uses a ligature - they may not know what a ligature is, how to
> > recognize one, much less how to type one in.  Doesn't it then make sense
> > to use the K compatibility mappings for ligatures?  Obviously, for the
> > single-user case the answer is "no", but for the multi-user case the
> > answer is not clear.
>
> But I may not know how to enter a Cyrillic Veh, either, or recognize the
> difference between that and a B, cause there isn't one.  If there are
> other cyrillic characters I might have a clue, but I kind of think that
> asking the file system to deal with that problem would not be appropriate,
> as it would also not be appropriate to ask it to deal with my confusion
> about ligatures.

Indeed.  I'm not attached to the K forms.  It's certainly easiest to
just specify normalization form D - that would require the least space
for Unicode-related data structures and the fewest CPU cycles to
implement (the C forms always start with the D forms).

> Anyway, if I have a choice, I would go for something that doesn't, on
> dubious grounds, add additional equivalences, to what we are forced to
> have.  So I would prefer fundamental equivalence to canonical equivalence
> to compatibility equivalence, consistent with what the existing
> spec requires.  And on the grounds you offered, I would prefer D to C
> and thus KD to KC.  Given the choice between D and KD (do I have that?),
> I would go for D and leave compatibility equivalence out of the picture.

Agreed.

Cheers,

Nico
--
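
The difference between canonical equivalence (forms D and C) and
compatibility equivalence (forms KD and KC) discussed in this thread can
be sketched with Python's standard unicodedata module.  This is only an
illustration, not anything taken from the draft or from an NFSv4
implementation: the helper name equivalent() and the sample strings are
assumptions chosen for the example.

```python
import unicodedata


def equivalent(a: str, b: str, form: str) -> bool:
    """Return True if a and b are identical after normalizing both to `form`.

    Hypothetical helper for illustration; `form` is one of
    "NFD", "NFC", "NFKD", "NFKC".
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonically equivalent: precomposed vs. decomposed e-acute.
composed = "caf\u00e9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Equivalent only under the K (compatibility) mappings: the "fi" ligature.
ligature = "\ufb01le"        # U+FB01 LATIN SMALL LIGATURE FI, then "le"
plain = "file"

# Look-alikes that no normalization form unifies.
cyrillic_ve = "\u0412"       # U+0412 CYRILLIC CAPITAL LETTER VE
latin_b = "B"

print(equivalent(composed, decomposed, "NFD"))    # True
print(equivalent(ligature, plain, "NFD"))         # False
print(equivalent(ligature, plain, "NFKD"))        # True
print(equivalent(cyrillic_ve, latin_b, "NFKD"))   # False

# Form C is the canonical composition of form D, so composing the
# decomposed string yields the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Under form D the two spellings of "café" map to the same name while the
ligature stays distinct from "file"; only the K forms fold the ligature,
and neither treats the Cyrillic letter as a Latin B.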
This archive was generated by hypermail 2.1.2 : 03/04/05-01:50:48 AM Z CST