From: Noveck, Dave (Dave.Noveck@netapp.com)
Date: 01/24/03-11:39:39 AM Z
Message-ID: <C8CF60CFC4D8A74E9945E32CF096548A072A3F@SILVER.nane.netapp.com>
From: "Noveck, Dave" <Dave.Noveck@netapp.com>
Subject: RE: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis- 05 draft]
Date: Fri, 24 Jan 2003 09:39:39 -0800
> Oh boy. I've been thinking about the Unicode issues as well.
>
> My thoughts:
>
> - By specifying _a_ Unicode normalization form for filename components
> on the wire we can avoid the need to normalize strings in the kernel.
>
> Checking that inputs are normalized is a lot simpler and less
> resource consuming than actually performing normalization.
>
> Obviously, changing utf8str_cs to specify a normalization form may
> require a new error code and may be an incompatible change.
As far as the error code goes, I would think you could use NFS4ERR_BADNAME.
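
To illustrate the "check rather than normalize" idea quoted above, here is a minimal sketch. It assumes form C as the required form purely for illustration (the thread has not settled on one), the function name check_component is hypothetical, and the status values are plain strings rather than real protocol codes. It relies on unicodedata.is_normalized, which needs Python 3.8 or later.

```python
import unicodedata

def check_component(name: str, form: str = "NFC") -> str:
    """Hypothetical server-side check: accept a name only if it is
    already in the required normalization form; never rewrite it."""
    if unicodedata.is_normalized(form, name):
        return "NFS4_OK"
    return "NFS4ERR_BADNAME"

composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(check_component(composed))           # already in form C
print(check_component(decomposed))         # not in form C, so rejected
print(check_component(decomposed, "NFD"))  # but it is valid form D
```

The server never has to build a normalized copy of the string; it only has to answer yes or no.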
> If the WG cannot reach consensus about such a change then a
> recommendation that a given normalization form be used should help
> prevent unnecessary normalization operations by NFSv4 clients/servers
> that heed the recommendation (a norm. form could even be
> "negotiated", but let's not go there).
My worry about this is the whole issue of existing file systems and
their contents. I'm sure that there are file systems out there that
do not normalize and thus there may be directories which contain multiple
files that would map to the same string under any given normalization
form. If you impose a normalization form, then you have the issue of
dealing with that legacy data.
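
The legacy-data worry can be made concrete with a small sketch. The directory listing below is invented for illustration, find_collisions is a hypothetical helper, and form C is again only an assumed choice.

```python
import unicodedata
from collections import defaultdict

def find_collisions(names, form="NFC"):
    """Group existing names by their normalized form and return only the
    groups where two or more distinct names map to the same string."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize(form, name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Invented listing: two spellings of the same visible name plus one plain name.
listing = ["caf\u00e9", "cafe\u0301", "readme"]
collisions = find_collisions(listing)

for normalized, members in collisions.items():
    print(repr(normalized), "<-", [ascii(m) for m in members])
```

Any non-empty result is a directory that a server imposing that form would somehow have to reconcile.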
> - I prefer normalization forms D or KD. Normalization forms C and KC
> involve normalization to form D first (decomposition), followed by
> recomposition (but only using composed codepoints from Unicode
> version 3.0).
At least I feel the need for a "Normalization Forms for Dummies" document.
Maybe other working group members do as well. Any pointers to something
that will explain this stuff to those who have not already immersed
themselves in this area?
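
As a rough starting point, the four forms can be compared on one short string. This is only a sketch using Python's unicodedata module; the sample string is an arbitrary choice, and the Unicode version behind the results is whatever the local Python ships with, not the version any draft refers to.

```python
import unicodedata

# Sample: precomposed 'e with acute' followed by SUPERSCRIPT TWO.
sample = "\u00e9\u00b2"

def code_points(s):
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]

forms = {form: unicodedata.normalize(form, sample)
         for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, value in forms.items():
    print(form, code_points(value))

# D  decomposes the accent:        U+0065 U+0301 U+00B2
# C  decomposes, then recomposes:  U+00E9 U+00B2
# KD also replaces the superscript with a plain digit: U+0065 U+0301 U+0032
# KC does the same, then recomposes the accent:        U+00E9 U+0032
```

The D and C forms differ only in representation, while the K forms also discard distinctions such as superscript versus plain digit.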
> All we need for NFSv4 is a canonical normalization form; form D being
> lighter weight than form C makes form D more desirable.
>
> On the other hand, the native norm. form produced by input mechanisms
> at the clients may be preferred by this WG, whatever it may be, as
> long as all/most clients are consistent.
Dream on.
By the way, does anybody know what CIFS does about the normalization issue?
> By using the normalization
> form imposed at the earliest stage possible we may avoid the need to
> re-normalize most of the time and get away with checking for
> normalized strings in the libraries and in the kernel with occasional
> normalization in the libraries.
I think my point is that the earliest possible point was when the files were
created, i.e. that point may be long gone.
> - I'm not sure that I care about avoiding the K forms (as suggested by
> Dan Oscarsson) as I see the K forms as merely slightly reducing the
> available namespace in exchange for reducing the scope for confusion.
>
> The typical example of a K normalization is conversion of ligatures
> such as "fi" or "ae" (one codepoint each) to the visually related
> codepoints that make them up ('f' and 'i' for "fi" and 'a' and 'e'
> for "ae"); there may be other form K substitutions that may actually
> be offensive to speakers of live languages, so further study of the
> matter may be warranted. But then, this is a network filesystem
> protocol, some compromises have to be made.
Although I have not studied this area and so I may be missing the
justification for some of this, this strikes me as strange. If
someone names a file with the ligature fi, then maybe he did it for a
reason, such as the fact that it is about the ligature fi. He could
have named it ordinary fi, but chose not to for his own reasons. It
seems weird to change it to ordinary fi, and a file with that name may also exist.
Visual similarity seems a strange basis to map characters. Are you
going to map Cyrillic Ve to Roman B, because they look the same?
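
What form KC actually does to these cases can be checked directly. The sketch below uses Python's unicodedata with whatever Unicode version the local Python ships; the file names are invented. In that data, the fi ligature (U+FB01) has a compatibility mapping, while the ae letter (U+00E6) and Cyrillic Ve (U+0412) have none, so visual similarity alone is not what drives the K forms.

```python
import unicodedata

def nfkc(s):
    """Apply normalization form KC."""
    return unicodedata.normalize("NFKC", s)

ligature_name = "\ufb01le"   # U+FB01 LATIN SMALL LIGATURE FI + "le"
plain_name = "file"

# The ligature is replaced, so both names collapse to the same string.
print(nfkc(ligature_name) == plain_name)   # True

ae_letter = "\u00e6"         # LATIN SMALL LETTER AE
cyrillic_ve = "\u0412"       # CYRILLIC CAPITAL LETTER VE
latin_b = "B"

# Neither of these has a compatibility mapping, so KC leaves them alone.
print(nfkc(ae_letter) == ae_letter)        # True
print(nfkc(cyrillic_ve) == latin_b)        # False
```

So the collision concern is real for the fi ligature, since a KC-normalized name and an existing plain "file" would become the same name, but look-alike letters from different scripts stay distinct.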
> Nico
> --
On Fri, Jan 24, 2003 at 09:14:06AM -0600, Spencer Shepler wrote:
>
> Forwarding on behalf of Dan.
>
> --
> Spencer
>
> Date: Sat, 18 Jan 2003 10:45:13 +0100 (CET)
> From: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
> Reply-To: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
> Subject: Comments on NFSv4 rfc3010bis-05 draft
> To: spencer.shepler@sun.com
> X-Mailer: dtmail 1.3.0 @(#)CDE Version 1.5 SunOS 5.9 sun4u sparc
>
> This is a comment on draft-ietf-nfsv4-rfc3010bis-05
>
> I have for many years worked with internationalisation and when I
> look at the internationalisation section 11 in the draft I do not
> understand the choice made.
>
> First you have to separate the format used in the protocol from
> the way to match strings in server or client. It is somewhat
> unclear in the document.
>
> Looking at the most important string: utf8str_cs which is used in
> naming components and pathnames, I see that you do not specify a
> normalisation form. Why not?
>
> To create simple and efficient implementations it is very good
> if all strings are normalised. As names are important to people,
> normalisation form KC cannot be used. Instead normalisation
> form C MUST be used. This will allow all special forms of a character
> that a person wants to have in a name to be preserved. At the same time
> it will give a compact format to transmit over the net and ONE
> single format to decode/encode in implementations.
> The "stringprep" document was created to handle matching of
> names in DNS, it is not generally suitable.
>
> If utf8str_cs strings need to be compared case insensitively,
> you have to use form KC and single character lower casing.
> The case mapping in stringprep will result in names that
> should be different being matched as equivalent (this is due
> to going too far in case mapping).
> The table B.2 which the draft says to use when doing
> case insensitive matching is only useful if normalisation form KC
> has been used before. It cannot be used on unnormalised or
> form C text.
>
> As an implementor of programs handling UCS I want to have
> a simple efficient export/import of text as well as a good
> case insensitive matching. By always requiring normalisation
> form C I get simpler code and that form will never destroy
> any data in the text. In addition to form C I would
> prefer to have most of form KC applied, but only the part
> that equivalences equivalent characters. Form KC unfortunately
> does equivalencing of characters that are not equivalent and
> forces some to be decomposed.
>
> I am sure that not mandating a normalisation form in NFSv4
> will result in more complex code and more failures due to
> mismatch between client/server handling.
> So why not normalise?
>
> Regards,
>
> Dan
>
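
Dan's point about case mapping "going too far" can be illustrated with a small sketch. It uses Python's str.casefold() as a stand-in for a full case-folding table of the kind stringprep provides, and str.lower() as an approximation of simple lower casing. Neither is the stringprep table itself, and the two names are invented examples.

```python
import unicodedata

def match_lower(a, b):
    """Compare after form KC plus lower casing (approximates simple mapping)."""
    return (unicodedata.normalize("NFKC", a).lower()
            == unicodedata.normalize("NFKC", b).lower())

def match_fold(a, b):
    """Compare after form KC plus full case folding (may expand characters)."""
    return (unicodedata.normalize("NFKC", a).casefold()
            == unicodedata.normalize("NFKC", b).casefold())

name_a = "Stra\u00dfe"   # contains LATIN SMALL LETTER SHARP S
name_b = "STRASSE"

print(match_lower(name_a, name_b))  # False: sharp s is kept as-is
print(match_fold(name_a, name_b))   # True: sharp s is folded to "ss"
```

Whether the second result is a feature or a bug is exactly the disagreement in this thread: full folding treats the two names as the same file, lower casing keeps them distinct.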