From: Nicolas Williams (Nicolas.Williams@sun.com)
Date: 01/24/03-10:45:26 AM Z
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Message-ID: <20030124104526.R16765@binky.central.sun.com>
Oh boy. I've been thinking about the Unicode issues as well.
My thoughts:
- By specifying _a_ Unicode normalization form for filename components
on the wire we can avoid the need to normalize strings in the kernel.
Checking that inputs are normalized is a lot simpler and less
resource-intensive than actually performing normalization.
Obviously, changing utf8str_cs to specify a normalization form may
require a new error code and may be an incompatible change.
If the WG cannot reach consensus about such a change then a
recommendation that a given normalization form be used should help
prevent unnecessary normalization operations by NFSv4 clients/servers
that heed the recommendation (a norm. form could even be
"negotiated", but let's not go there).
- I prefer normalization forms D or KD. Normalization forms C and KC
involve decomposition first (to form D or KD, respectively), followed
by recomposition (but only using composed codepoints from Unicode
version 3.0).
All we need for NFSv4 is a canonical normalization form; form D being
lighter weight than form C makes form D more desirable.
On the other hand, the native norm. form produced by input mechanisms
at the clients may be preferred by this WG, whatever it may be, as
long as all/most clients are consistent. By using the normalization
form imposed at the earliest stage possible we may avoid the need to
re-normalize most of the time and get away with checking for
normalized strings in the libraries and in the kernel with occasional
normalization in the libraries.
- I'm not sure that I care about avoiding the K forms (as suggested by
Dan Oscarsson) as I see the K forms as merely slightly reducing the
available namespace in exchange for reducing the scope for confusion.
The typical example of a K normalization is conversion of ligatures
such as "fi" (one codepoint, U+FB01) to the visually related
codepoints that make them up ('f' and 'i'); "ae" (U+00E6), by
contrast, has no compatibility decomposition and is left alone. There
may be other form K substitutions that may actually
be offensive to speakers of live languages, so further study of the
matter may be warranted. But then, this is a network filesystem
protocol; some compromises have to be made.
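
To make the check-versus-normalize distinction and the effect of the K
forms concrete, here is a minimal sketch using Python's standard
unicodedata module (Python 3.8 or later for is_normalized). The helper
name is_wire_normalized and the sample strings are illustrative
assumptions, not anything taken from the draft.

```python
import unicodedata

def is_wire_normalized(name: str, form: str = "NFD") -> bool:
    """Return True if name is already in the given normalization form.

    Hypothetical helper: a receiver that only needs to validate input
    can call this instead of producing a normalized copy.
    """
    return unicodedata.is_normalized(form, name)

composed = "caf\u00e9"      # 'e' with acute accent as a single codepoint
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Form D is the decomposed spelling; form C recomposes it.
nfd_of_composed = unicodedata.normalize("NFD", composed)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

# The K forms additionally replace compatibility characters such as
# the "fi" ligature (U+FB01); the canonical forms leave it alone.
ligature_name = "\ufb01le"
nfc_ligature = unicodedata.normalize("NFC", ligature_name)
nfkc_ligature = unicodedata.normalize("NFKC", ligature_name)
```

With these inputs the decomposed spelling passes the form D check, the
composed one does not, and only the K form turns the ligature name into
plain "file".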
Comments?
Nico
--
On Fri, Jan 24, 2003 at 09:14:06AM -0600, Spencer Shepler wrote:
>
> Forwarding on behalf of Dan.
>
> --
> Spencer
>
> Date: Sat, 18 Jan 2003 10:45:13 +0100 (CET)
> From: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
> Reply-To: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
> Subject: Comments on NFSv4 rfc3010bis-05 draft
> To: spencer.shepler@sun.com
> X-Mailer: dtmail 1.3.0 @(#)CDE Version 1.5 SunOS 5.9 sun4u sparc
>
> This is a comment on draft-ietf-nfsv4-rfc3010bis-05
>
> I have for many years worked with internationalisation and when I
> look at the internationalisation section 11 in the draft I do not
> understand the choice made.
>
> First you have to separate the format used in the protocol from
> the way to match strings in server or client. It is somewhat
> unclear in the document.
>
> Looking at the most important string: utf8str_cs which is used in
> naming components and pathnames, I see that you do not specify a
> normalisation form. Why not?
>
> To create simple and efficient implementations it is very good
> if all strings are normalised. As names are important to people,
> normalisation form KC cannot be used. Instead normalisation
> form C MUST be used. This will allow all special forms of a character
> that a person wants to have in a name to be preserved. At the same time
> it will give a compact format to transmit over the net and ONE
> single format to decode/encode in implementations.
> The "stringprep" document was created to handle matching of
> names in DNS, it is not generally suitable.
>
> If utf8str_cs strings need to be compared case insensitively,
> you have to use form KC and single character lower casing.
> The case mapping in stringprep will cause names that
> should be different to be matched as equivalent (this is due
> to going too far in case mapping).
> The table B.2 which the draft says to use when doing
> case insensitive matching is only useful if normalisation form KC
> has been used before. It cannot be used on unnormalised or
> form C text.
>
> As an implementor of programs handling UCS I want to have
> a simple efficient export/import of text as well as a good
> case insensitive matching. By always requiring normalisation
> form C I get simpler code and that form will never destroy
> any data in the text. In addition to form C I would
> prefer to have most of form KC applied, but only the part
> that equivalences equivalent characters. Form KC unfortunately
> equivalences characters that are not equivalent and
> forces some to be decomposed.
>
> I am sure that not mandating a normalisation form in NFSv4
> will result in more complex code and more failures due to
> mismatch between client/server handling.
> So why not normalise?
>
> Regards,
>
> Dan
>
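
One plausible instance of the case-mapping concern raised in the quoted
message is the German sharp s, which stringprep table B.2 folds to
"ss". The sketch below compares that with form KC followed by simple
lower casing, using Python's standard stringprep and unicodedata
modules. The helper names are hypothetical, and str.lower() is only an
approximate stand-in for "single character lower casing"; whether this
is the exact case Dan has in mind is an assumption.

```python
import stringprep
import unicodedata

def fold_simple(name: str) -> str:
    """Form KC, then lower casing via str.lower() (approximation)."""
    return unicodedata.normalize("NFKC", name).lower()

def fold_b2(name: str) -> str:
    """Map each character through stringprep table B.2, then form KC."""
    mapped = "".join(stringprep.map_table_b2(ch) for ch in name)
    return unicodedata.normalize("NFKC", mapped)

name_a = "Stra\u00dfe"   # spelled with sharp s (U+00DF)
name_b = "Strasse"       # spelled with "ss"

# Simple lower casing keeps the two names distinct.
simple_equal = fold_simple(name_a) == fold_simple(name_b)

# Table B.2 maps U+00DF to "ss", so the two names collide.
b2_equal = fold_b2(name_a) == fold_b2(name_b)
```

Under these assumptions the two names stay distinct with simple lower
casing and become equivalent under table B.2.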