Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]


From: Nicolas Williams (Nicolas.Williams@sun.com)
Date: 01/24/03-10:45:26 AM Z


Date: Fri, 24 Jan 2003 10:45:26 -0600
From: Nicolas Williams <Nicolas.Williams@sun.com>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Message-ID: <20030124104526.R16765@binky.central.sun.com>

Oh boy.  I've been thinking about the Unicode issues as well.

My thoughts:

 - By specifying _a_ Unicode normalization form for filename components
   on the wire we can avoid the need to normalize strings in the kernel.

   Checking that inputs are normalized is a lot simpler and less
   resource consuming than actually performing normalization.

   Obviously, changing utf8str_cs to specify a normalization form may
   require a new error code and may be an incompatible change.

   If the WG cannot reach consensus about such a change then a
   recommendation that a given normalization form be used should help
   prevent unnecessary normalization operations by NFSv4 clients/servers
   that heed the recommendation (a norm. form could even be
   "negotiated", but let's not go there).

 - I prefer normalization forms D or KD.  Normalization forms C and KC
   involve normalization to form D first (decomposition), followed by
   recomposition (but only using composed codepoints from Unicode
   version 3.0).

   All we need for NFSv4 is a canonical normalization form; form D being
   lighter weight than form C makes form D more desirable.

   On the other hand, the native norm. form produced by input mechanisms
   at the clients may be preferred by this WG, whatever it may be, as
   long as all/most clients are consistent.  By using the normalization
   form imposed at the earliest stage possible we may avoid the need to
   re-normalize most of the time and get away with checking for
   normalized strings in the libraries and in the kernel with occasional
   normalization in the libraries.

 - I'm not sure that I care about avoiding the K forms (as suggested by
   Dan Oscarsson) as I see the K forms as merely slightly reducing the
   available namespace in exchange for reducing the scope for confusion.

   The typical example of a K normalization is conversion of ligatures
   such as "fi" or "ae" (one codepoint each) to the visually related
   codepoints that make them up ('f' and 'i' for "fi" and 'a' and 'e'
   for "ae"); there may be other form K substitutions that may actually
   be offensive to speakers of live languages, so further study of the
   matter may be warranted.  But then, this is a network filesystem
   protocol; some compromises have to be made.
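
To make the points above concrete, here is a minimal sketch using Python's
standard unicodedata module. The helper names check_component and
normalize_component are hypothetical, not anything from the draft. The sketch
contrasts checking a name against a normalization form with actually
normalizing it, and shows a canonical (D) versus a compatibility (KD)
decomposition. Only the "fi" ligature (U+FB01) is used for the K example; in
Python's Unicode data, at least, "ae" (U+00E6) carries no compatibility
decomposition and passes through form KD unchanged.

```python
import unicodedata  # is_normalized() requires Python 3.8 or later


def check_component(name: str, form: str = "NFD") -> bool:
    """Return True if `name` is already in the given normalization form.

    Hypothetical helper: this is the cheap "verify only" step a server
    could apply to names received on the wire.
    """
    return unicodedata.is_normalized(form, name)


def normalize_component(name: str, form: str = "NFD") -> str:
    """Return `name` converted to the given normalization form.

    Hypothetical helper: the more expensive step, ideally done once,
    as early as possible on the client side.
    """
    return unicodedata.normalize(form, name)


composed = "caf\u00e9"      # 'e with acute' as one code point (form C)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute (form D)
ligature = "\ufb01le"       # 'fi' ligature (U+FB01) followed by "le"

# A receiver that requires form D accepts one spelling and rejects the other.
print(check_component(decomposed, "NFD"))
print(check_component(composed, "NFD"))

# Canonical forms keep the ligature; compatibility forms replace it.
print(normalize_component(ligature, "NFD") == ligature)
print(normalize_component(ligature, "NFKD"))
```

A receiver that only calls check_component can reject a non-normalized name
with an error instead of carrying normalization logic itself, which is the
trade-off described in the first point.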

Comments?

Nico
-- 

On Fri, Jan 24, 2003 at 09:14:06AM -0600, Spencer Shepler wrote:
> 
> Forwarding on behalf of Dan.
> 
> -- 
> Spencer
> 

> Date: Sat, 18 Jan 2003 10:45:13 +0100 (CET)
> From: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
> Reply-To: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
> Subject: Comments on NFSv4 rfc3010bis-05 draft
> To: spencer.shepler@sun.com
> X-Mailer: dtmail 1.3.0 @(#)CDE Version 1.5 SunOS 5.9 sun4u sparc 
> 
> This is a comment on draft-ietf-nfsv4-rfc3010bis-05
> 
> I have for many years worked with internationalisation and when I
> look at the internationalisation section 11 in the draft I do not
> understand the choice made.
> 
> First you have to separate the format used in the protocol from
> the way to match strings in server or client. It is somewhat
> unclear in the document.
> 
> Looking at the most important string: utf8str_cs which is used in
> naming components and pathnames, I see that you do not specify a
> normalisation form. Why not?
> 
> To create simple and efficient implementations it is very good
> if all strings are normalised. As names are important to people,
> normalisation form KC cannot be used. Instead normalisation
> form C MUST be used. This will allow all special forms of a character
> that a person wants to have in a name to be preserved. At the same
> time it will give a compact format to transmit over the net and ONE
> single format to decode/encode in implementations.
> The "stringprep" document was created to handle matching of
> names in DNS, it is not generally suitable.
> 
> If utf8str_cs strings need to be compared case insensitively,
> you have to use form KC and single character lower casing.
> The case mapping in stringprep will cause names that should be
> different to be matched as equivalent (this is due to going too
> far in case mapping).
> The table B.2 which the draft says to use when doing
> case insensitive matching is only useful if normalisation form KC
> has been used before. It cannot be used on unnormalised or
> form C text.
> 
> As an implementor of programs handling UCS I want to have
> a simple efficient export/import of text as well as a good
> case insensitive matching. By always requiring normalisation
> form C I get simpler code and that form will never destroy
> any data in the text. In addition to form C I would
> prefer to have most of form KC applied, but only the part
> that equivalences equivalent characters. Form KC unfortunately
> equivalences characters that are not equivalent and
> forces some to be decomposed.
> 
> I am sure that not mandating a normalisation form in NFSv4
> will result in more complex code and more failures due to
> mismatch between client/server handling.
> So why not normalise?
> 
> Regards,
> 
>    Dan
> 
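
The case-mapping concern in the quoted message can be illustrated with
Python's standard stringprep module, whose map_table_b2 function implements
stringprep table B.2. The sketch below is an assumption about the kind of
over-matching meant, not something the message spells out: B.2 maps the German
sharp s (U+00DF) to "ss", so two names that differ only in that respect fold to
the same string. The helpers fold_stringprep_b2 and fold_simple are
hypothetical, and fold_simple only approximates "form KC and single character
lower casing" by keeping a character unchanged whenever its lower-case mapping
is not exactly one character.

```python
import stringprep
import unicodedata


def fold_stringprep_b2(name: str) -> str:
    """Fold `name` character by character with stringprep table B.2."""
    return "".join(stringprep.map_table_b2(ch) for ch in name)


def fold_simple(name: str) -> str:
    """Hypothetical alternative: NFKC, then one-to-one lower casing only.

    A character whose lower-case mapping is not a single character is
    left as it is, so the folded string never gains or loses characters
    in the casing step.
    """
    out = []
    for ch in unicodedata.normalize("NFKC", name):
        low = ch.lower()
        out.append(low if len(low) == 1 else ch)
    return "".join(out)


a = "stra\u00dfe"   # spelled with sharp s
b = "strasse"       # spelled with "ss"

# Table B.2 makes the two names compare equal ...
print(fold_stringprep_b2(a) == fold_stringprep_b2(b))

# ... while one-to-one lower casing keeps them distinct.
print(fold_simple(a) == fold_simple(b))
```

Whether such names ought to match is a policy question for the WG; the sketch
only shows that the two folding rules give different answers.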


