Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]

From: Dan Oscarsson (Dan.Oscarsson@kiconsulting.se)
Date: 01/28/03-10:54:37 AM Z


Message-Id: <200301281651.h0SGpqn28528@malmo.trab.se>
Date: Tue, 28 Jan 2003 17:54:37 +0100 (CET)
From: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]

>Normalization form C ALWAYS starts with decomposition, that is,
>normalization form D.
>
>Any software which can perform normalization form C can also perform
>normalization form D.

That depends on what you normalise from. If you start with ISO 8859-1
or one of the other ISO 8859 character sets, then converting and
normalising into UCS form C is easily done without any knowledge of
form D.

As a very large number of the character sets in use today are, in
Unicode terms, "precomposed", translation to UCS form C can be done
without any form D normalisation.

If you define from the beginning that the final result is UCS form C,
conversion, input and normalisation can very often be done directly.
There is no need to be able to handle normalisation form D.
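
As a minimal sketch of that direct route (Python is used only for
illustration, and the function name latin1_to_nfc is hypothetical):
every ISO 8859-1 byte maps to exactly one code point below U+0100, and
those are all already in form C, so a plain table lookup is enough. The
call to unicodedata.normalize is there only to check the claim; it is
not part of the conversion.

```python
import unicodedata

def latin1_to_nfc(raw: bytes) -> str:
    """Convert ISO 8859-1 bytes to UCS text; the result is already in form C."""
    # Each byte value is its own code point, so decoding is a 1:1 mapping.
    return raw.decode("iso-8859-1")

# Check only (not part of the conversion): normalising to form C
# leaves every ISO 8859-1 character unchanged.
all_latin1 = latin1_to_nfc(bytes(range(256)))
assert unicodedata.normalize("NFC", all_latin1) == all_latin1
```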

>
>> But we cannot require it in storage, only on the wire.
>
>In storage we must leave the fileserver free to do as it pleases, but
>for one restriction: it must use a reasonably canonical form in storage,
>otherwise equal filenames with unequal encodings could be allowed.
>
>I believe that this is what the draft specifies.  An on-the-wire
>normalization form specification would be an optimization, but is not
>absolutely necessary.

OK. What will happen if we do not require it?
I assume the same as that which happens now with NFSv3.
I have a user working on a Mac who has mounted a file system
through NFSv3 from our Unix machine. When a file name is created
on the Mac it is written as a UTF-8 encoded name in some form of Unicode.
When I look at the same file from my Unix machine the name is a mess.
All my software on Unix uses ISO 8859-1.
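
The mess can be reproduced with a short sketch. It assumes that the
name arrives as UTF-8 in decomposed form (the exact form the Mac client
uses is not known here), and the file name itself is made up.

```python
import unicodedata

name = "bl\u00e5b\u00e4r.txt"  # hypothetical file name "blåbär.txt"

# Assumption: the Mac client sends decomposed (form D) text encoded as UTF-8.
on_wire = unicodedata.normalize("NFD", name).encode("utf-8")

# The Unix side stores the bytes untouched and its tools read them as ISO 8859-1.
seen_on_unix = on_wire.decode("iso-8859-1")

print(len(name), len(on_wire))  # the byte string is longer than the name
print(seen_on_unix == name)     # False: the name no longer reads correctly
```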

If instead the Mac NFS client knew that ISO 8859-1 was the standard
to be used, it could translate the names when sending them
to the Unix system and translate them back when reading from the Unix
host.
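
What such a client could do is sketched below. The function names are
hypothetical, and the client is assumed to hold names internally as
decomposed Unicode. Characters outside ISO 8859-1 would need a policy
that this sketch does not define; here they simply raise an error.

```python
import unicodedata

def client_to_server(name: str) -> bytes:
    """Translate a client-side name to the server's ISO 8859-1 bytes."""
    # Compose first, so "a" + combining ring becomes the single character "å".
    return unicodedata.normalize("NFC", name).encode("iso-8859-1")

def server_to_client(raw: bytes) -> str:
    """Translate ISO 8859-1 bytes from the server back to the client's form."""
    # Assumption: the client prefers decomposed (form D) text internally.
    return unicodedata.normalize("NFD", raw.decode("iso-8859-1"))
```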


>Normalization form C is limited to composed characters defined in
>Unicode 3.0 (we're past 3.0); as such it really means "composed for
>these characters, decomposed for everything else" - so why not just
>decomposed then?

Normalisation form C is for the current version of Unicode (3.2).
Form C keeps most characters that were already "precomposed" (in
legacy character sets) in that form. People are used to working with
precomposed characters.

>
>I don't think encoding length is that big a deal, but cycles spent in
>normalization, space dedicated to normalization data structures, *that*
>is a big deal.
>
>This is why I'm for form D (on the wire as an optimization).

Why would form D result in smaller tables?
Most text today is already in precomposed form.

All programs I have (and all open source I have fetched) handle
text using precomposed characters.

>
>> Yes, it may result in additional code in servers, but many system can
>> create very efficient code to convert between legacy character set
>> and UCS normalising form C. So I think it will not be that expensive.
>
>I have no figures close at hand, but I don't think that Unicode
>normalization data structures and code are small (remember, we're
>talking about kernel constraints here, complete with small stacks).

I cannot say how small they can be, but I think they need not be that
big.

But if we do as is intended for the web and DNS, and have the
normalisation done by the sender, the server can assume that incoming
data on the wire is in the correct format and need not do any
normalisation or checking of it. It can assume the data is correct;
if it is not, the server routines that compare filenames (case
sensitive or case insensitive) will simply fail to match.
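
A toy illustration of that behaviour, assuming a server that keys its
directory on the raw bytes it receives (the directory, create and
lookup names are made up for this sketch and do not come from any NFS
implementation):

```python
import unicodedata

# A toy directory keyed by the raw bytes received on the wire.
directory = {}

def create(wire_name: bytes) -> None:
    directory[wire_name] = "file data"

def lookup(wire_name: bytes) -> bool:
    # Plain byte comparison; the server does no normalisation of its own.
    return wire_name in directory

good = unicodedata.normalize("NFC", "bl\u00e5b\u00e4r.txt").encode("utf-8")
bad = unicodedata.normalize("NFD", "bl\u00e5b\u00e4r.txt").encode("utf-8")

create(good)
print(lookup(good))  # True
print(lookup(bad))   # False: a sender that skipped normalisation gets no match
```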

But you still always have to handle the encoding used in the file
system on the server. If the file server uses EBCDIC in the file
system, the NFS server has to convert between EBCDIC and UCS form C.
I can see no way to avoid that (except by adopting the bad solution
used in DNS, where everything is encoded using ASCII characters,
forcing every application that handles file names on the system to be
rewritten). And as you have to do that, the simplest is to have only
one standard character set and form to convert to/from.
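
A sketch of such a conversion pair, assuming Python's cp500 codec as a
stand-in for whatever EBCDIC code page the file system really uses, and
UTF-8 as the encoding of UCS on the wire (both function names are
hypothetical):

```python
import unicodedata

# Assumption: cp500 stands in for the server's actual EBCDIC code page.
FS_CODEC = "cp500"

def fs_to_wire(stored: bytes) -> bytes:
    """File system bytes (EBCDIC) to UCS form C, encoded as UTF-8."""
    text = stored.decode(FS_CODEC)
    return unicodedata.normalize("NFC", text).encode("utf-8")

def wire_to_fs(wire: bytes) -> bytes:
    """UCS form C from the wire (UTF-8) to file system bytes (EBCDIC)."""
    text = unicodedata.normalize("NFC", wire.decode("utf-8"))
    return text.encode(FS_CODEC)
```

With a single agreed form on the wire, each server needs only this one
pair of routines, whatever encoding its file system uses.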

   Dan

