Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]


From: Dan Oscarsson (Dan.Oscarsson@kiconsulting.se)
Date: 01/26/03-10:31:46 AM Z


Message-Id: <200301261629.h0QGT2n27419@malmo.trab.se>
Date: Sun, 26 Jan 2003 17:31:46 +0100 (CET)
From: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]

As I started this, I will try to give you some answers.
There is a lot to say, and now that I am trying to write it down I
expect I will forget many important aspects or facts, but hopefully it
will give you some answers.
(For your background:  I have worked with internationalisation for a
long time, both implementing it in software (like file sharing from
Unix with MS Windows and MacOS) and in IETF groups (like the
internationalisation of DNS).)
-


The international standard ISO 10646 defines UCS (Universal Character
Set).  The Unicode definition and ISO 10646 are kept in sync.  The
Unicode group adds more semantics for handling characters, which are
not covered by the ISO standard.  Unfortunately, during its creation
many ways to encode the same character were introduced.  This is due
to how characters were included from legacy character sets and because
of the combining characters the Unicode group wanted to use
everywhere.  Because of this there is more than one way to encode some
of the characters, making it difficult to handle matching and sorting.
There are cases where the same character has two code points (binary
values), for example "A with ring above" and "Angstrom sign".  And
there are cases where a character can be represented by a single code
point or by several using combining code points, for example "A with
ring above" or "A" plus "combining ring above".  To be able to compare
or sort characters you need to normalise the character encoding.  To
normalise means to get only ONE representation for each character.
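
A minimal sketch of the example above, using Python's standard
unicodedata module (the choice of Python and of form C here is only
for illustration).  It shows the three encodings of "A with ring
above" comparing as different until they are normalised:

```python
import unicodedata

# Three encodings of what a reader sees as the same letter.
precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN
combining = "A\u030A"    # "A" followed by COMBINING RING ABOVE

variants = [precomposed, angstrom, combining]

# Compared as raw code point sequences, all three differ.
distinct_raw = len(set(variants))

# After normalising (form C in this sketch) one representation is left.
normalised = {unicodedata.normalize("NFC", v) for v in variants}

print(distinct_raw)                  # 3
print(normalised == {"\u00C5"})      # True
```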

Unicode defines four normalisation forms:  D, KD, C and KC.  D stands
for decomposed, C for composed, K for compatibility.  The D forms
decompose, that is, they split each character that can be split into a
base character and one or more combining ones.  The C forms do the
opposite:  they combine sequences of base+combining characters into a
single character (code point).  The K forms add "compatibility"
mappings, where all "compatible" characters are mapped into one
representation (for example:  the ligature "fi" into "f" plus "i").
Unicode and the normalisation forms have several
inconsistencies.  Generally the ASCII characters are treated
inconsistently.  For example:  the letter "a with ring above" can be
decomposed but the letter "i" cannot.  "a with ring above" is a vowel
in Swedish and is as much a letter as "i" is.  And stand-alone accents
in the ASCII range cannot be decomposed, but stand-alone accents in the
upper half of ISO 8859-1 (code values 128-255) can be.
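
To make the four forms concrete, here is a small sketch in Python.
The standard unicodedata module names the forms NFD, NFKD, NFC and
NFKC; the helper `forms` is a hypothetical name introduced only for
this illustration.  It lists the code points each form produces for a
few of the characters discussed above:

```python
import unicodedata

def forms(text):
    """Return the code points of `text` under each normalisation form."""
    return {
        name: [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(name, text)]
        for name in ("NFD", "NFKD", "NFC", "NFKC")
    }

a_ring = forms("\u00E5")   # "a with ring above"
fi_lig = forms("\uFB01")   # the "fi" ligature
grave = forms("\u0060")    # stand-alone grave accent (ASCII)
acute = forms("\u00B4")    # stand-alone acute accent (ISO 8859-1)

print(a_ring["NFD"])    # ['U+0061', 'U+030A']  split by the D forms
print(a_ring["NFC"])    # ['U+00E5']            kept whole by the C forms
print(fi_lig["NFC"])    # ['U+FB01']            untouched without K
print(fi_lig["NFKC"])   # ['U+0066', 'U+0069']  mapped by the K forms
print(grave["NFKD"])    # ['U+0060']            ASCII accent: unchanged
print(acute["NFKD"])    # ['U+0020', 'U+0301']  Latin-1 accent: split
```

The last two lines show the inconsistency mentioned above: the ASCII
accent stays as it is, while the ISO 8859-1 accent is split into a
space plus a combining accent by the K forms.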

The reason I and many others prefer the C form is that it is close to
legacy character sets (ASCII and ISO 8859-1 are true subsets sharing
code points with UCS) and to how characters are handled in many legacy
systems.  It is also compact, resulting in smaller storage and less
data sent over the wire.  The W3C has selected form C as the one to
be used for the web.  For many, form C is much lighter than form D.
It can easily be handled during client input and closely matches
legacy encodings.  There is no need to first use the D forms
(decomposition) to get text normalised into the compact (C) forms.
It is often very easy to just directly enter characters in form C or
KC when they are input into the system.  The decomposed D form is
probably better for a few languages and for very advanced word
processing systems handling all characters at the same time.  All
systems using ASCII or ISO 8859-1 already have their data in form C.
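
As a rough illustration of the two claims above (ISO 8859-1 data is
already in form C, and form C is more compact than form D), here is a
sketch in Python.  The sample string is made up for the example:

```python
import unicodedata

# Bytes as they might be stored on a system using ISO 8859-1.
# The sample text is "Malmö på våren" (hypothetical example data).
legacy_bytes = b"Malm\xf6 p\xe5 v\xe5ren"

# ISO 8859-1 code values are the same as the UCS code points.
text = legacy_bytes.decode("iso-8859-1")

# Normalising to form C changes nothing: the data is already in form C.
already_c = unicodedata.normalize("NFC", text) == text

# Size on the wire as UTF-8, composed versus decomposed.
size_c = len(unicodedata.normalize("NFC", text).encode("utf-8"))
size_d = len(unicodedata.normalize("NFD", text).encode("utf-8"))

print(already_c)        # True
print(size_c, size_d)   # 17 20
```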

A little about why I am not sure the K form is good:  The K forms
remove "compatible" characters and replace them by ONE single code
representation.  But what does "compatible" mean?  In one of the
replies to my original comments there is talk about ligatures.
Yes, "fi" is a ligature (a typographical form) and is decomposed by
the K form into "f" and "i".  But "ae" is not a ligature, it is a
letter, and is not decomposed by the K form.  This is as it should be.
But the K form also replaces some characters with a different code
point, which may change the meaning of the character, or replaces a
single code point with several (even in form KC).  This may be
disliked by some people, as a string is changed, and it makes KC not
directly compatible with ISO 8859-1.  Form C does not destroy any
information but will result in multiple representations of one
character.  And multiple representations are very bad.  So if we want
to have a defined "standard" to follow, form KC is probably best
despite resulting in some semantic information being lost (though I
think this only happens for special characters, not those that
normally are used in a name).
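
The trade-off described above can be seen in a short Python sketch.
The helper names `kc`, `c` and `survives_latin1` and the sample
characters are chosen for this illustration only:

```python
import unicodedata

def kc(text):
    return unicodedata.normalize("NFKC", text)

def c(text):
    return unicodedata.normalize("NFC", text)

def survives_latin1(text):
    """True if `text` can still be stored in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

ligature = kc("\uFB01")      # "fi" ligature: becomes "f" plus "i"
letter_ae = kc("\u00E6")     # the letter "ae": left alone
squared = kc("m\u00B2")      # superscript two: becomes a plain "2"
half = kc("\u00BD")          # one code point becomes three, even in KC
micro = "\u00B5"             # MICRO SIGN, part of ISO 8859-1

print(ligature == "fi")              # True
print(letter_ae == "\u00E6")         # True
print(squared)                       # m2 (the "squared" meaning is lost)
print(len(half))                     # 3
print(survives_latin1(c(micro)))     # True
print(survives_latin1(kc(micro)))    # False (mapped to GREEK SMALL LETTER MU)
```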

"stringprep" was created for use with the internationalisation of DNS.
It may not be perfect for all systems.  The W3C has selected form C
to be used instead of KC, which "stringprep" prefers.

- The basic thing I want to say is that for two systems to
interoperate they have to speak the same language.  Inside a system
any language can be used, but when talking to the other system you
have to use the same language.

To avoid complex and CPU-consuming software you want to have
normalised text.  If you do not require that, each NFS server and
client needs to have code to handle unnormalised character strings,
resulting in a lot of code and CPU time.  If instead you have one well
defined character encoding having only one representation of a
character, the code gets much simpler.  Normalisation can be done at
the input end, so servers can expect all data to already be
normalised.  From all I have read and studied, form C or KC is the
preferred choice.  If form C is selected, all "equivalent" characters
should be forbidden to avoid more than one representation of a
character.  I do not want to write code to test for matches with
multiple representations of a character.  (There has been a lot of
discussion of things like Cyrillic characters being equivalent to
Latin ones.  While some look the same and have the same historic
origin, most people do not want to make them "equivalent".  This will
result in confused users, but I see no easy solution to that.)  Not
normalising and just comparing bytes means that a directory can store
the same filename several times, as a character can be represented by
different byte sequences.  But the user may see them as exactly the
same.
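
The duplicate filename problem can be sketched as follows.  This is a
toy model in Python, not NFS code:  the "directory" is just a set of
UTF-8 byte strings, `create` is a hypothetical helper, and form C is
assumed as the normalisation applied at the input end.

```python
import unicodedata

def create(directory, name, normalise):
    """Add `name` to `directory`, which compares names byte by byte."""
    if normalise:
        name = unicodedata.normalize("NFC", name)   # done at the input end
    key = name.encode("utf-8")
    if key in directory:
        raise FileExistsError(name)
    directory.add(key)

composed = "r\u00E4ksm\u00F6rg\u00E5s.txt"
decomposed = unicodedata.normalize("NFD", composed)   # looks identical

# Without normalisation both names are stored: the bytes differ.
raw_dir = set()
create(raw_dir, composed, normalise=False)
create(raw_dir, decomposed, normalise=False)

# With normalisation at input, a plain byte comparison is enough.
clean_dir = set()
create(clean_dir, composed, normalise=True)
try:
    create(clean_dir, decomposed, normalise=True)
    duplicate_rejected = False
except FileExistsError:
    duplicate_rejected = True

print(len(raw_dir), len(clean_dir), duplicate_rejected)   # 2 1 True
```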

- About legacy character sets.  If the protocol says that all
filenames are to be transmitted in the NFS protocol using UCS form KC
(or C), the NFS servers and clients must convert from the character
set encoding used on the system.  Otherwise it will not work well.
For example, on my Solaris system I use ISO 8859-1 as the character
set.  This means that all filenames on my system are encoded using
ISO 8859-1 (Latin-1).  When my files are mounted using NFSv3 on a
MacOS system, filenames get corrupted because there is no standard for
the character set.  If my NFSv3 server had converted the filenames
from ISO 8859-1 into UTF-8 and MacOS had handled that, or if the MacOS
client side had an option telling it that the filenames should be in
ISO 8859-1, it would have worked.  But to avoid everybody having to
implement all character sets in the world, it is easiest to use ONE
encoding on the wire.
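
A sketch of the conversion each end would do, in Python.  The
function names `to_wire` and `from_wire`, the choice of form C as the
wire form, and the sample filename are all assumptions made for this
illustration; the protocol could just as well pick form KC.

```python
import unicodedata

WIRE_FORM = "NFC"   # assumption for this sketch; could be "NFKC"

def to_wire(local_name: bytes, local_charset: str) -> bytes:
    """Convert a filename from the local encoding to the wire format."""
    text = local_name.decode(local_charset)
    return unicodedata.normalize(WIRE_FORM, text).encode("utf-8")

def from_wire(wire_name: bytes, local_charset: str) -> bytes:
    """Convert a wire filename to the local encoding.

    Raises UnicodeEncodeError if the local character set cannot
    represent the name; what to do then is outside this sketch.
    """
    return wire_name.decode("utf-8").encode(local_charset)

# A filename as stored on a system using ISO 8859-1.
solaris_name = b"\xe5rsrapport.txt"
wire = to_wire(solaris_name, "iso-8859-1")

# A client holding the same name decomposed, in UTF-8.
other_name = "a\u030Arsrapport.txt".encode("utf-8")

print(wire)                                              # b'\xc3\xa5rsrapport.txt'
print(to_wire(other_name, "utf-8") == wire)              # True
print(from_wire(wire, "iso-8859-1") == solaris_name)     # True
```

Each system only needs to know its own character set and the single
wire format, not every character set in use elsewhere.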

- Comparing names:  When you do case-insensitive comparison of
characters it gets even more complex.  For example, the "small letter
german sharp s" can be equivalent to "SS" and is treated so by the
"stringprep" document.  But there are German words that have different
meanings if written with "ss" or "small letter german sharp s".  And
Chinese has a set of characters that are equivalent, but those are not
handled by the "stringprep" document.  I think doing one-to-one case
insensitivity after the Unicode tables, combined with the Chinese
SC/TC equivalence, is best (and fastest to implement and use).
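
The difference between the two kinds of case-insensitive comparison
can be sketched in Python.  Here `str.casefold()` is used as a
stand-in for a mapping that turns sharp s into "ss" (it is Python's
full case folding, not the "stringprep" tables themselves), and
`fold_one_to_one` is a hypothetical helper that only accepts mappings
from one code point to one code point.  The Chinese SC/TC equivalence
is not covered by this sketch.

```python
def fold_one_to_one(text):
    """Lower-case each code point, but only when the result is one code point."""
    out = []
    for ch in text:
        lowered = ch.lower()
        out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)

# Two different German words: "Masse" (mass) and "Maße" (measures).
masse = "Masse"
masze = "Ma\u00DFe"

# Full case folding maps sharp s to "ss", so the two words collide.
full_equal = masse.casefold() == masze.casefold()

# One-to-one folding keeps them apart but still ignores plain case.
simple_equal = fold_one_to_one(masse) == fold_one_to_one(masze)
plain_case = fold_one_to_one("MALM\u00D6") == fold_one_to_one("malm\u00F6")

print(full_equal, simple_equal, plain_case)   # True False True
```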

-
In summary:
I think NFSv4 must require ONE encoding of character data.
That is:  UCS normalised using form KC, or form C with forbidden
characters, encoded as UTF-8.  It is the responsibility of the
server/client to convert between local system encoding and the
protocol format.
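
If the protocol required this, a receiver could check incoming names
with something like the following Python sketch.  The function name
`is_valid_wire_name` is hypothetical, form KC is taken as the default
only for illustration, and the list of forbidden characters needed
for the form C variant is not modelled here.

```python
import unicodedata

def is_valid_wire_name(raw: bytes, form: str = "NFKC") -> bool:
    """True if `raw` is valid UTF-8 and already normalised to `form`."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return unicodedata.normalize(form, text) == text

print(is_valid_wire_name("\u00E5.txt".encode("utf-8")))           # True
print(is_valid_wire_name(b"\xe5.txt"))                            # False: raw ISO 8859-1
print(is_valid_wire_name("a\u030A.txt".encode("utf-8")))          # False: decomposed
print(is_valid_wire_name("\uFB01le".encode("utf-8")))             # False under form KC
print(is_valid_wire_name("\uFB01le".encode("utf-8"), "NFC"))      # True under form C
```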

I will try to answer your coming questions, but it may be a day or two
between replies as I have a lot to do at the moment.


   Dan



This archive was generated by hypermail 2.1.2 : 03/04/05-01:50:49 AM Z CST