From: Dan Oscarsson (Dan.Oscarsson@kiconsulting.se)
Date: 01/26/03-10:31:46 AM Z
Message-Id: <200301261629.h0QGT2n27419@malmo.trab.se>
Date: Sun, 26 Jan 2003 17:31:46 +0100 (CET)
From: Dan Oscarsson <Dan.Oscarsson@kiconsulting.se>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]

As I stated, I will try to give you some answers. There is a lot to say, and I expect that when trying to write it down I will forget many important aspects or facts, but hopefully it will give you some answers. (For your background: I have worked with internationalisation for a long time, both implementing it in software (like file sharing from Unix with MS Windows and MacOS) and in IETF groups (like the internationalisation of DNS).)

- The international standard ISO 10646 defines UCS (Universal Character Set). The Unicode definition and ISO 10646 are kept in sync. The Unicode group adds more semantics for handling characters, which are not covered by the ISO standard.

Unfortunately, during its creation many ways to encode the same character were introduced. This is due to how characters were included from legacy character sets and because of the combining characters the Unicode group wanted to use everywhere. Because of this there is more than one way to encode some of the characters, making it difficult to handle matching and sorting.

There are cases where the same character has two code points (binary values), for example "A with ring above" and "Angstrom sign". And there are cases where a character can be represented by a single code point or by several using combining code points, for example "A with ring above" or "A" plus "combining ring above".

To be able to compare or sort characters you need to normalise the character encoding. To normalise means to get only ONE representation for each character. Unicode defines four normalisation forms: D, KD, C and KC. D stands for decomposed, C for composed, K for compatibility.
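
As a small illustration of the point above, here is a sketch using Python's standard `unicodedata` module (my choice of tool, not something taken from the draft). The helper name `normalised_variants` is made up for this example. It takes the three spellings of "A with ring above" mentioned above and checks how many distinct strings remain after each normalisation form is applied. With Python's Unicode tables, each of the four forms brings the three spellings down to a single one.

```python
import unicodedata

# Three ways to write what a user sees as the same letter.
PRECOMPOSED = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212b"      # ANGSTROM SIGN
COMBINING = "A\u030a"    # "A" followed by COMBINING RING ABOVE

VARIANTS = [PRECOMPOSED, ANGSTROM, COMBINING]


def normalised_variants(form):
    """Return the set of distinct strings left after normalising all variants."""
    return {unicodedata.normalize(form, v) for v in VARIANTS}


if __name__ == "__main__":
    # Without normalisation the UTF-8 byte sequences all differ,
    # so a plain byte comparison treats them as three different names.
    print(len({v.encode("utf-8") for v in VARIANTS}), "distinct byte sequences")

    for form in ("NFD", "NFKD", "NFC", "NFKC"):
        result = normalised_variants(form)
        code_points = [[f"U+{ord(c):04X}" for c in s] for s in result]
        print(form, len(result), code_points)
```

The forms are named "NFD", "NFKD", "NFC" and "NFKC" in the Python API; they correspond to the D, KD, C and KC forms in the text.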
The D forms decompose; that is, they split each glyph that can be split into a base character and one or more combining ones. The C forms do the opposite: they combine sequences of base + combining characters into a single character (code point). The K forms add a "compatibility" mapping where all "compatible" characters are mapped into one representation (for example, "micro sign" into "Greek small letter mu").

Unicode and the normalisation forms have several inconsistencies. Generally, the ASCII characters are treated inconsistently. For example, the letter "a with ring above" can be decomposed but the letter "i" cannot. "a with ring above" is a vowel in Swedish and is as much a letter as "i" is. And stand-alone accents in the ASCII range cannot be decomposed, but stand-alone accents in the ISO 8859-1 range (code values 0-255) can be.

The reason I and many others prefer the C form is that it is close to legacy character sets (ASCII and ISO 8859-1 are true subsets sharing code points with UCS) and to how characters are handled in many legacy systems. It is also compact, resulting in smaller storage and less data sent over the wire. The W3C has selected form C as the one to be used for the web.

For many, form C is much lighter than form D. It can easily be handled during client input and closely matches legacy encodings. There is no need to first use the D forms (decomposition) to get text normalised using the compact (C) forms. It is often very easy to just directly enter characters in form C or KC when they are input into the system. The decomposed D form is probably better for a few languages and for very advanced word processing systems handling all characters at the same time. All systems using ASCII or ISO 8859-1 already have their data in form C.

A little about why I am not sure the K form is good: the K forms remove "compatible" characters and replace them with ONE single code representation. But what does "compatible" mean?
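
One way to get a feel for what the "compatible" mappings actually do is to ask a Unicode library. The sketch below uses Python's standard `unicodedata` module; the sample characters and the helper name `compare_forms` are my own picks for illustration. For each sample it shows the result of form C next to form KC, together with the decomposition tag stored in the Unicode tables.

```python
import unicodedata

# Sample characters chosen for illustration only.
SAMPLES = {
    "fi ligature": "\ufb01",
    "letter ae": "\u00e6",
    "micro sign": "\u00b5",
    "fraction one quarter": "\u00bc",
    "superscript two": "\u00b2",
}


def compare_forms(ch):
    """Return (form C result, form KC result) for one character."""
    return (unicodedata.normalize("NFC", ch),
            unicodedata.normalize("NFKC", ch))


def code_points(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        c_form, kc_form = compare_forms(ch)
        # decomposition() returns an empty string when the character has
        # no mapping, and a tagged mapping such as "<compat> ..." otherwise.
        mapping = unicodedata.decomposition(ch) or "(none)"
        print(f"{label}: C = {code_points(c_form)}, "
              f"KC = {code_points(kc_form)}, mapping = {mapping}")
```

In this sample set, form C leaves every character untouched, while form KC rewrites all of them except the letter "ae". Note that the fraction becomes three code points even in form KC, and that the micro sign and the superscript two are both part of ISO 8859-1.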
In one of the replies to my original comments there is talk about ligatures. Yes, "fi" is a ligature (a typographical form) and is decomposed by the K form into "f" and "i". But "ae" is not a ligature; it is a letter, and it is not decomposed by the K form. This is as it should be. But the K form also replaces some characters with a different code point, which may change the meaning of the character, or replaces a single code point with several (even in form KC). This may be disliked by some people, as a string is changed, and it makes KC not directly compatible with ISO 8859-1.

Form C does not destroy any information but will result in multiple representations of one character. And multiple representations are very bad. So if we want to have a defined "standard" to follow, form KC is probably best, despite some semantic information being lost (though I think this only happens for special characters, not those that are normally used in a name).

"stringprep" was created for use with the internationalisation of DNS. It may not be perfect for all systems. The W3C has selected form C to be used instead of KC, which "stringprep" prefers.

- The basic thing I want to say is that for two systems to interoperate they have to speak the same language. Inside a system any language can be used, but when talking to the other system you have to use the same language.

To avoid complex and CPU-consuming software you want to have normalised text. If you do not require that, each NFS server and client needs to have code to handle unnormalised character strings, resulting in a lot of code and CPU time. If instead you have one well-defined character order with only one representation of each character, the code gets much simpler. Normalisation can be done at the input end, so servers can expect all data to already be normalised.

From all I have read and studied, form C or KC is the preferred choice.
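
To make "normalise at the input end" concrete, here is a minimal sketch in Python. It is not taken from the NFSv4 draft: the function names `to_wire` and `from_wire`, the default local encoding and the default form are all assumptions made for the example. The sending side decodes a filename from its local character set, normalises it, and encodes it as UTF-8; the receiving side can then compare names as plain bytes.

```python
import unicodedata


def to_wire(local_name: bytes,
            local_encoding: str = "iso-8859-1",
            form: str = "NFC") -> bytes:
    """Convert a locally encoded filename to normalised UTF-8.

    `form` is left as a parameter ("NFC" or "NFKC") because the choice
    between C and KC is exactly what is under discussion.
    """
    text = local_name.decode(local_encoding)
    return unicodedata.normalize(form, text).encode("utf-8")


def from_wire(wire_name: bytes, local_encoding: str = "iso-8859-1") -> bytes:
    """Convert a UTF-8 filename from the wire to the local encoding.

    Raises UnicodeEncodeError if the name contains characters that the
    local character set cannot represent.
    """
    return wire_name.decode("utf-8").encode(local_encoding)


if __name__ == "__main__":
    # A filename as stored on a system that uses ISO 8859-1 locally.
    latin1_name = b"\xe5ngstr\xf6m.txt"
    # The same visible name as a system that stores decomposed text
    # in UTF-8 might hold it.
    decomposed_name = "a\u030angstro\u0308m.txt".encode("utf-8")

    wire_a = to_wire(latin1_name, "iso-8859-1")
    wire_b = to_wire(decomposed_name, "utf-8")

    # Both ends now agree byte for byte, so no extra matching code is needed.
    print(wire_a == wire_b)
    print(from_wire(wire_a) == latin1_name)
```

The point of the design is that all conversion work sits at the edges; anything that only sees wire names can use a simple byte comparison.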
If form C is selected, all "equivalent" glyphs should be forbidden, to avoid more than one representation of a character. I do not want to write code to test for matches with multiple representations of a character. (There has been a lot of discussion of things like Cyrillic characters being equivalent to Latin ones. While some look the same and have the same historic origin, most people do not want to make them "equivalent". This will result in confused users, but I see no easy solution to that.)

Not normalising and just comparing bytes means that a directory can store the same filename several times, as a character can be represented by different byte sequences. But the user may see them as exactly the same.

- About legacy character sets. If the protocol says that all filenames are to be transmitted in the NFS protocol using UCS form KC (or C), the NFS servers and clients must convert from the character set encoding used on the system. Otherwise it will not work well.

For example, on my Solaris system I use ISO 8859-1 as the character set. This means that all filenames on my system are encoded using ISO 8859-1 (Latin-1). When my files are mounted using NFSv3 on a MacOS system, filenames get corrupted because there is no standard for the character set. If my NFSv3 server had converted the filenames from ISO 8859-1 into UTF-8 and the MacOS client had handled that, or if the MacOS client side had an option telling it that the filenames should be in ISO 8859-1, it would have worked. But to avoid everybody having to implement all character sets in the world, it is easiest to use ONE encoding on the wire.

- Comparing names: When you do case-insensitive comparison of characters it gets even more complex. For example, the "small letter german sharp s" can be equivalent to "SS" and is treated so by the "stringprep" document. But there are German words that have different meanings if written with "ss" or with "small letter german sharp s".
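
The difference between the two kinds of case-insensitive matching can be sketched in Python. This is only an illustration: `full_fold` relies on Python's `str.casefold()`, which applies Unicode full case folding, while `simple_fold` is a rough stand-in that I wrote for one-to-one mapping (it keeps a character unchanged whenever its lowercase form is not a single code point). The German word pair "Masse" (mass) and "Maße" (measurements) is my own choice of example.

```python
def full_fold(name: str) -> str:
    """Full case folding: one character may fold to several (sharp s -> ss)."""
    return name.casefold()


def simple_fold(name: str) -> str:
    """Approximate one-to-one folding: only single code point mappings are used."""
    folded = []
    for ch in name:
        lower = ch.lower()
        # Keep the original character if lowercasing would expand it.
        folded.append(lower if len(lower) == 1 else ch)
    return "".join(folded)


if __name__ == "__main__":
    a, b = "Masse", "Ma\u00dfe"

    # Full folding treats the two different words as the same name.
    print(full_fold(a) == full_fold(b))

    # One-to-one folding keeps them apart but still ignores plain case.
    print(simple_fold(a) == simple_fold(b))
    print(simple_fold("README") == simple_fold("ReadMe"))
```

In a real comparison the names would also be normalised (form C or KC) before folding, so that composed and decomposed spellings do not interfere with the result.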
And Chinese has a set of characters that are equivalent, but those are not handled by the "stringprep" document. I think doing one-to-one case insensitivity following the Unicode tables, combined with the Chinese SC/TC equivalence, is best (and fastest to implement and use).

- In summary: I think NFSv4 must require ONE encoding of character data: UCS normalised using form KC, or form C with forbidden characters, encoded as UTF-8. It is the responsibility of the server/client to convert between the local system encoding and the protocol format.

I will try to answer your coming questions, but it may be a day or two between them as I have a lot to do at the moment.

Dan