From: Nicolas Williams (Nicolas.Williams@sun.com)
Date: 01/29/03-12:20:37 PM Z
Date: Wed, 29 Jan 2003 12:20:37 -0600
From: Nicolas Williams <Nicolas.Williams@sun.com>
Subject: Re: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Message-ID: <20030129122037.B16765@binky.central.sun.com>
On Wed, Jan 29, 2003 at 11:03:05AM +0100, Dan Oscarsson wrote:
> >> That depends on what you normalise from. If you start with any of the
> >> ISO 8859-1 character sets then converting and normalising into
> >> UCS form C is easy done without any knowledge on form D.
> >
> >That's codeset conversion. In the NFSv4 case we're talking about the
> >client sending unnormalized UTF-8 (therefore Unicode) filename strings.
> >The server has to then normalize to a canonical form (to prevent equal
> >name / unequal encoding conflicts). This process ALWAYS starts with
> >normalization to form D; end of story.
>
> Even so, when doing normalisation to C I expect you can combine
> the normalisation D part as a part in normalisation to C that do
> not give any (or very little) addition. I doubt that normalisation
> form C need to take more data than form D. Form D can take a lot
> more data space from the kernel so it is not suitable if
> we have kernels with little available memory to work in.
Normalization form D is the first step in normalizing to form C. The
reason is that to map decomposed characters to composed ones one needs
to look up {composed char, combining mark} tuples in a table, but to
keep the number of entries in the table down (there can be many legal,
composable combinations) one must first make sure that the input
consists of decomposed characters with combining marks ordered correctly
(the order of combining marks is not important outside this context).
So, normalization form C requires composition tables, which are not
required for normalization form D, and it requires normalizing to form D
as the first step. So normalization form C requires
more infrastructure (and tables - data) than form D. My source: chapter
14, page 569 of "Unicode Demystified" (I highly recommend this book).
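
A minimal sketch of the point above, in Python, assuming the standard library's `unicodedata` module as the normalizer (the helper name `nfc_via_nfd` is made up for illustration). It shows that form D puts combining marks into one canonical order, and that composing after an explicit form D pass gives the same result as asking for form C directly:

```python
import unicodedata

def nfc_via_nfd(text):
    """Normalize to form C in two explicit steps (hypothetical helper)."""
    # Step 1: form D, i.e. decompose and put combining marks in canonical order.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: compose, which is the part that needs the composition tables.
    return unicodedata.normalize("NFC", decomposed)

# The same visible string with its two combining marks typed in either order:
# U+0301 COMBINING ACUTE ACCENT and U+0323 COMBINING DOT BELOW.
a = "a\u0301\u0323"
b = "a\u0323\u0301"

# The marks have different canonical combining classes, so form D can sort them.
class_acute = unicodedata.combining("\u0301")
class_dot_below = unicodedata.combining("\u0323")

nfd_a = unicodedata.normalize("NFD", a)
nfd_b = unicodedata.normalize("NFD", b)

print(a == b)                              # raw strings differ
print(nfd_a == nfd_b)                      # form D agrees on one order
print(nfc_via_nfd(a) == nfc_via_nfd(b))    # so does form C
```

Because the input is already in canonical order when composition starts, the composition lookup only has to cover one ordering of each combination rather than every permutation a client might send.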
> >In other words: nothing. I.e., NFSv4 is NOT broken wrt i18n by not
> >specifying an on-the-wire normalization form for filenames. I've said
> >this more than once now - do you take issue with this statement?
>
> It is not broken, but it will make it a lot more difficult to
> implement and increase possible failure. It will result in big
> tables for normalisation everywhere, as the format will never
> match a systems internal needs (except those using unnormalised
> data which must be very rare). Optimising will be difficult.
> The end mounting a file system from a server will both need code
> to handle normalisation and conversion to local character set.
Given Mike Eisler's answers to my questions yesterday, I think we could
get by with a future recommendation to clients that they use a given
normalization form, but otherwise do no normalization or checking at the
server. That would allow equal name / unequal encoding filenames, but
if all clients are well behaved it won't happen and all would be
happy (my security instincts say that the server should nonetheless
enforce a normalization form, but I can't yet find a convincing security
argument).
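
To make the "equal name / unequal encoding" case concrete, here is a rough sketch in Python. The `Directory` class and its `enforce_nfc` flag are hypothetical and are not part of NFSv4 or of any server implementation; the only real API used is the standard `unicodedata` module. It contrasts a server that stores names exactly as sent with one that normalizes before lookup:

```python
import unicodedata

class Directory:
    """Toy directory keyed by the UTF-8 bytes of the filename (hypothetical)."""

    def __init__(self, enforce_nfc):
        self.enforce_nfc = enforce_nfc
        self.entries = {}

    def _key(self, name):
        # A normalizing server compares names in one canonical form.
        if self.enforce_nfc:
            name = unicodedata.normalize("NFC", name)
        return name.encode("utf-8")

    def create(self, name):
        key = self._key(name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = name

composed = "caf\u00e9"      # e-acute as one code point
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# No checking at the server: both names are created side by side.
lax = Directory(enforce_nfc=False)
lax.create(composed)
lax.create(decomposed)

# Normalizing server: the second create collides with the first.
strict = Directory(enforce_nfc=True)
strict.create(composed)
try:
    strict.create(decomposed)
    collided = False
except FileExistsError:
    collided = True

print(len(lax.entries), len(strict.entries), collided)
```

If every client sends the recommended form, the two directories behave identically; they differ only when some client does not.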
> >> I assume the same as that which happens now with NFSv3.
> >
> >Oh no, not at all. NFSv4 uses UTF-8, and therefore Unicode, on the
> >wire for filenames - not so for NFSv3. Big difference.
>
> I would not call it big. I get problems with NFSv3 due to not having a
> standardised character set and encoding.
> With NFSv4, if the mounted file system do not normalise and convert into
> my legacy character set, it will be just as bad.
Legacy is another issue. It is an issue regardless of whether you use
NFSv4, unless you convert your filesystems and somehow determine the
codeset of non-ASCII filenames.
> Even if I switched to UTF-8 as my local character set it will fail, if
> the UTF-8 encoded text is not normalised form C. No other form
> is acceptable to use due to things like invalid semantics, too
> much data space and complex and CPU consuming handling of that format.
User-level code will have to be able to cope with unnormalized Unicode
text [by normalizing as necessary], with or w/o NFSv4.
> You cannot expect systems to switch to unnormalised UTF-8 in their
> file system to help NFSv4. It will break most applications.
The C runtime environment will have to do as much normalization as is
necessary under the hood, yes.
> >No, normalization form C is limited to using composed characters defined
> >in Unicode 3.0. I'll search for a reference tonight and post it
> >tomorrow, but I'm quite sure of this.
>
> Normalisation form C is driven by the tables that Unicode define for each
> version. So it automatically follows each version.
Pages 572 and 573 of the same book mentioned above make it very clear to
me that composites added after Unicode 3.0 are disallowed in text
normalized to form C. The book references Unicode Annex #15.
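
The exclusion of later composites can be checked with a short Python sketch, assuming the standard `unicodedata` module and whatever Unicode version the running interpreter ships. The example character, U+2ADC FORKING, is chosen here as an assumption: as far as the Unicode data files show, it was added after 3.0, has a canonical decomposition, and is listed among the composition exclusions:

```python
import unicodedata

# U+2ADC FORKING canonically decomposes to U+2ADD NONFORKING + U+0338.
forking = "\u2adc"
forking_nfd = unicodedata.normalize("NFD", forking)
forking_nfc = unicodedata.normalize("NFC", forking)

# An older composite for contrast: "e" + COMBINING ACUTE ACCENT.
e_acute_nfc = unicodedata.normalize("NFC", "e\u0301")

print(unicodedata.unidata_version)
print([hex(ord(c)) for c in forking_nfd])
print([hex(ord(c)) for c in forking_nfc])   # stays decomposed in form C
print([hex(ord(c)) for c in e_acute_nfc])   # recomposes to U+00E9
```

The late-added composite stays decomposed even in form C, while the older one is recomposed, so text normalized under an older Unicode version remains normalized under a newer one.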
Cheers,
Nico
--