From: Noveck, Dave (Dave.Noveck@netapp.com)
Date: 01/27/03-11:47:27 AM Z
Message-ID: <C8CF60CFC4D8A74E9945E32CF096548A072A51@SILVER.nane.netapp.com>
From: "Noveck, Dave" <Dave.Noveck@netapp.com>
Subject: RE: [Dan.Oscarsson@kiconsulting.se: Comments on NFSv4 rfc3010bis-05 draft]
Date: Mon, 27 Jan 2003 09:47:27 -0800

> > But we cannot require it in storage, only on the wire.
>
> In storage we must leave the fileserver free to do as it pleases, but
> for one restriction: it must use a reasonably canonical form in storage,
> otherwise equal filenames with unequal encodings could be allowed.

How would you define "reasonably canonical"? The requirement you are
talking about is that you cannot have two different files in the same
directory with canonically equivalent strings as names. That affects the
format that will be used in storage but leaves it to the implementation
to decide exactly how.

> I believe that this is what the draft specifies. An on-the-wire
> normalization form specification would be an optimization, but is not
> absolutely necessary.

> > It is on the wire, that is between systems, that it must be standardised
> > to one simple format. Systems can use any format they want.
>
> I remain unconvinced.

> > A system which uses normalising form C as its local format for storage
> > will have a simpler implementation than others, and that will help
> > push system vendors to move to the most common format used.
> > UCS normalising form C is compact and does not destroy any information,
> > so it is best. The K forms destroy data and the D form takes more space and
> > breaks the semantic concept of letter on some letters.
>
> Normalization form C is limited to composed characters defined in
> Unicode 3.0 (we're past 3.0); as such it really means "composed for
> these characters, decomposed for everything else" - so why not just
> decomposed then?

> I don't think encoding length is that big a deal,

Clearly not.
If we cared that much about bytes sent, we wouldn't have four bytes
for each op on the request and the reply, we wouldn't have four bytes of
error code that almost always has zero in it, etc. The percentage of
additional bytes sent under any reasonable circumstances by doing D as
opposed to C is going to be very small.

> but cycles spent in
> normalization, space dedicated to normalization data structures, *that*
> is a big deal.

If it's spent on the client, no problem. Time spent on the server, on
the other hand, is very important :-)

> This is why I'm for form D (on the wire as an optimization).

It seems that form D can be checked by just looking for decomposable
characters and rejecting if any is found. On the other hand, it seems
that you need some kind of state machine to check form C.

> > Yes, it may result in additional code in servers, but many systems can
> > create very efficient code to convert between legacy character sets
> > and UCS normalising form C. So I think it will not be that expensive.

> I have no figures close at hand, but I don't think that Unicode
> normalization data structures and code are small (remember, we're
> talking about kernel constraints here, complete with small stacks).

I think kernel and non-kernel people tend to have different views of
what is considered too expensive.
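
The form D check suggested above (look for decomposable characters and reject if any is found) can be sketched in a few lines. This is only an illustration, not anything taken from the draft or from a server implementation: it is written in Python using the standard `unicodedata` module, and the function names `has_canonical_decomposition`, `looks_like_form_d` and `is_form_d` are hypothetical. Two assumptions are worth stating. First, Hangul syllables decompose algorithmically, so the sketch tests for them by code point range rather than relying on a table lookup. Second, form D also requires combining marks to appear in canonical order, which the simple scan alone does not verify, so a second function adds that comparison.

```python
import unicodedata

# Precomposed Hangul syllables decompose algorithmically, so they are
# detected by code point range instead of by a table lookup.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def has_canonical_decomposition(ch):
    """True if ch has a canonical (not compatibility) decomposition."""
    if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
        return True
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a "<tag>" prefix; canonical ones do not.
    return bool(mapping) and not mapping.startswith("<")


def looks_like_form_d(name):
    """The simple scan: reject if any decomposable character is found."""
    return not any(has_canonical_decomposition(ch) for ch in name)


def is_form_d(name):
    """The simple scan plus a check that combining marks are in order."""
    if not looks_like_form_d(name):
        return False
    prev = 0
    for ch in name:
        ccc = unicodedata.combining(ch)  # canonical combining class
        if ccc != 0 and prev > ccc:
            return False  # marks out of canonical order
        prev = ccc
    return True


if __name__ == "__main__":
    for sample in ["caf\u00e9", "cafe\u0301", "a\u0301\u0323"]:
        print(ascii(sample), looks_like_form_d(sample), is_form_d(sample))
```

Both functions make a single pass with one integer of state. The only table data needed is a per-character "decomposable" flag and the combining class, which is less than a full normalizer needs, though whether that fits the kernel constraints discussed in this thread is not something the sketch can show.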