From: Terry Lambert
Message-Id: <199608020043.RAA04852@phaeton.artisoft.com>
Subject: Re: anyone working on upgrading the msdosfs to NetBSD levels?
To: rnordier@iafrica.com (Robert Nordier)
Date: Thu, 1 Aug 1996 17:43:14 -0700 (MST)
Cc: terry@lambert.org, hackers@freebsd.org
In-Reply-To: <199608011928.VAA00586@eac.iafrica.com> from "Robert Nordier" at Aug 1, 96 09:28:07 pm

> > If you get to where you need to work on name collision, let me know,
> > and I can describe the algorithm in a couple of pages.
>
> I was doing some work on this just recently.  When you have the
> time, I'd appreciate your description.  There may be a few points
> that my derived algorithm misses.

Name collision is handled differently for "intend lookup" vs. "intend
create".  The "EXCLUSIVE" flag I added, for lookups which are expected
to return EEXIST in the "intend create" case (create target, rename
target, etc.), will help.

When a name is passed in, it is either capable of being a valid 8.3
name or it isn't.  If it is, then the name is looked up in the 8.3
name space.  For an 8.3 name, the LFN will always be the same as the
8.3 name.
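
As a rough illustration of the "capable of being a valid 8.3 name" test
above, here is a minimal Python sketch.  The function names and the
exact set of permitted punctuation are assumptions made for the
example, not taken from any msdosfs source; case is ignored for
validity, since the short name space is case insensitive.

```python
import string

# Assumed character set for short names: ASCII letters and digits plus
# a conservative list of punctuation.  This is an illustration only.
_PUNCT = "!#$%&'()-@^_`{}~"
ALLOWED = set(string.ascii_letters + string.digits + _PUNCT)


def is_valid_83(name):
    """Return True if name could be stored as an 8.3 short name."""
    base, dot, ext = name.partition(".")
    if not 1 <= len(base) <= 8:
        return False
    if dot and not 1 <= len(ext) <= 3:
        return False
    # '.' is not in ALLOWED, so a second dot in the suffix is rejected.
    return all(c in ALLOWED for c in base + ext)


def short_name_key(name):
    """Key for the 8.3 name space, or None if name must be an LFN."""
    return name.upper() if is_valid_83(name) else None
```

With this sketch, short_name_key("Foo") and short_name_key("foo") both
give "FOO", while "VeryLongName.TXT" gives None and would be handled
as an LFN.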

Note that the 8.3 name is a case insensitive name; the names "Foo" and
"foo" are identical for case insensitive lookup/case sensitive storage;
both are the valid short name "FOO".  The short name lookup will either
match or fail to match.  In no case is it possible for a valid short
name to have a long name form that collides with another long name
form.

Short names are looked up in the short name space and in the long name
space.  The long name space lookup is done because of the direction in
which collisions are handled.

In the LFN case, the lookup is done in the long name space.  This is
because any LFN that comes into lookup will have an LFN form if it has
been previously stored.  In the LFN case, the short name is not
examined.

In the create case, a long file name that does not match will be
canonicalized to a tentative short name by searching for a '.'
separated suffix from the right side of the LFN, moving left.  The
long name up to the first period, or 6 characters, whichever comes
first, will be copied moving right.  The last two characters preceding
the suffix are replaced with a provisional "tail" value: "~1".  Thus:

	ThisIsALong.Named.Document	-> THISIS~1.DOC
	VeryLongName.TXT		-> VERYLO~1.TXT
	Short.doc			-> SHORT.DOC (8.3 name on input)

The tail is permitted to increment based on collision, "eating" the
characters leftward.  Windows95 and WindowsNT have support routines
for VFAT that do all of this for you.

For instance, "VeryLongName.TXT" has a tentative short name with a
provisional numeric "tail" of "VERYLO~1.TXT".  Now if there already
exists a "VERYLO~9.TXT", the next allowable post-collision value is
"VERYL~10.TXT".  This requires that your pattern match first match the
suffix (".TXT"), then go through matching "VERYLO" for all possible
tail values that already exist.  This requires a full traversal of the
short name space for the directory.

Windows95 wimps out and uses a "max" value on the tail, so if you had
"VERYLO~8.TXT" but no "VERYLO~1.TXT", it would still end up with the
post-collision name "VERYLO~9.TXT".
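
The tentative short name and tail handling described above can be
sketched as follows.  This is a minimal Python illustration under
stated assumptions: the function names and the "policy" parameter are
invented for the example, characters that are illegal in short names
are not filtered, and the caller is assumed to have already decided
the name is not a valid 8.3 name.  The "max" policy follows the
Windows95 behaviour described above; "first_free" reuses the lowest
unused tail instead.

```python
import re

# A short name stem ending in a numeric tail, e.g. "VERYLO~1".
_TAIL_RE = re.compile(r"^(.*)~([1-9][0-9]*)$")


def split_long_name(long_name):
    """Return (basis, ext): upper-cased text before the first '.',
    and the first 3 characters after the last '.' (may be empty)."""
    basis = long_name.split(".", 1)[0].upper()
    _, dot, suffix = long_name.rpartition(".")
    ext = suffix[:3].upper() if dot else ""
    return basis, ext


def compose(basis, ext, n):
    """Build the short name for tail value n; a longer tail 'eats'
    basis characters leftward so the stem never exceeds 8 characters."""
    tail = "~%d" % n
    stem = basis[:8 - len(tail)] + tail
    return stem + "." + ext if ext else stem


def used_tails(basis, ext, existing):
    """Scan every short name in the directory for tails in use."""
    used = set()
    for name in existing:
        stem, _, name_ext = name.partition(".")
        if name_ext != ext:          # match the suffix first
            continue
        m = _TAIL_RE.match(stem)
        if m and m.group(1) == basis[:7 - len(m.group(2))]:
            used.add(int(m.group(2)))
    return used


def make_short_name(long_name, existing, policy="first_free"):
    basis, ext = split_long_name(long_name)
    used = used_tails(basis, ext, existing)
    if policy == "max":
        n = max(used, default=0) + 1
    else:
        n = 1
        while n in used:
            n += 1
    return compose(basis, ext, n)
```

For example, with only "VERYLO~8.TXT" present in the directory, the
"max" policy yields "VERYLO~9.TXT" for "VeryLongName.TXT", while
"first_free" yields "VERYLO~1.TXT".  Either way the whole short name
space of the directory is scanned once.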

Mac fanatics might note that since HFS has case sensitive storage and
case insensitive lookup as well, a retraversal of the directory is
probably going to be a common occurrence for Mac NS support, or for a
NetWare Volume mounting FS, which supports DOS, Mac, Unicode, OS/2,
and UNIX name spaces.

Eventually, the VOP_READDIR wants to be used to implement the
VOP_LOOKUP (adding create flags and removing VOP_LOOKUP), and the
VOP_READDIR wants to be split out into getdirblock and
getdirblockelementfromdirblock parts, per the discussion of how to get
rid of the "cookie" interface that causes so many problems in the NFS
server implementation.

> Following your suggestion of dropping the Unicode high byte, a
> primary concern is that this will itself lead to name space
> collisions.

No, it won't.  This is because you assume that lookup (open/stat/et
al.) comes in as an ISO 8859-1 character set path.  If you were
masking on the way in, *then* you could get collisions.

For short names, masking on the way in, you *can* get collisions.
These you resolve with the normal collision resolution algorithms.

> I'm a bit vague on the complete range of encodings, but I assume
> that LFNs could coexist in a directory where the only difference
> is in bits 8-15, which are then masked off.

No.  A path comes into Win95's IFS (which the VFS interface is
replacing in the BSD implementation) with an attribute of "terminal
component is LFN" or "terminal component is 8.3" and an attribute of
"path not including terminal components does/doesn't contain LFN
components".

For the sake of matching the Win95 collision algorithm in a VFS
implementation, you will need to process each component and determine
if it is a valid 8.3 name.  If it isn't, it's an LFN.  The short name
equivalent of an LFN component passed in *always* goes to the
collision name space... it could not have been a valid short name if
it is a long name.
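
To make the earlier point about masking concrete, here is a small
Python sketch of one reading of it.  It assumes the incoming path is
ISO 8859-1 bytes and that stored LFNs are sequences of 16 bit units;
the helper names are invented for the example.  Widening the incoming
path (high byte zero) and comparing against the stored units cannot
confuse two stored names, whereas masking the stored names down to
8 bits can.

```python
def utf16_units(text):
    """Stored form of an LFN: a list of 16 bit code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def widen(path_bytes):
    """ISO 8859-1 path bytes as 16 bit units with a zero high byte."""
    return list(path_bytes)


def mask_high_byte(units):
    """Drop bits 8-15 of every unit (the lossy direction)."""
    return bytes(u & 0xFF for u in units)


def lookup(path_bytes, stored_names):
    """Compare in the 16 bit domain; nothing stored is masked."""
    wanted = widen(path_bytes)
    return [name for name in stored_names if name == wanted]
```

Given stored names "AB" and "\u0141B" (U+0141 followed by "B"), a
lookup of b"AB" matches only the first, even though masking the high
byte would reduce both stored names to the same bytes.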

> Alternatively, masking off the high byte may result in a value of
> (binary) zero embedded in the LFN, or something equally undesirable.

The LFN is *always* stored as a Unicode value.  Therefore it is always
stored as a 16 bit value, not two 8 bit values, so an embedded 8 bit
zero is allowable (even expected).

The predominance of Microsoft's ISO-10646/16 (Unicode) implementation
is precisely why I wanted a 16 bit instead of a 32 bit wchar_t, why I
introduced the ISUNICODE flag in the path processing constants, and
why I've been arguing for a shortword character (not char) length
prefix for path processing.  POSIX compatibility in a Unicode
environment is probably best done in the library, eventually.

> Won't this entail a further algorithm to produce distinct BSD LFN
> representations, or do you foresee another way around this?

There will be a short term interoperability issue for upgrade to an
internationalized file name space, when and if it is supported at the
system call layer.  Other than that, it should be 8 bit clean for
typical use, and we can adulterate non-ISO-8859-1 names into the 8 bit
space available.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.