From owner-freebsd-hackers  Thu Aug  1 17:47:22 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id RAA08436
          for hackers-outgoing; Thu, 1 Aug 1996 17:47:22 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id RAA08431
          for <hackers@freebsd.org>; Thu, 1 Aug 1996 17:47:19 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id RAA04852; Thu, 1 Aug 1996 17:43:14 -0700
From: Terry Lambert <terry@lambert.org>
Message-Id: <199608020043.RAA04852@phaeton.artisoft.com>
Subject: Re: anyone working on upgrading the msdosfs to NetBSD levels?
To: rnordier@iafrica.com (Robert Nordier)
Date: Thu, 1 Aug 1996 17:43:14 -0700 (MST)
Cc: terry@lambert.org, hackers@freebsd.org
In-Reply-To: <199608011928.VAA00586@eac.iafrica.com> from "Robert Nordier" at Aug 1, 96 09:28:07 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> > If you get to where you need to work on name collision, let me know,
> > and I can describe the algorithm in a couple of pages.
> 
> I was doing some work on this just recently.  When you have the
> time, I'd appreciate your description.  There may be a few points
> that my derived algorithm misses.

Name collision is handled differently for "intend lookup" and
"intend create".  The "EXCLUSIVE" flag I added for lookups
that are expected to return EEXIST in the "intend create"
case (create target, rename target, etc.) will help.


When a name is passed in, it is either capable of being a valid
8.3 name or it isn't.  If it is, then the name is looked up in
the 8.3 name space.  For an 8.3 name, the LFN will always be
the same as the 8.3 name.  Note that an 8.3 name is case
insensitive: under case insensitive lookup/case sensitive storage,
the names "Foo" and "foo" are identical; both are the valid
short name "FOO".
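
A rough sketch of the "is it capable of being a valid 8.3 name" test
follows.  The function names and the exact set of permitted characters
are my assumptions (the commonly documented DOS set), not something
taken from the Win95 code:

```python
import string

# Assumption: the commonly documented DOS short-name character set.
SHORT_CHARS = set(string.ascii_uppercase + string.digits + "!#$%&'()-@^_`{}~")

def is_valid_83(name):
    """Return True if name can be stored as an 8.3 name after case folding."""
    if not name.isascii():
        return False
    base, dot, ext = name.upper().partition(".")
    if dot and not ext:          # trailing '.' with no suffix
        return False
    if "." in ext:               # more than one '.' means it must be an LFN
        return False
    if not 1 <= len(base) <= 8 or len(ext) > 3:
        return False
    return all(c in SHORT_CHARS for c in base + ext)

def short_key(name):
    """Fold case so that "Foo" and "foo" compare equal in the 8.3 name space."""
    return name.upper()
```

With this, "Short.doc" and "Foo" pass, while "VeryLongName.TXT" and
"ThisIsALong.Named.Document" do not and are treated as LFNs.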

The short name lookup will match or fail to match.  In no case is
it possible for a valid short name to have a long name form that
collides with another long name form.  Short names are looked up
in the short name space and in the long name space.  The long name
space lookup is done because of the direction collisions are
handled.


In the LFN case, the lookup is done in the long name space.  This
is because any LFN that comes into lookup will have an LFN form
if it has been previously stored.  In the LFN case, the short name
is not examined.
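
To make the two lookup paths concrete, here is a minimal sketch over a
toy directory.  The (short, long) tuple layout, the fits_83() helper
(a deliberately coarse check) and case insensitive matching of long
names are all assumptions made for illustration:

```python
def fits_83(name):
    """Coarse check: could this name be an 8.3 name?  (Illustrative only.)"""
    base, _, ext = name.partition(".")
    return (name.isascii() and " " not in name and "." not in ext
            and 1 <= len(base) <= 8 and len(ext) <= 3)

def lookup(entries, name):
    """entries is a list of (short_name, long_name) pairs for one directory."""
    folded = name.upper()
    check_short = fits_83(name)
    for short, long_name in entries:
        # A short name is looked up in the short name space...
        if check_short and short == folded:
            return (short, long_name)
        # ...and every name, short or long, is looked up in the long name
        # space.  An LFN never examines the short name.
        if long_name.upper() == folded:
            return (short, long_name)
    return None

directory = [
    ("SHORT.DOC", "Short.doc"),
    ("VERYLO~1.TXT", "VeryLongName.TXT"),
]
```

Here lookup(directory, "verylongname.txt") matches in the long name
space only, while lookup(directory, "short.DOC") matches as a short
name.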

A long file name that does not match in the create case is
canonicalized to a tentative short name.  The suffix is found by
searching for a '.' separated suffix from the right side of the LFN,
moving left.

The long name up to the first period, or the first 6 characters,
whichever is shorter, is copied moving right.

The last two characters preceding the suffix are replaced with a
provisional "tail" value: "~1".

Thus:

ThisIsALong.Named.Document	->	THISIS~1.DOC
VeryLongName.TXT		->	VERYLO~1.TXT
Short.doc			->	SHORT.DOC (8.3 name on input)
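
The steps above can be sketched as follows.  This is only an
illustration that reproduces the examples; the function name is mine,
and it does not handle spaces, characters that are illegal in short
names, or names that begin with a '.':

```python
def tentative_short_name(lfn, tail=1):
    """Build a tentative 8.3 name with a provisional numeric tail."""
    # Find the suffix by searching for '.' from the right.
    head, dot, suffix = lfn.rpartition(".")
    if not dot:                       # no '.' at all: no suffix
        head, suffix = lfn, ""
    # Copy the long name up to the first period...
    base = head.split(".", 1)[0]
    # ...leaving room for the tail in the 8 character base.
    tail_text = "~" + str(tail)
    base = base[:8 - len(tail_text)].upper()
    ext = suffix[:3].upper()          # suffix truncated to 3, as in the examples
    return base + tail_text + ("." + ext if ext else "")

print(tentative_short_name("ThisIsALong.Named.Document"))  # THISIS~1.DOC
print(tentative_short_name("VeryLongName.TXT"))            # VERYLO~1.TXT
```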


The tail is permitted to increment based on collision, "eating" the
characters leftward.  Windows95 and WindowsNT have support routines
for VFAT that do all of this for you.

For instance, "VeryLongName.TXT" has a tentative short name with
a provisional numeric "tail" of "VERYLO~1.TXT".  Now if there already
exists a "VERYLO~9.TXT", the next allowable post-collision value is
"VERYL~10.TXT".

This requires that your pattern match first match the suffix (".TXT"),
then go through matching "VERYLO" (shortened as the tail grows) against
all tail values that already exist.  This requires a full traversal of
the short name space for the directory.  Windows95 wimps out and uses a
"max" value on the tail, so if you had "VERYLO~8.TXT" but no
"VERYLO~1.TXT", it would still end up with the post-collision name
"VERYLO~9.TXT".
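
A sketch of the tail selection, showing both the "max" behaviour
described above and a first-free-slot alternative.  All of the names
here are hypothetical, and the directory is modelled as a plain list
of short names:

```python
import re

def used_tails(short_names, stem, ext):
    """Full traversal: collect the tails already used for this stem and suffix."""
    tails = set()
    for name in short_names:
        base, _, name_ext = name.partition(".")
        m = re.fullmatch(r"(.*)~([0-9]{1,6})", base)
        # The tail "eats" the stem leftward, so compare against the
        # stem truncated to leave room for '~' plus the digits.
        if m and name_ext == ext and m.group(1) == stem[:7 - len(m.group(2))]:
            tails.add(int(m.group(2)))
    return tails

def next_tail_max(tails):
    """The "max" approach: one past the highest tail seen."""
    return max(tails, default=0) + 1

def next_tail_first_free(tails):
    """Alternative: reuse the lowest unused tail."""
    n = 1
    while n in tails:
        n += 1
    return n

def with_tail(stem, ext, n):
    tail = "~" + str(n)
    return stem[:8 - len(tail)] + tail + "." + ext

existing = ["VERYLO~8.TXT", "OTHER.TXT", "VERYLO~1.DOC"]
tails = used_tails(existing, "VERYLO", "TXT")
print(with_tail("VERYLO", "TXT", next_tail_max(tails)))         # VERYLO~9.TXT
print(with_tail("VERYLO", "TXT", next_tail_first_free(tails)))  # VERYLO~1.TXT
```

Either way the whole short name space for the directory has to be
walked once to build the set of tails.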


Mac fanatics might note that since HFS has case sensitive storage
and case insensitive lookup as well, a retraversal of the directory
is probably going to be a common occurrence for Mac NS support, or for
a NetWare Volume mounting FS, which supports DOS, Mac, Unicode, OS/2,
and UNIX name spaces.  Eventually, the VOP_READDIR wants to be used
to implement the VOP_LOOKUP (adding create flags and removing
VOP_LOOKUP), and the VOP_READDIR wants to be split out into getdirblock
and getdirblockelementfromdirblock parts, per the discussion of how
to get rid of the "cookie" interface that causes so many problems
in the NFS server implementation.


> Following your suggestion of dropping the Unicode high byte, a
> primary concern is that this will itself lead to name space
> collisions.

No, it won't.  This is because you assume that lookup (open/stat/et al.)
comes in as an ISO 8859-1 character set path.  If you were masking on
the way in, *then* you could get collisions.  For short names, masking
on the way in, you *can* get collisions.  These you resolve with the
normal collision resolution algorithms.
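
To illustrate the direction argument: widening an ISO 8859-1 name to
16 bits is lossless, while masking a 16 bit name down to 8 bits is
not.  The helper names are mine, and names are modelled as lists of
16 bit values:

```python
def widen(latin1_bytes):
    """Promote an ISO 8859-1 name to 16 bit units; the high byte is always 0."""
    return list(latin1_bytes)

def mask_high_byte(units):
    """Drop bits 8-15 of each 16 bit unit (the lossy direction)."""
    return [u & 0xFF for u in units]

# Widening is lossless, so distinct 8 bit names stay distinct.
assert mask_high_byte(widen(b"caf\xe9")) == list(b"caf\xe9")

# Masking can merge names: U+0141 has the low byte 0x41, which is 'A'.
stored_a = [0x0141, 0x0042]
stored_b = [0x0041, 0x0042]
assert stored_a != stored_b
assert mask_high_byte(stored_a) == mask_high_byte(stored_b)
```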

> I'm a bit vague on the complete range of encodings, but I assume
> that LFNs could coexist in a directory where the only difference
> is in bits 8-15, which are then masked off.

No.

A path comes into Win95's IFS (which the VFS interface is replacing in
the BSD implementation) with an attribute of "terminal component is LFN"
or "terminal component is 8.3" and an attribute of "path not including
terminal components does/doesn't contain LFN components".

For the sake of matching the Win95 collision algorithm in a VFS
implementation, you will need to process each component and determine
if it is a valid 8.3 name.  If it isn't, it's an LFN.  The short name
equivalent of an LFN component passed in *always* goes to the collision
name space... it could not have been a valid short name if it is a
long name.


> Alternatively, masking off the high byte may result in a value of
> (binary) zero embedded in the LFN, or something equally undesirable.

The LFN is *always* stored as a Unicode value.  Therefore it is always
stored as a 16 bit value, not two 8 bit values, so an embedded 8 bit
zero is allowable (even expected).
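
A small demonstration of the difference between a zero byte and a
zero 16 bit value.  It assumes the LFN units are stored little-endian,
and lfn_units() is just an illustrative name:

```python
import struct

def lfn_units(name):
    """Return the raw bytes and the 16 bit units for a name (little-endian)."""
    raw = name.encode("utf-16-le")
    return raw, struct.unpack("<%dH" % (len(raw) // 2), raw)

raw, units = lfn_units("Foo.txt")
print(0 in raw)     # True: every ISO 8859-1 character carries a zero high byte
print(0 in units)   # False: no 16 bit unit is zero, so nothing is terminated
```

Code that walks the name as 16 bit values never sees a terminator in
the middle; only code that treats it as a byte string would.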

The predominance of Microsoft's ISO-10646/16 (Unicode) implementation
is precisely why I wanted a 16 bit instead of a 32 bit wchar_t, and
why I introduced the ISUNICODE flag in the path processing constants,
and why I've been arguing for a shortword character (not char) length
prefix for path processing.  POSIX compatibility in a Unicode environment
is probably best done in the library, eventually.


> Won't this entail a further algorithm to produce distinct BSD LFN
> representations, or do you foresee another way around this?

There will be a short term interoperability issue for upgrade to an
internationalized file name space, when and if it is supported at the
system call layer.  Other than that, it should be 8 bit clean for
typical use, and we can adulterate non-ISO-8859-1 names into the 8 bit
space available.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.