From owner-freebsd-hackers@FreeBSD.ORG  Tue Sep  6 15:59:41 2005
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
X-Original-To: freebsd-hackers@freebsd.org
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 906D416A420
	for <freebsd-hackers@freebsd.org>; Tue,  6 Sep 2005 15:59:41 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [204.156.12.53])
	by mx1.FreeBSD.org (Postfix) with ESMTP id CC6E843D45
	for <freebsd-hackers@freebsd.org>; Tue,  6 Sep 2005 15:59:40 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by cyrus.watson.org (Postfix) with ESMTP id 1FA6146B6B;
	Tue,  6 Sep 2005 11:59:40 -0400 (EDT)
Date: Tue, 6 Sep 2005 16:59:40 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Igor Shmukler <shmukler@mail.ru>
In-Reply-To: <E1ECfTj-000Hm4-00.shmukler-mail-ru@f12.mail.ru>
Message-ID: <20050906164912.H78038@fledge.watson.org>
References: <E1ECfTj-000Hm4-00.shmukler-mail-ru@f12.mail.ru>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Dag-Erling Smørgrav <des@des.no>,
	Sergey Uvarov <uvarovsl@mail.pnpi.spb.ru>, freebsd-hackers@freebsd.org
Subject: Re[3]: vn_fullpath() again
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 06 Sep 2005 15:59:41 -0000


On Tue, 6 Sep 2005, Igor Shmukler wrote:

> Thank you very much for a detailed reply. I was aware of many of the 
> things you mentioned, but it never hurts to hear something one more 
> time.
>
> How do you feel about small incremental improvements to name lookup?
>
> What about looking up device name in the structure itself for VCHR nodes 
> then prepending /dev/ and returning device name, as a first step?
>
> If incremental improvements sound like a good idea, maybe we could do a 
> few small modifications that would cover some additional cases. Would 
> not it be good?

This is an issue of some importance to me, as reliable naming is very 
useful in the world of security -- especially for audit trails, where you 
want to provide reliable security log information for intrusion detection 
and post-mortems.  I have a few things in mind --

I'd like to offer something like a best-effort VOP_GETPATH(), which would 
be implemented by synthetic file systems to return a path "to the file 
system root" for a node.  File systems like procfs and devfs would 
implement it to handle cases where the name cache wasn't used.  For 
example, it would return 'ptyp0' when pointed at /dev/ptyp0, and it would 
then be the name system's responsibility to work out the rest of the path, 
from the file system root back through the next file system.  File systems 
not supporting it might return EOPNOTSUPP.
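
To make the split of responsibilities concrete, here is a minimal userland 
sketch, not kernel code.  Everything in it is an assumption made for 
illustration: struct node, fs_getpath() and full_path() are hypothetical 
names, and a real VOP_GETPATH() would operate on vnodes and mount 
structures rather than on plain strings.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a vnode: only what this sketch needs. */
struct node {
	const char *fs_relative;  /* path from the file system root, or NULL */
	const char *mount_point;  /* where that file system is mounted */
};

/*
 * File system side, modelled on the proposed best-effort operation:
 * copy the path "to the file system root" into buf.  Returns 0 on
 * success, EOPNOTSUPP if the file system cannot answer, and
 * ENAMETOOLONG if buf is too small.
 */
int
fs_getpath(const struct node *np, char *buf, size_t len)
{
	assert(np != NULL && buf != NULL);
	if (np->fs_relative == NULL)
		return (EOPNOTSUPP);
	if (strlen(np->fs_relative) >= len)
		return (ENAMETOOLONG);
	strcpy(buf, np->fs_relative);
	return (0);
}

/*
 * Name system side: ask the file system for its part, then prepend
 * the path of the mount point.  Errors are passed through unchanged.
 */
int
full_path(const struct node *np, char *buf, size_t len)
{
	char rel[256];
	int error, n;

	error = fs_getpath(np, rel, sizeof(rel));
	if (error != 0)
		return (error);
	n = snprintf(buf, len, "%s/%s", np->mount_point, rel);
	if (n < 0 || (size_t)n >= len)
		return (ENAMETOOLONG);
	return (0);
}
```

With a node described as { "ptyp0", "/dev" }, full_path() fills the buffer 
with "/dev/ptyp0"; a node whose file system has no answer simply yields 
EOPNOTSUPP, and the caller can fall back to the name cache.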

One of the particularly not-possible-to-handle cases is NFS, though, and 
I'm not sure we should expect to be able to handle it.  As with the 
Sun/BSD/UNIX VFS/UFS, it really is designed around the idea of only files 
and directories being first class objects, not names.  Unlike with local 
file systems, however, there's no mechanism for cache invalidation, so we 
may simply be screwed here :-).

Robert N M Watson


>
> Thank you in advance,
>
> Igor
>
> -----Original Message-----
> From: Robert Watson <rwatson@FreeBSD.org>
> To: Igor Shmukler <shmukler@mail.ru>
> Date: Tue, 6 Sep 2005 16:21:47 +0100 (BST)
> Subject: Re[2]: vn_fullpath() again
>
>>
>> On Tue, 6 Sep 2005, Igor Shmukler wrote:
>>
>>>>> You are correct about the Unix file system organization, but does it
>>>>> mean reliable vnode to fullname conversion is not possible?
>>>>
>>>> Yes.  Get over it.
>>>
>>> Well, I do not think it is a Yes. I very much think it is a No. You
>>> should have continued reading my email 'til the middle or even farther.
>>
>> There are various tricks that can be played to increase the chances of
>> finding a name in the name cache, but those tricks run out quickly on
>> systems like NFS servers where files can be accessed without being looked
>> up since the last boot, or with background fsck.  This is a fundamental
>> property of the UNIX file system design, and while it offers some quite
>> powerful capabilities, nothing changes the fact that names are
>> fundamentally second class objects in the file system and VFS design.
>>
>> The main tricks that can be played are:
>>
>> - Don't purge intermediate but unused nodes from the name cache.  A
>>    specific design choice in FreeBSD has been to allow cache entries for
>>    unused nodes to be removed so that the nodes can be reused.  On systems
>>    that rapidly consume vnodes, this allows more vnodes to be recycled, so
>>    means more memory available.  However, it also means that it is less
>>    likely to be possible to reconstruct a name from the name cache.
>>
>> - Maintain references to cache entries instead of vnodes when accessing
>>    leaf files.  This is actually somewhat the approach taken by Linux --
>>    typically the hardest name to "identify" is the last segment to reach a
>>    file, since files can have hard links (and directories typically don't).
>>    That name can rapidly be invalidated due to renaming, unlinking,
>>    linking, and so on, and hence can be quite stale, but if you assume the
>>    name space is static, this will help out with the "files don't have
>>    parents" problem.
>>
>> - With a minor redesign of UFS, eliminating hard links, it is possible to
>>    add a directory back-pointer to the parent of a file.  In this case,
>>    there is an authoritative reference to the parent.  Mind you, this comes
>>    with many down-sides: Apple attempted to ship a UNIX system without
>>    support for hard links, and had to rapidly hack support for it back into
>>    the file system.
>>
>> - Maintain a parent back-pointer for files in the vnode, reflecting the
>>    last directory used to reach the file, so that you can search that
>>    directory to find a possible name.  This requires different reference
>>    management behavior, prevents directories from falling out of the cache
>>    if a file reached via the directory is in use, and will also require
>>    walking directories, which can be very expensive.
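
The back-pointer idea in the last item can be sketched in userland C as 
follows.  This is only an illustration under stated assumptions: struct 
nc_node and build_path() are hypothetical names, each node remembers just 
the last name and last parent used to reach it, and nothing here deals 
with locking, mount points, or stale entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node: remembers the last name and parent used to reach it. */
struct nc_node {
	const char *name;        /* last path component used for this node */
	struct nc_node *parent;  /* last directory used; NULL if unknown */
	int is_root;             /* non-zero for the root directory */
};

/*
 * Build a path by walking parent back-pointers, filling buf from the
 * end towards the start.  Returns a pointer into buf where the path
 * begins, or NULL if the chain breaks before the root or buf is too
 * small.
 */
char *
build_path(const struct nc_node *np, char *buf, size_t len)
{
	char *p;
	size_t n;

	assert(buf != NULL && len > 1);
	p = buf + len - 1;
	*p = '\0';
	if (np != NULL && np->is_root) {
		*--p = '/';
		return (p);
	}
	while (np != NULL && !np->is_root) {
		n = strlen(np->name);
		if ((size_t)(p - buf) < n + 1)
			return (NULL);  /* buffer exhausted */
		p -= n;
		memcpy(p, np->name, n);
		*--p = '/';
		np = np->parent;
	}
	if (np == NULL)
		return (NULL);  /* chain ended before the root: no name */
	return (p);
}
```

The result is only "a possible name": the back-pointer records how the 
file was last reached, not every name it has, and a node whose chain does 
not lead back to the root gets no name at all.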
>>
>> At heart, though, fundamental issues remain: files can have no names, or
>> they can be looked up using a name that is removed, yet still have another
>> name.  They can have several names.  They can be accessed without any
>> lookup.  The same name can refer to several files due to mountpoint
>> covering.  Throughout the design, names are assumed to be only fleetingly
>> valid (during the lookup), and of secondary importance after that.
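
Two of these cases, several names and no name at all, are easy to see from 
userland with plain POSIX calls.  The sketch below assumes a writable /tmp 
on a local file system that supports hard links; link_count_demo() and the 
file name template are made up for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file, give it a second name, then remove both names while
 * keeping the descriptor open.  The link count seen through the open
 * descriptor is stored in counts[] after each step.  Returns 0 on
 * success and -1 on failure.
 */
int
link_count_demo(long counts[3])
{
	char first[] = "/tmp/namedemo.XXXXXX";
	char second[sizeof(first) + 8];
	struct stat sb;
	int fd, linked = 0, error = -1;

	assert(counts != NULL);
	fd = mkstemp(first);            /* the file has one name */
	if (fd < 0)
		return (-1);
	snprintf(second, sizeof(second), "%s.link", first);

	if (fstat(fd, &sb) != 0)
		goto out;
	counts[0] = (long)sb.st_nlink;

	if (link(first, second) != 0)   /* two names, neither is "the" name */
		goto out;
	linked = 1;
	if (fstat(fd, &sb) != 0)
		goto out;
	counts[1] = (long)sb.st_nlink;

	if (unlink(first) != 0 || unlink(second) != 0)
		goto out;
	if (fstat(fd, &sb) != 0)        /* still open, but no name is left */
		goto out;
	counts[2] = (long)sb.st_nlink;
	error = 0;
out:
	/* Repeating unlink() on an already removed name fails harmlessly. */
	(void)unlink(first);
	if (linked)
		(void)unlink(second);
	(void)close(fd);
	return (error);
}
```

On a typical local file system the three counts come out as 1, 2 and 0.  
After the last step the file is still perfectly usable through the 
descriptor, yet no path in the name space refers to it.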
>>
>> Most systems I've looked at try to work around a lack of names in one of
>> four ways:
>>
>> (1) They treat the name as something valid only at time of lookup.  For
>>      example, the Solaris audit system captures a name used to look up a
>>      node, and after that it is the responsibility of the consumer of the
>>      audit trail to identify any name operations that might affect the name
>>      of an object in use, if names are important.  Typically they have to
>>      handle three names during lookup: path to process root, path from
>>      process root to cwd, and path from cwd to file.
>>
>> (2) Apple has an underlying file system, HFS+, that actually maintains a
>>      fairly strong notion of directory hierarchy, via its catalog, so you
>>      can look up parent nodes.  They maintain a vnode backpointer from
>>      children to parents in VFS, set up during lookup.  However, this
>>      breaks for several reasons: volfs, which allows access to files by
>>      device + inode number, NFS, which allows access to files not by path,
>>      and their hacks to re-add hard links using a special directory, which
>>      can result in no sensible name being returned at all.  This is why if
>>      you look at Darwin/Mac OS X audit trails, you'll often just see lists
>>      of inode numbers and device numbers instead of names.
>>
>> (3) They attempt to strengthen the name cache, either lowering the ability
>>      to recycle system memory for intermediate directories, or accepting
>>      more stale data.  Either way, the approaches fall down in the face of
>>      the fundamental design choice to deprioritize names: NFS, direct inode
>>      access, hard links, mount point grafting.
>>
>> (4) They maintain parallel data structures, such as used by HADB, to construct
>>      "directory trees", and fall back on expensive disk searching
>>      algorithms to handle edge cases, rename, NFS access, and so on.
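
For approach (1), a consumer of the audit trail has to glue the three 
recorded names back together.  A minimal sketch of that step, assuming the 
record carries the three strings described above; audit_full_name() and 
its parameters are hypothetical and are not taken from any real audit API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Combine the three names captured at lookup time:
 *   root - path to the process root ("/" or a chroot directory)
 *   cwd  - path from the process root to the cwd, starting with "/"
 *   arg  - the path string handed to the lookup
 * An absolute arg is resolved from the process root, so cwd is ignored.
 * Returns 0 on success, -1 if buf is too small.  "." and ".." are not
 * normalized.
 */
int
audit_full_name(const char *root, const char *cwd, const char *arg,
    char *buf, size_t len)
{
	int n;

	assert(root != NULL && cwd != NULL && arg != NULL && buf != NULL);
	if (strcmp(root, "/") == 0)
		root = "";              /* avoid a leading "//" */
	if (arg[0] == '/')
		n = snprintf(buf, len, "%s%s", root, arg);
	else if (strcmp(cwd, "/") == 0)
		n = snprintf(buf, len, "%s/%s", root, arg);
	else
		n = snprintf(buf, len, "%s%s/%s", root, cwd, arg);
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

The string produced this way is only valid as of the lookup.  Any later 
rename, unlink or mount recorded in the trail can make it stale, which is 
exactly the work that approach (1) leaves to the consumer.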
>>
>>
>> Robert N M Watson
>>
>