From owner-freebsd-hackers  Fri Aug 13 18:53:33 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (Postfix) with ESMTP
	id 04EB115039; Fri, 13 Aug 1999 18:52:53 -0700 (PDT)
	(envelope-from tlambert@usr04.primenet.com)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.9.3/8.9.3) id SAA08447;
	Fri, 13 Aug 1999 18:50:58 -0700 (MST)
Received: from usr04.primenet.com(206.165.6.204)
 via SMTP by smtp03.primenet.com, id smtpdAAAHkaqzq; Fri Aug 13 18:50:53 1999
Received: (from tlambert@localhost)
	by usr04.primenet.com (8.8.5/8.8.5) id SAA23891;
	Fri, 13 Aug 1999 18:50:48 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199908140150.SAA23891@usr04.primenet.com>
Subject: Re: BSD XFS Port & BSD VFS Rewrite
To: Matthew.Alton@anheuser-busch.com (Alton Matthew)
Date: Sat, 14 Aug 1999 01:50:47 +0000 (GMT)
Cc: Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
In-Reply-To: <0740CBD1D149D31193EB0008C7C56836EB8AFC@STLABCEXG012> from "Alton, Matthew" at Aug 5, 99 06:02:47 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> I am currently conducting a thorough study of the VFS subsystem
> in preparation for an all-out effort to port SGI's XFS filesystem to
> FreeBSD 4.x at such time as SGI gives up the code.  Matt Dillon
> has written in hackers- that the VFS subsystem is presently not
> well understood by any of the active kernel code contributers and
> that it will be rewritten later this year.  This is obviously of great
> concern to me in this port.

It is of great concern to me that a rewrite, apparently because of
non-understanding, is taking place at all.

I would suggest that anyone planning on this rewrite should talk,
in depth, with John Heidemann prior to engaging in such activity.
John is very approachable, and is a deep thinker.  Any rewrite
that does not meet his original design goals for his stacking
architecture is, I think, a Very Bad Idea(tm).


> I greatly appreciate all assistance in answering the following
> questions:
> 
> 1)  What are the perceived problems with the current VFS?
> 2)  What options are available to us as remedies?
> 3)  To what extent will existing FS code require revision in order
>      to be useful after the rewrite?
> 4)  Will Chapters 6,7,8 & 9 of "The Design and Implementation of
>      the 4.4BSD Operating System" still pertain after the rewrite?
> 5)  How important are questions 3 & 4 in the design of the new
>      VFS?
> 
> I believe that the VFS is conceptually sound and that the existing
> semantics should be strictly retained in the new code.  Any new
> functionality should be added in the form of entirely new kernel 
> routines and system calls, or possibly by such means as
> converting the existing routines to the vararg format &etc.

Here are some of the problems I'm aware of, and my suggested remedies:

1.	The interface is not reflexive, with regard to cn_pnbuf.

	Specifically, path buffers are allocated by the caller, but
	not freed by the caller, and various routines in each FS
	implementation are expected to deal with this.

	Each FS duplicates code, and such duplication is subject
	to error.  Not to mention that it makes your kernel fat.
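
	A minimal user space sketch of what a reflexive interface
	could look like.  The names (struct pathbuf, pathbuf_alloc,
	fs_lookup, do_lookup) are hypothetical stand-ins, not the
	kernel's; the only point is that the layer which allocates
	the buffer also frees it, on every path, so no per FS code
	has to manage its lifetime.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the path buffer carried in cn_pnbuf. */
struct pathbuf {
	char	*pb_buf;
	size_t	 pb_len;
};

int live_buffers;	/* allocation balance, for demonstration only */

int
pathbuf_alloc(struct pathbuf *pb, const char *path)
{
	pb->pb_len = strlen(path);
	pb->pb_buf = malloc(pb->pb_len + 1);
	if (pb->pb_buf == NULL)
		return (ENOMEM);
	memcpy(pb->pb_buf, path, pb->pb_len + 1);
	live_buffers++;
	return (0);
}

void
pathbuf_free(struct pathbuf *pb)
{
	assert(pb->pb_buf != NULL);
	free(pb->pb_buf);
	pb->pb_buf = NULL;
	live_buffers--;
}

/* Hypothetical per FS routine: it may read the buffer, never free it. */
int
fs_lookup(const struct pathbuf *pb)
{
	return (pb->pb_len == 0 ? ENOENT : 0);
}

/* The caller allocates and the caller frees, on success and on error. */
int
do_lookup(const char *path)
{
	struct pathbuf pb;
	int error;

	if ((error = pathbuf_alloc(&pb, path)) != 0)
		return (error);
	error = fs_lookup(&pb);
	pathbuf_free(&pb);
	return (error);
}
```

	With this shape, an error return from the FS cannot leak or
	double free the buffer, because the FS never owns it.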

2.	Advisory locks are hung off private backing objects.

	Advisory locks are passed into VOP_ADVLOCK in each FS
	instance, and then each FS applies this by hanging the
	locks off a list on a private backing object.  For FFS,
	this is the in core inode.

	A more correct approach would be to hang the lock off the
	vnode.  This effectively obviates the need for having a
	VOP_ADVLOCK at all, except for the NFS client FS, which
	will need to propagate lock requests across the net.  The
	most efficient mechanism for this would be to institute
	a pass/fail response for VOP_ADVLOCK calls, with a default
	of "pass", and an actual implementation of the operand only
	in the NFS client FS.

	Again, each FS must duplicate the advisory locking code,
	at present, and such duplication is subject to error.
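
	A sketch of the arrangement suggested above, under heavy
	simplification: exclusive locks only, a fixed size table
	instead of a list, and hypothetical names throughout
	(vn_advlock, v_advlock, remote_refuses).  The lock list lives
	on the vnode, the conflict check is common code, and the FS
	is asked only for a pass/fail verdict, with "pass" as the
	default when it supplies nothing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAXLOCKS 8

struct advlock {
	long	start, end;	/* inclusive byte range */
	int	owner;
};

struct vnode;
typedef int advlock_veto_t(struct vnode *, const struct advlock *);

struct vnode {
	struct advlock	 v_locks[MAXLOCKS];	/* locks hang off the vnode */
	int		 v_nlocks;
	advlock_veto_t	*v_advlock;		/* NULL means default "pass" */
};

/* FS independent code: conflict check and bookkeeping. */
int
vn_advlock(struct vnode *vp, const struct advlock *req)
{
	int i;

	assert(req->start <= req->end);
	for (i = 0; i < vp->v_nlocks; i++) {
		const struct advlock *l = &vp->v_locks[i];

		if (l->owner != req->owner &&
		    req->start <= l->end && l->start <= req->end)
			return (EAGAIN);
	}
	if (vp->v_nlocks == MAXLOCKS)
		return (ENOLCK);
	/* Only an FS that supplies a veto routine is ever consulted. */
	if (vp->v_advlock != NULL && vp->v_advlock(vp, req) != 0)
		return (EAGAIN);
	vp->v_locks[vp->v_nlocks++] = *req;
	return (0);
}

/* Hypothetical NFS client style veto: pretend the server refuses. */
int
remote_refuses(struct vnode *vp, const struct advlock *req)
{
	(void)vp;
	(void)req;
	return (1);
}
```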

3.	Object locks are implemented locally in many FS's.

	The VOP_LOCK interface is implemented via vop_stdlock()
	calls in many FS's.  This is done using the "vfs_default"
	mechanism.  In other FS's, it's implemented locally.

	The intent of the VOP_LOCK mechanism being implemented
	as a VOP at all was to allow it to be proxied to another
	machine over a network, using the original Heidemann
	design.  This is also the reason for the use of descriptors
	for all VOP arguments, since they can be opaquely proxied to
	another machine via a general mechanism.  Unlike NFS based
	network filesystems, this would allow you to add VOP's to
	both machines, without having to teach the transport about
	the new VOP for it to be usable remotely.

	Like the VOP_ADVLOCK, the need for VOP_LOCK is for proxy
	purposes, and it, too, should generate a pass/fail response,
	and be largely implemented in non-filesystem specific
	higher level code.

	Again, each FS which duplicates code for this function is
	subject to duplication errors.

4.	The VOP_READDIR interface is irrational.

	The VOP_READDIR interface returns its responses in "host
	canonical format" (struct dirent, in sys/dirent.h).
	Internally, FFS operates on "directory entry blocks" that
	contain exactly these structures (an intentional coincidence).

	The problem with this approach is that it makes the getdents
	system call sensitive to file systems for which some of the
	information returned (e.g. d_fileno, d_reclen, d_type, d_namlen)
	is synthetic.  What this means is that a single directory
	block, in the native file system's directory format, must be
	able to fit into the buffer passed to the getdirentries(2)
	system call, or a directory listing is not a valid snapshot
	of the current state of the directory.

	It also vastly complicates directory traversal restarts (hence
	the ncookies and a_cookies arguments, since the NFS server
	requires the ability to restart traversal, mid-block, since
	the NFSv2 protocol returns directory entries one at a time).

	The "cookie" idea must be carried out faithfully, in an FS
	specific fashion, for each FS which is allowed to be NFS
	exported.  This code duplication is subject to error, or
	worse, non-implementation due to its complexity.

	A more rational approach would be to split the operation
	into two separate VOP's: one to acquire a snapshot of a set
	of FS specific directory entries of an arbitrary size, and
	the second to extract entries into the user's buffer, in
	canonical format.
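
	The two operation split can be sketched as follows.  All
	names here (struct dirsnap, dir_snapshot, dir_extract, the
	cut down struct cdirent) are illustrative assumptions, and
	the "snapshot" merely borrows the caller's array where a
	real one would take a private copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical FS private entry; stands in for an on-disk format. */
struct fs_entry {
	unsigned	 ino;
	const char	*name;
};

/* Snapshot handle: entries captured at one instant, plus a cursor. */
struct dirsnap {
	const struct fs_entry	*ds_ents;
	size_t			 ds_count;
	size_t			 ds_next;	/* FS independent restart point */
};

/* Simplified canonical record; the real struct dirent has more fields. */
struct cdirent {
	unsigned	d_fileno;
	char		d_name[32];
};

/* Operation 1: capture the directory contents, however large. */
void
dir_snapshot(struct dirsnap *ds, const struct fs_entry *ents, size_t n)
{
	ds->ds_ents = ents;	/* a real snapshot would copy these */
	ds->ds_count = n;
	ds->ds_next = 0;
}

/* Operation 2: copy out as many whole records as the buffer holds. */
size_t
dir_extract(struct dirsnap *ds, struct cdirent *buf, size_t nbuf)
{
	size_t n = 0;

	assert(ds->ds_next <= ds->ds_count);
	while (n < nbuf && ds->ds_next < ds->ds_count) {
		const struct fs_entry *e = &ds->ds_ents[ds->ds_next++];
		size_t len = strlen(e->name);

		if (len >= sizeof(buf[n].d_name))
			len = sizeof(buf[n].d_name) - 1;	/* truncate */
		buf[n].d_fileno = e->ino;
		memcpy(buf[n].d_name, e->name, len);
		buf[n].d_name[len] = '\0';
		n++;
	}
	return (n);
}
```

	Because the restart point is an index into the snapshot
	rather than an offset into an FS specific block, the size of
	the user's buffer no longer has to match anything on disk.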

5.	The idea of "root" vs. "non-root" mounts is inherently bad.

	Right now, there are several operations, all wrapped into
	a single "mount" entry point.  This is actually a partial
	transition to a more canonically correct implementation.

	The reason for the "root" vs. "non-root" knowledge in the
	code has to do with several logical operations:

	1)	"Mounting" the filesystem; that is, getting the
		vnode for the device to be mounted, and doing any
		FS specific operations necessary to cause the
		correct in-core context to be established.

	2)	Covering the vnode at the mount point.

		This operation updates the vnode of the mount
		point so that traversals of the mount point will
		get you the root directory of the FS that was
		mounted instead of the directory that is covered
		by the mount.

	3)	Saving the "last mounted on" information.

		This is a clerical detail.  Read-only FS's, and
		some read-write FS's, do not implement this.  It
		is mostly a nicety for tools that manipulate FFS
		directly.

	4)	Initialize the FS stat information.

		Part of the in-core data for any FS is the mnt_stat
		data, which is what comes back from a VFS_STATFS()
		call.

	The first operation is invariant.  It must be done for all
	FS's, whether they are "root" or "non-root".

	The second operation is specific to "non-root" FS's.  It
	could be moved to common, higher level code -- specifically,
	it could be moved into the mount system call.

	The third operation is also specific to "non-root" FS's.  It
	could be discarded, or it could be moved to a separate VFS
	operation, e.g. VFS_SETMNTINFO().  I would recommend moving
	it to a separate VFSOP, instead of discarding it.  The reason
	for this is that an intelligent person could reasonably decide
	to add the setting of this data in newfs and tunefs, and do
	away with /etc/fstab.

	The fourth operation is invariant.  It must be done for all
	FS's, whether they are "root" or "non-root".


	We can now see that we have two discrete operations:

	1)	Placement of any FS, regardless of how it is intended
		to be used, into the list of mounted filesystems.

	2)	Mapping a filesystem from the list of mounted FS's
		into the directory hierarchy.

	The job of the per FS mount code should be to take a mount
	structure, the vnode of a device, the FS specific arguments,
	the mount point credentials, and the process requesting the
	mount, and _only_ do #1 and #4.

	The conversion of the root device into a vnode pointer, or
	a path to a device into a vnode pointer, is the job of upper
	level code -- specifically, the mount system call, and the
	common code for booting.

	This removes a large amount of complex code from each of
	the file systems, and centralizes the maintenance task into
	one set of code that either works for everyone, or no one
	(removing the duplication of code/introduction of errors
	issue).

	In addition, the lack of "root" specific code in many FS's
	VFS_MOUNT entry points is the reason that they can not be
	mounted as "/".  This change would open it up, such that any
	FS that was supported by the kernel could be used as the
	root filesystem.
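
	A rough sketch of the division of labor described above.
	The structures are cut down to almost nothing, and every
	function name (toyfs_mount, vfs_mount_common, vfs_cover) is
	hypothetical.  The per FS routine does only steps 1 and 4;
	list insertion and covering the mount point are common code,
	and the root FS simply never gets the covering step.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mount;

struct vnode {
	struct mount	*v_mountedhere;	/* set when this vnode is covered */
};

struct mount {
	struct vnode	*mnt_devvp;	/* device being mounted */
	long		 mnt_bsize;	/* stands in for mnt_stat */
	struct mount	*mnt_next;
};

typedef int fs_mount_t(struct mount *, struct vnode *);

struct mount *mountlist;

/* Hypothetical per FS entry point: only steps 1 and 4. */
int
toyfs_mount(struct mount *mp, struct vnode *devvp)
{
	mp->mnt_devvp = devvp;		/* 1: establish in-core context */
	mp->mnt_bsize = 8192;		/* 4: initialize stat information */
	return (0);
}

/* Common code: place any FS on the list of mounted filesystems. */
int
vfs_mount_common(fs_mount_t *fsmount, struct mount *mp, struct vnode *devvp)
{
	int error;

	assert(mp != NULL && devvp != NULL);
	if ((error = fsmount(mp, devvp)) != 0)
		return (error);
	mp->mnt_next = mountlist;
	mountlist = mp;
	return (0);
}

/* Common code: map a mounted FS into the hierarchy (non-root only). */
int
vfs_cover(struct mount *mp, struct vnode *coveredvp)
{
	if (coveredvp->v_mountedhere != NULL)
		return (EBUSY);
	coveredvp->v_mountedhere = mp;
	return (0);
}
```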

6.	The "vfs_default" code damages stacking

	The intent of the stacking architecture was to have the
	default operation for any VOP unknown to an FS fall through
	to the lower level code, and fail if it was not implemented.

	The use of the "vfs_default" to make unimplemented VOP's
	fall through to code which implements function, while well
	intentioned, is misguided.

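	Consider the case of a VOP proxy that proxies requests.  These
	might be requests to another machine, as in the previous
	proxy example, or they might be requests to user space, to
	allow for easy development of new filesystem layers.  With a
	default that implements function, a VOP the proxy does not
	know about is satisfied locally by the default code, instead
	of being passed through to the layer that could service it.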

	In addition, in order to get a default operation to actually
	fail, you have to intentionally create a failing VOP for that
	particular FS.

	Finally, the paradigm can not support new VOP's without a
	kernel recompilation.  This means that in order to add to
	the list of VOP's known to the system when you add a new FS,
	you don't merely have to reallocate the in-core copy of the
	vnodeop_desc to include a new (failing) member, you have to
	create a default behaviour for it, and modify the default
	operations table.  In other words, it's not extensible, as
	it was architected to be.
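
	For contrast, a toy model of the fall-through behaviour
	described at the top of this item.  It is an assumption laden
	simplification: a fixed operation table and a loop stand in
	for the real descriptor and bypass machinery, and the names
	(struct layer, vop_call, OP_NEWOP) are invented for the
	example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { OP_READ, OP_WRITE, OP_NEWOP, OP_COUNT };

struct layer;
typedef int vop_t(struct layer *, int *who);

struct layer {
	vop_t		*l_ops[OP_COUNT];	/* NULL: not implemented here */
	struct layer	*l_lower;		/* next layer down, or NULL */
};

/*
 * An operation unknown to a layer falls through to the layer below;
 * if nobody implements it, the call fails.  No layer needs a
 * hand written failing stub.
 */
int
vop_call(struct layer *lp, int op, int *who)
{
	assert(op >= 0 && op < OP_COUNT);
	for (; lp != NULL; lp = lp->l_lower)
		if (lp->l_ops[op] != NULL)
			return (lp->l_ops[op](lp, who));
	return (EOPNOTSUPP);
}

/* Each toy operation just records which layer serviced the call. */
int
bottom_read(struct layer *lp, int *who)
{
	(void)lp;
	*who = 1;
	return (0);
}

int
upper_write(struct layer *lp, int *who)
{
	(void)lp;
	*who = 2;
	return (0);
}
```

	Here OP_NEWOP is known to the dispatcher but implemented by
	no layer, so it fails cleanly instead of landing in a default.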

7.	The struct nameidata (namei.h) is broken in conception.

	One issue that recurs frequently, and remains unaddressed,
	is the issue of namespace abstraction.

	This issue is nowhere more apparent than in the VFAT and NTFS
	filesystems, where there are two namespaces: one 8.3, and the
	second, 16 bit Unicode.

	The problem is one of coherency, and one of reference, and
	is not easily resolved in the context of the current nameidata
	structure.  Both NTFS and the VFAT FS try to cover this issue,
	with varying degrees of success.

	The problem is that there is no canonical format that the
	kernel can use to communicate namespace data to FS's.  Unlike
	VOP_READDIR, which has the abstract (though ill-implemented)
	struct dirent, there is no abstract representation of the
	data in a pathname buffer, which would allow you to treat
	path components as opaque entities.

	One potential remedy for this situation would be to canonicalize
	any path into an ordered list of components.  Ideally, this
	would be done in 16 bit Unicode (looking toward the future),
	but would minimally be separate components with length counts,
	to allow faster rejection of non-matching components, and to
	avoid frequent recalculation of length.
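
	A minimal sketch of the "ordered list of components with
	length counts" idea, using plain bytes rather than 16 bit
	Unicode.  The names (struct pcomp, path_split, pcomp_eq) are
	made up for illustration.  Note that this toy version
	collapses repeated slashes, so it would have to be changed
	to preserve a leading "//" if the namespace escape discussed
	in item 9 were to be supported.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One path component: pointer into the path plus a length count. */
struct pcomp {
	const char	*pc_name;	/* not NUL terminated */
	size_t		 pc_len;
};

/*
 * Split path on '/', skipping empty components.  Returns the number
 * of components stored, or -1 if there are more than max.
 */
int
path_split(const char *path, struct pcomp *out, int max)
{
	int n = 0;

	assert(path != NULL);
	while (*path != '\0') {
		size_t len;

		while (*path == '/')
			path++;
		len = strcspn(path, "/");
		if (len == 0)
			break;
		if (n == max)
			return (-1);
		out[n].pc_name = path;
		out[n].pc_len = len;
		n++;
		path += len;
	}
	return (n);
}

/* Compare lengths first: most mismatches never touch the bytes. */
int
pcomp_eq(const struct pcomp *pc, const char *name, size_t namelen)
{
	return (pc->pc_len == namelen &&
	    memcmp(pc->pc_name, name, namelen) == 0);
}
```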

8.	The filesystems have knowledge of the name cache.

	Entries into the name cache, and deletion of entries from
	the name cache, should be handled in FS independent code
	at a higher level.  This can avoid expensive VOP_LOOKUP calls
	in many cases, and save marshalling arguments into and out of
	the descriptor structure, in addition to drastically reducing
	the function call overhead.

	Someone recently profiling FreeBSD's FS to determine speed
	bottlenecks (I believe it was Mike Smith, attempting to
	optimize for a ZD Labs benchmark) found that FreeBSD spends
	much of its time in namei().
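
	A toy illustration of keeping the cache entirely in the
	FS independent layer.  Everything here is an assumption made
	for the example: integer directory and inode handles, a four
	entry cache with round-robin replacement, no negative
	caching, and the names vn_lookup and toyfs_lookup.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define NC_SIZE 4

struct ncentry {
	int	dir;
	char	name[16];
	int	ino;
	int	valid;
};

struct ncentry ncache[NC_SIZE];
int nc_hand;		/* trivial round-robin replacement */
int fs_calls;		/* how often the FS was actually asked */

/* Hypothetical FS lookup; it knows nothing about the cache. */
int
toyfs_lookup(int dir, const char *name, int *ino)
{
	fs_calls++;
	if (dir == 2 && strcmp(name, "etc") == 0) {
		*ino = 17;
		return (0);
	}
	return (ENOENT);
}

/* FS independent lookup: the cache is consulted and filled here only. */
int
vn_lookup(int dir, const char *name, int *ino)
{
	int error, i;

	assert(name != NULL && ino != NULL);
	for (i = 0; i < NC_SIZE; i++)
		if (ncache[i].valid && ncache[i].dir == dir &&
		    strcmp(ncache[i].name, name) == 0) {
			*ino = ncache[i].ino;	/* hit: FS never called */
			return (0);
		}
	if ((error = toyfs_lookup(dir, name, ino)) != 0)
		return (error);
	if (strlen(name) < sizeof(ncache[0].name)) {
		struct ncentry *nc = &ncache[nc_hand++ % NC_SIZE];

		nc->dir = dir;
		strcpy(nc->name, name);
		nc->ino = *ino;
		nc->valid = 1;
	}
	return (0);
}
```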

9.	The implementation of namei() is POSIX non-compliant

	The implementation of namei() is by means of coroutine
	"recursion"; this is similar to the only recursion you can
	achieve in FORTRAN.

	The upshot of this is that the use of the "//" namespace
	escape allowed by POSIX can not be usefully implemented.
	This is because it is not possible to inherit a namespace
	escape deeper than a single path component for a stack of
	more than one layer in depth.

	This needs to be fixed, both for "natural" SMBFS support,
	and for other uses of the namespace escape (HTTP "tunnels",
	extended attribute and/or resource fork access in an OS/2
	HPFS or Macintosh HFS implementation, etc.), including
	forward looking research.

	This is related to item 7.

10.	Stacking is broken

	This is really an issue of not having a coherency protocol
	which can be applied between stacks of files.  It is somewhat
	related to almost all of the above issues.

	The current thinking, which has been put forward by Matt and
	John, is that a vnode should have an associated vm_object_t,
	and that coherency should be maintained that way.

	This thinking is flawed for a number of reasons:

	a.	The main utility of this would be for an MFS
		implementation.  While a "fast MFS" is a
		laudable goal, it isn't sufficient to drive this.

	b.	A coherency protocol is required in any case,
		since a proxied VOP is not necessarily on the
		same machine or in the same VM space.  This
		approach would disallow the possibility of a
		user space filesystem development framework.

	c.	There already exist aliases (VM implementation
		errors); intentionally adding aliases as an
		implementation detail will further obfuscate them.
		Minimally, the VM system should pass a full
		branch path analysis based test procedure before
		they are introduced.  Even then, I would argue
		that it would open up a large complexity space
		that would prevent us from ever being sure about
		problem resolution again.

	d.	Filesystems which need to transform data can
		never operate correctly, since they need to
		make local copies of the transformed content.
		This includes cryptographic, character set
		translation, compression, and similar stacking
		layers.

	Instead, I think the interface design issues (VOP_ADVLOCK,
	VOP_GETPAGES, VOP_PUTPAGES, VOP_READ, VOP_WRITE, et al.)
	that drive the desire to implement coherency in this
	fashion should be examined.  I believe that an ideal solution
	would be to never have the pages replicated at more than a
	single vnode.  This would likewise solve the coherency
	problem, without the additional complexity.  The issue
	would devolve into locating the real backing object, and
	potentially, translating extents.


11.	The function call "footprint" of filesystems is too large

	Attempt the following:

		Compile up all of the files which make up an
		individual filesystem.  You can take all of
		the files for the ufs/ffs objects and the
		vnode_if.o from a compiled kernel for this
		exercise.

		Now link them.  Ignore the missing "main"; how
		many undefined functions are there?

	The problem you are seeing is the incursion of the VM
	system, and sloppy programming practices, into each VFS
	implementation.

	This footprint impacts filesystem portability, and is
	one reason, among many (including some of the above) that
	VFS modules are no longer very portable between BSD
	flavors.

	Minimally, the incursions into the VFS code need to be
	wrapped in macros, and must not assume a unified VM and
	buffer cache (or a non-unified VM and buffer cache, for that
	matter).  This would improve portability considerably.

	In addition to this change, a function minimization effort
	should take place.

	If the underlying interface utilized by VFS layers was not
	the kernel (for local media FS's, like FFS or NTFS), but
	instead a variable granularity block store with a numeric
	namespace, then the "top" and "bottom" interfaces could be
	identical.  For now, however, some work can be done (and
	should be done) to reduce the function call footprint.
	This is important work, which can only aid development
	of future work (such as a user space filesystem framework
	for use by developers and researchers).

	I hesitate to suggest this, but it might be reasonable to
	consider a struct containing externally referenced functions,
	which is registered into the FS via mount, and which is
	identical for all FS's.  This would, likewise, promote the
	idea of a user space framework.
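
	One way the "struct containing externally referenced
	functions" could look, as a hedged sketch.  The table
	members, the demofs_* routines and mem_bread are all
	hypothetical; the example only shows that an FS which calls
	out exclusively through such a table can be hosted by libc
	in user space as easily as by the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Everything the FS may call outside itself, gathered in one table. */
struct fs_externs {
	void	*(*fe_alloc)(size_t);
	void	 (*fe_free)(void *);
	int	 (*fe_bread)(long blkno, void *buf, size_t len);
};

struct demofs {
	const struct fs_externs	*df_ext;	/* registered at mount time */
	unsigned char		*df_sb;		/* toy 512 byte superblock */
};

int
demofs_mount(struct demofs *fs, const struct fs_externs *ext)
{
	int error;

	assert(ext != NULL);
	fs->df_ext = ext;
	if ((fs->df_sb = ext->fe_alloc(512)) == NULL)
		return (ENOMEM);
	if ((error = ext->fe_bread(0, fs->df_sb, 512)) != 0) {
		ext->fe_free(fs->df_sb);
		fs->df_sb = NULL;
	}
	return (error);
}

void
demofs_unmount(struct demofs *fs)
{
	fs->df_ext->fe_free(fs->df_sb);
	fs->df_sb = NULL;
}

/* User space host: fake block device that fills blocks with blkno. */
int
mem_bread(long blkno, void *buf, size_t len)
{
	memset(buf, (int)(blkno & 0xff), len);
	return (0);
}

const struct fs_externs user_externs = { malloc, free, mem_bread };
```

	Linking demofs_mount and demofs_unmount alone would leave no
	undefined symbols other than what the table supplies, which
	is the footprint reduction argued for above.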

	Ideally, work would be done to port the Heidemann framework
	to Linux, so that their developers could be leveraged.


Some FFS-specific problems are:

1.	The directory code in the UFS layer is intertwined with the
	filespace code

	Ideally, one would be able to mount a filesystem as a flat
	numeric namespace (see #7, above), and then mount the idea
	of directory management over top of that.

2.	The quota subsystem is too tightly integrated

	Quotas should be an abstract stacking layer that can be
	applied to any FS, instead of an FFS specific monstrosity.

	The current quota system is also limited to 16 bits for a
	number of values which, in FreeBSD, can be greater than
	16 bits (e.g. UID's).

	The current quota system is also broken for Y2038.

3.	The filesystem itself is broken for Y2038

	The space which was historically reserved for the Y2038 fix
	(a 64 bit time_t) was absconded with for subsecond resolution.

	This change should be reverted, and fsck modified to re-zero
	the values, given a specific argument.

	The subsecond resolution doesn't really matter, but if it is
	seen as an issue which needs to be addressed, the only value
	which could reasonably require this is the modification time,
	and there is sufficient free space in the inode to be able
	to provide for this (there are 2x32 bit spares).
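
	The arithmetic behind the Y2038 concern, as a small worked
	example.  A signed 32 bit seconds count ends at 2147483647,
	which is 03:14:07 UTC on 19 January 2038.  The split into a
	low word (the existing field) and a high word (a reclaimed
	spare) is an assumed layout for illustration, not the actual
	inode format, and the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Does a seconds count fit a signed 32 bit on-disk field? */
int
fits_time32(int64_t secs)
{
	return (secs >= INT32_MIN && secs <= INT32_MAX);
}

/*
 * Rebuild a 64 bit seconds value from a 32 bit low word and a
 * 32 bit high word.  Multiplication is used instead of a shift so
 * that negative high words are well defined.
 */
int64_t
time64_join(uint32_t lo, int32_t hi)
{
	return ((int64_t)hi * 4294967296LL + (int64_t)lo);
}

/* Split for storage; the inverse of time64_join(). */
void
time64_split(int64_t secs, uint32_t *lo, int32_t *hi)
{
	*lo = (uint32_t)(secs & 0xffffffffLL);
	*hi = (int32_t)((secs - (int64_t)*lo) / 4294967296LL);
	assert(time64_join(*lo, *hi) == secs);
}
```

	With the high word zeroed, as the fsck pass suggested above
	would leave it, existing timestamps read back unchanged.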


I have other suggestions, but the above covers the most obvious
damage.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

