Date:      Mon, 16 Aug 1999 21:18:45 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        wrstuden@nas.nasa.gov
Cc:        tlambert@primenet.com, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject:   Re: BSD XFS Port & BSD VFS Rewrite
Message-ID:  <199908162118.OAA04940@usr09.primenet.com>
In-Reply-To: <Pine.SOL.3.96.990816105106.27345H-100000@marcy.nas.nasa.gov> from "Bill Studenmund" at Aug 16, 99 01:48:16 pm

> > 2.	Advisory locks are hung off private backing objects.
> > 
> > 	Advisory locks are passed into VOP_ADVLOCK in each FS
> > 	instance, and then each FS applies this by hanging the
> > 	locks off a list on a private backing object.  For FFS,
> > 	this is the in core inode.
> > 
> > 	A more correct approach would be to hang the lock off the
> > 	vnode.  This effectively obviates the need for having a
> > 	VOP_ADVLOCK at all, except for the NFS client FS, which
> > 	will need to propagate lock requests across the net.  The
> > 	most efficient mechanism for this would be to institute
> > 	a pass/fail response for VOP_ADVLOCK calls, with a default
> > 	of "pass", and an actual implementation of the operand only
> > 	in the NFS client FS.
> 
> I agree that it's better for all fs's to share this functionality as much
> as possible.
> 
> I'd vote against your implementation suggestion for VOP_ADVLOCK on an
> efficiency concern. If we actually make a VOP call, that should be the
> end of the story. I.e. either add a vnode flag to indicate pass/fail-ness,
> or add a genfs/std call to handle the problem.
> 
> I'd actually vote for the latter. Hang the byte-range locking off of the
> vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on
> OS flavor) to handle the call. That way all fs's that can share code will
> do so, and the callers need only call VOP_ADVLOCK() - no other logic.

OK.  Here's the problem with that:  NFS client locks in a stacked
FS on top of the NFS client FS.

Specifically, you need to separate the idea of asserting a lock
against the local vnode, asserting the lock via NFS locking, and
coalescing the local lock list after both have succeeded, or
reverting the local assertion, should the remote assertion fail.

This is particularly important for transformative layers, specifically
cryptographic or compressing layers.  A similar issue exists for
character sets, e.g. a Unicode-enabled OS mounting an ISO 8859-1
filesystem via NFS, and having to do the directory (de)bloat on
the fly.
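
As a rough sketch of that separation (hypothetical helper names
throughout; nothing below is an existing FreeBSD or NetBSD interface),
the generic case just records the lock on the vnode, while the NFS
client case asserts remotely and only coalesces the local list once
the server has agreed:

/*
 * Hedged sketch only.  vn_advlock_assert(), vn_advlock_revert(),
 * vn_advlock_coalesce() and nfs_advlock_remote() are made-up names
 * standing for "record the lock on the vnode's list", "undo that
 * record", "merge adjacent ranges" and "ask the NFS server".
 */
static int
advlock_default(struct vnode *vp, struct flock *fl, int op)
{
	int error;

	/* Local fs: the vnode's own list is authoritative. */
	error = vn_advlock_assert(vp, fl, op);
	if (error == 0)
		vn_advlock_coalesce(vp, fl);
	return (error);
}

static int
advlock_nfs_client(struct vnode *vp, struct flock *fl, int op)
{
	int error;

	/* 1. Tentatively assert against the local vnode; no coalesce yet. */
	error = vn_advlock_assert(vp, fl, op);
	if (error)
		return (error);

	/* 2. Assert the lock on the server. */
	error = nfs_advlock_remote(vp, fl, op);
	if (error) {
		/* Remote rejection: revert the local assertion. */
		vn_advlock_revert(vp, fl, op);
		return (error);
	}

	/* 3. Both succeeded: now it is safe to coalesce the local list. */
	vn_advlock_coalesce(vp, fl);
	return (0);
}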


> NetBSD actually needs this to get unionfs to work. Do you want to talk
> privately about it?

If you want.  FreeBSD needs it for unionfs and nullfs, so it's
something that would be worth airing.

I think you could say that the absence of a locking routine counts
as approval of the upper level lock.  This lets you bail on all FS's
except NFS,
where you have to deal with the approve/reject from the remote
host.  The problem with this on FreeBSD is the VFS_default stuff,
which puts a non-NULL interface on all FS's for all VOP's.


> > 3.	Object locks are implemented locally in many FS's.
> > 
> > 	The VOP_LOCK interface is implemented via vop_stdlock()
> > 	calls in many FS's.  This is done using the "vfs_default"
> > 	mechanism.  In other FS's, it's implemented locally.
> > 
> > 	The intent of the VOP_LOCK mechanism being implemented
> > 	as a VOP at all was to allow it to be proxied to another
> > 	machine over a network, using the original Heidemann
> > 	design.  This is also the reason for the use of descriptors
> > 	for all VOP arguments, since they can be opaquely proxied to
> > 	another machine via a general mechanism.  Unlike NFS based
> > 	network filesystems, this would allow you to add VOP's to
> > 	both machines, without having to teach the transport about
> > 	the new VOP for it to be usable remotely.
> 
> Just for a point of comparison, I recently got almost all the NetBSD fs's
> to use common code. After our -Lite2 merge, all fs's were either calling
> the lock manager, or using genfs_nolock() (a version for non-locking
> fs's). Now there's a struct lock * and struct lock in struct vnode. The fs
> exports its locking behavior via the struct lock *. For most fs's, the
> struct lock * points to the struct lock, and genfs_lock() feeds that to
> the lock manager.
> 
> But we've kept the ability to do something different (like call over the
> network) alive. If the struct lock * is NULL, you have to call VOP_LOCK on
> that fs. Note that this difference only matters for layered fs's -
> everything else should be calling VOP_LOCK() and letting the dispatch code
> figure out the right thing to do.

Yes, this NULL is the same NULL I suggested for advisory locks,
above.

FreeBSD has moved to more common code, but it's all call-down
based because of the vfs_default stuff again.
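
Roughly, the dispatch described above would look something like this
(a sketch, not the actual genfs code; v_vnlock stands for the exported
struct lock pointer, and the lockmgr()/VOP_LOCK() argument lists are
simplified):

int
layer_lock_lower(struct vnode *lowervp, int flags)
{
	/*
	 * If the lower fs exports its lock, a layered fs can go straight
	 * to the lock manager and share the lock with the lower vnode.
	 */
	if (lowervp->v_vnlock != NULL)
		return (lockmgr(lowervp->v_vnlock, flags, NULL));

	/*
	 * NULL: the lower fs reserves the right to do something special
	 * (e.g. proxy the lock over the network), so call down.
	 */
	return (VOP_LOCK(lowervp, flags));
}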


> > 5.	The idea of "root" vs. "non-root" mounts is inherently bad.
> > 
> > 	Right now, there are several operations, all wrapped into
> > 	a single "mount" entry point.  This is actually a partial
> > 	transition to a more canonically correct implementation.
> > 
> > 	The reason for the "root" vs. "non-root" knowledge in the
> > 	code has to do with several logical operations:
> > 
> > 	1)	"Mounting" the filesystem; that is, getting the
> > 		vnode for the device to be mounted, and doing any
> > 		FS specific operations necessary to cause the
> > 		correct in-core context to be established.
> > 
> > 	2)	Covering the vnode at the mount point.
> > 
> > 		This operation updates the vnode of the mount
> > 		point so that traversals of the mount point will
> > 		get you the root directory of the FS that was
> > 		mounted instead of the directory that is covered
> > 		by the mount.
> > 
> > 	3)	Saving the "last mounted on" information.
> > 
> > 		This is a clerical detail.  Read-only FS's, and
> > 		some read-write FS's, do not implement this.  It
> > 		is mostly a nicety for tools that manipulate FFS
> > 		directly.
> > 
> > 	4)	Initialize the FS stat information.
> > 
> > 		Part of the in-core data for any FS is the mnt_stat
> > 		data, which is what comes back from a VFS_STATFS()
> > 		call
> 
> You forgot:
> 
> 	5)	Update export lists
> 
> 		If you call the mount routine with no device name
> 		(args.fspec == 0) and with MNT_UPDATE, you get
> 		routed to the vfs_export routine

This must be the job of the upper level code, so that there is
a single control point for export information, instead of spreading
it throughout each FS's mount entry point.

> > 	The first operation is invariant.  It must be done for all
> > 	FS's, whether they are "root" or "non-root".
> > 
> > 	The second operation is specific to "non-root" FS's.  It
> > 	could be moved to common, higher level code -- specifically,
> > 	it could be moved into the mount system call.
> 
> I thought it was? Admittedly the only reference code I have is the ntfs
> code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it
> is, I thought it'd be an ok reference.

No.

Basically, what you would have is the equivalent of a variable
length "mounted volume" table, from which mappings (and exports,
based on the mappings) are externalized into the namespace.
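
For concreteness, the kind of record such a table might hold (a
purely hypothetical structure, not anything in either tree), so that
the namespace mapping and the export data are owned by the upper
level rather than by each FS's mount routine:

struct mounted_volume {
	struct mount		*mv_mount;	/* the fs instance (ops #1/#4) */
	struct vnode		*mv_covered;	/* covered vnode, if mapped (#2) */
	char			mv_path[MAXPATHLEN]; /* where it is externalized */
	struct netexport	*mv_export;	/* export info, managed here (#5) */
	LIST_ENTRY(mounted_volume) mv_entries;
};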


> > 	The third operation is also specific to "non-root" FS's.  It
> > 	could be discarded, or it could be moved to a separate VFS
> > 	operation, e.g. VFS_SETMNTINFO().  I would recommend moving
> > 	it to a separate VFSOP, instead of discarding it.  The reason
> > 	for this is that an intelligent person could reasonably decide
> > 	to add the setting of this data in newfs and tunefs, and do
> > 	away with /etc/fstab.
> > 
> > 	The fourth operation is invariant.  It must be done for all
> > 	FS's, whether they are "root" or "non-root".
> 
> For comparison, NetBSD has a mount entry point, and a mountroot entry
> point. But all the other ick is there too.

Right.  It should just have a "mount" entry point, and the rest
of the stuff moves to higher level code, called by the mount system
call, and the mountroot stuff during boot, to externalize the root
volume at the top of the hierarchy.

An ideal world would mount a / that had a /dev under it, and then
do transparent mounts over top of that.
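
Sketched out (hypothetical helpers, and a narrowed VFS_MOUNT argument
list that intentionally does not match the current one), both the
mount system call and the boot-time mountroot path would funnel
through one piece of upper level code, with the per-FS entry point
seeing only operations #1 and #4:

int
mount_common(struct mount *mp, struct vnode *devvp, struct vnode *coveredvp,
    void *fsargs, struct ucred *cred, struct proc *p)
{
	int error;

	/* #1 and #4: fs-specific work, given an already-resolved vnode. */
	error = VFS_MOUNT(mp, devvp, fsargs, cred, p);
	if (error)
		return (error);

	/* #2: cover the mount point -- fs-independent, done here. */
	if (coveredvp != NULL)
		coveredvp->v_mountedhere = mp;

	/* #3: record "last mounted on", e.g. via a VFS_SETMNTINFO() op. */
	/* #5: export list updates would also be driven from here.      */
	return (0);
}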



> > 	We can now see that we have two discrete operations:
> > 
> > 	1)	Placement of any FS, regardless of how it is intended
> > 		to be used, into the list of mounted filesystems.
> > 
> > 	2)	Mapping a filesystem from the list of mounted FS's
> > 		into the directory hierarchy.
> 
> 	3)	Updating export information.

Built into the higher level code, same place as #2.

> > 	The job of the per FS mount code should be to take a mount
> > 	structure, the vnode of a device, the FS specific arguments,
> > 	the mount point credentials, and the process requesting the
> > 	mount, and _only_ do #1 and #4.
> > 
> > 	The conversion of the root device into a vnode pointer, or
> > 	a path to a device into a vnode pointer, is the job of upper
> > 	level code -- specifically, the mount system call, and the
> > 	common code for booting.
> 
> My one concern about this is you've assumed that the user is mounting a
> device onto a filesystem.

No.  A vnode, not bdevvp.  The bdevvp conversion is for the boot-time
path in the upper level code, and only applies to the root volume.

> Layered filesystems won't do that. nullfs,
> umapfs, and unionfs will want a directory. The hierarchical storage
> system I'm working on will want a file. kernfs, procfs, and an fs which I
> haven't checked into the NetBSD tree don't really need the extra
> parameter. Supporting all these different cases would be a hassle for
> upstream code.
> 
> > 	This removes a large amount of complex code from each of
> > 	the file systems, and centralizes the maintenance task into
> > 	one set of code that either works for everyone, or no one
> > 	(removing the duplication of code/introduction of errors
> > 	issue).
> 
> Might I suggest a common library of routines which different mount
> routines can call? That way we'd get code sharing while letting the fs
> make decisions about what it expects of the input arguments.

This is the "footprint" problem, all over again.  Reject/accept (or 
"accept if no VOP") seems more elegant, and also reduces footprint.


> I've been looking forward to ripping the export updating out of the mount
> call. It'd be nice if we could rototill both FreeBSD & NetBSD's mount
> interfaces the same way at the same time. :-)

8-).


> > 7.	The struct nameidata (namei.h) is broken in conception.
> > 
> > 	One issue that recurs frequently, and remains unaddressed,
> > 	is the issue of namespace abstraction.
> > 
> > 	This issue is nowhere more apparent than in the VFAT and NTFS
> > 	filesystems, where there are two namespaces: one 8.3, and the
> > 	second, 16 bit Unicode.
> > 
> > 	The problem is one of coherency, and one of reference, and
> > 	is not easily resolved in the context of the current nameidata
> > 	structure.  Both NTFS and the VFAT FS try to cover this issue,
> > 	both with varying degrees of success.
> > 
> > 	The problem is that there is no canonical format that the
> > 	kernel can use to communicate namespace data to FS's.  Unlike
> > 	VOP_READDIR, which has the abstract (though ill-implemented)
> > 	struct dirent, there is no abstract representation of the
> > 	data in a pathname buffer, which would allow you to treat
> > 	path components as opaque entities.
> > 
> > 	One potential remedy for this situation would be to canonicalize
> > 	any path into an ordered list of components.  Ideally, this
> > 	would be done in 16 bit Unicode (looking toward the future),
> > 	but would minimally be separate components with length counts,
> > 	to allow faster rejection of non-matching components and to
> > 	avoid frequent recalculation of lengths.
> 
> NetBSD's name cache is a bit different from FreeBSD's, and might win here.
> We have just VOP_LOOKUP, which calls the cache lookup routine, rather than
> both a VOP_LOOKUP and a VOP_CACHEDLOOKUP.
> 
> Jaromir Dolecek has been discussing adding a canonicalized component name
> to the cache entries. That way the VOP_LOOKUP routine gets called,
> canonicalizes the name as it sees fit (say making it all upper case) if
> it chooses to, and hands off to the cache lookup routine. The advantage is
> that each fs can choose its own canonicalization, if it wants to. For
> instance, ffs won't do anything (it's case sensitive), while other
> case-insensitive fs's will do different things.

Can you push a Unicode name down from an appropriate system call?

I don't see any way to deal with an NT FS for characters outside
ISO 8859-1, otherwise.  8-(.
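
To make the component-list idea concrete, one possible shape for it
(hypothetical types; 16 bit Unicode is only one candidate encoding):

struct pn_component {
	size_t			pc_len;		/* length in code units, precomputed */
	u_int16_t		*pc_name;	/* e.g. 16 bit Unicode code units */
	TAILQ_ENTRY(pn_component) pc_link;
};
TAILQ_HEAD(pn_list, pn_component);	/* one entry per path component */

A lookup could then reject a non-matching component on pc_len alone,
and a case-folding or recoding fs could rewrite pc_name in place
without rescanning a flat pathname string.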


> > 9.	The implementation of namei() is POSIX non-compliant
> > 
> > 	The implementation of namei() is by means of coroutine
> > 	"recursion"; this is similar to the only recursion you can
> > 	achieve in FORTRAN.
> > 
> > 	The upshot of this is that the use of the "//" namespace
> > 	escape allowed by POSIX can not be usefully implemented.
> > 	This is because it is not possible to inherit a namespace
> > 	escape deeper than a single path component for a stack of
> > 	more than one layer in depth.
> > 
> > 	This needs to be fixed, both for "natural" SMBFS support,
> > 	and for other uses of the namespace escape (HTTP "tunnels",
> > 	extended attribute and/or resource fork access in an OS/2
> > 	HPFS or Macintosh HFS implementation, etc.), including
> > 	forward looking research.
> > 
> > 	This is related to item 7.
> 
> I'm sorry. This point didn't parse. Could you give an example?
> 
> I don't see how the namei recursion method prevents catching // as a
> namespace escape.


//apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork

You can't inherit the fact that you are looking at the resource fork
down to the terminal component, which is the ONLY place it applies.
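
To illustrate (hypothetical flag and field names only):

struct nameidata_sketch {
	u_long	ni_nsflags;	/* e.g. NS_RESOURCE_FORK, set when "//" is parsed */
	/* ... the rest of the per-lookup state ... */
};

Every per-component lookup, in every layer, would have to see
ni_nsflags so that the terminal component can be resolved in the
resource fork namespace; the current coroutine scheme can't carry
that context more than one component deep.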


> > 	Instead, I think the interface design issues (VOP_ADVLOCK,
> > 	VOP_GETPAGES, VOP_PUTPAGES, VOP_READ, VOP_WRITE, et al.)
> > 	that drive the desire to implement coherency in this
> > 	fashion should be examined.  I believe that an ideal solution
> > 	would be to never have the pages replicated at more than a
> > 	single vnode.  This would likewise solve the coherency
> > 	problem, without the additional complexity.  The issue
> > 	would devolve into locating the real backing object, and
> > 	potentially, translating extents.
> 
> As NetBSD's UBC work is moving in a similar direction, and I'm interested
> in working on a compressing fs, I'm interested in the solution you
> propose.

Matt Dillon is apparently the person doing the work here.  It seems
I am out of date on the current thinking, as the vm_object_t
approach has apparently been discarded.


> > 2.	The quota subsystem is too tightly integrated
> > 
> > 	Quotas should be an abstract stacking layer that can be
> > 	applied to any FS, instead of an FFS specific monstrosity.
> 
> It should certainly be possible to add a quota layer on top of any leaf
> fs. That way you could de-couple quotas. :-)

Yes, assuming stacking works in the first place...
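
A skeleton of what such a layer could look like (illustrative names
in the nullfs bypass style; quota_charge() and QUOTATOLOWERVP() are
stand-ins, not existing code): charge the owner up front, then pass
the write to the lower vnode.

static int
quota_write(struct vop_write_args *ap)
{
	struct vnode *lvp = QUOTATOLOWERVP(ap->a_vp);	/* lower-layer vnode */
	int error;

	/* Refuse the write up front if it would blow the owner's quota. */
	error = quota_charge(ap->a_vp, ap->a_cred, ap->a_uio->uio_resid);
	if (error)
		return (error);			/* typically EDQUOT */

	/* Otherwise the leaf fs does all the real work. */
	return (VOP_WRITE(lvp, ap->a_uio, ap->a_ioflag, ap->a_cred));
}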


> > 3.	The filesystem itself is broken for Y2038
> > 
> > 	The space which was historically reserved for the Y2038 fix
> > 	(a 64 bit time_t) was absconded with for subsecond resolution.
> > 
> > 	This change should be reverted, and fsck modified to re-zero
> > 	the values, given a specific argument.
> > 
> > 	The subsecond resolution doesn't really matter, but if it is
> > 	seen as an issue which needs to be addressed, the only value
> > 	which could reasonably require this is the modification time,
> > 	and there is sufficient free space in the inode to be able
> > 	to provide for this (there are 2x32 bit spares).
> 
> I think all the *BSD's need to do the same thing here. :-)
> 
> One other suggestion I've heard is to split the 64 bits we have for time
> into 44 bits for seconds, and 20 bits for microseconds. That's more than
> enough modification resolution, and also pushes things to past year
> 500,000 AD. Versioning the inode would cover this easily.

Ugh.  But possible...
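
The arithmetic does work out: 2^44 seconds is on the order of 557,000
years, and 2^20 = 1048576 covers 0..999999 microseconds with room to
spare.  A sketch of the packing (illustrative macros, not a proposed
on-disk format):

#define TS44_PACK(sec, usec) \
	((((u_int64_t)(sec)) << 20) | ((u_int64_t)(usec) & 0xfffff))
#define TS44_SEC(ts)	((u_int64_t)(ts) >> 20)		/* 44 bits of seconds */
#define TS44_USEC(ts)	((u_int32_t)((ts) & 0xfffff))	/* 20 bits of microseconds */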


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.





