Date:      Tue, 17 Aug 1999 02:31:16 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        wrstuden@nas.nasa.gov
Cc:        tlambert@primenet.com, Matthew.Alton@anheuser-busch.com, Hackers@FreeBSD.ORG, fs@FreeBSD.ORG
Subject:   Re: BSD XFS Port & BSD VFS Rewrite
Message-ID:  <199908170231.TAA08526@usr02.primenet.com>
In-Reply-To: <Pine.SOL.3.96.990816143421.27345M-100000@marcy.nas.nasa.gov> from "Bill Studenmund" at Aug 16, 99 04:04:11 pm

> > > > 2.	Advisory locks are hung off private backing objects.
> > >
> > > I'd vote against your implementation suggestion for VOP_ADVLOCK on an
> > > efficiency concern. If we actually make a VOP call, that should be the
> > > end of the story. I.e. either add a vnode flag to indicate pass/fail-ness,
> > > or add a genfs/std call to handle the problem.
> > > 
> > > I'd actually vote for the latter. Hang the byte-range locking off of the
> > > vnode, and add a genfs_advlock() or vop_stdadvlock() routine (depending on
> > > OS flavor) to handle the call. That way all fs's that can share code, and
> > > the callers need only call VO_ADVLOCK() - no other logic.
> > 
> > OK.  Here's the problem with that:  NFS client locks in a stacked
> > FS on top of the NFS client FS.
> 
> Ahh, but it'd be the fs's decision to map genfs_advlock()/vop_stdadvlock()
> to its vop_advlock_desc entry or not. In this case, NFS wouldn't want to
> do that.
> 
> Though it would mean growing the fs footprint.

Nope; that's not really the problem.

The problem is if I have two local processes that get into a race
in order to obtain a remote lock.

Because the remote lock is not asserted, there's no way to ensure
that the order of service for the request is the same as the order
of request -- consider cooperating programs, like sendmail and pine
or elm (or whatever).

The only way to resolve this is to ensure that the cooperating
programs on the same system are lockstepped: at the client.  The
only way to do this is to assert the lock locally, then remotely,
if the local assertion succeeds.

In the case of our cooperating local processes, this resolves the
race condition (depending on F_SETLK/F_SETLKW); they behave as if
the locks were local, which is what you want.
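The local-then-remote ordering above can be sketched in user space (all
names here are hypothetical, not kernel interfaces): assert the lock
locally first, and only if that succeeds try the remote assertion,
reverting the local one if the server refuses.

```c
#include <stdbool.h>

/* Hypothetical lock state; a real implementation keys these by
 * range and owner.  remote_should_fail simulates server rejection. */
static bool local_locked;
static bool remote_locked;
static bool remote_should_fail;

static bool local_assert(void)  { if (local_locked) return false;
                                  local_locked = true; return true; }
static void local_release(void) { local_locked = false; }
static bool remote_assert(void) { if (remote_should_fail || remote_locked)
                                      return false;
                                  remote_locked = true; return true; }

/* Lockstep cooperating local processes at the client: local first,
 * then remote; back the local assertion out if the remote one fails. */
bool advlock(void)
{
    if (!local_assert())
        return false;          /* another local process already won */
    if (!remote_assert()) {
        local_release();       /* revert the local assertion */
        return false;
    }
    return true;
}
```

The point of the sketch is only the ordering and the rollback; the race
between two local claimants is decided locally, before any lock traffic
reaches the server.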


> > Specifically, you need to separate the idea of asserting a lock
> > against the local vnode, asserting the lock via NFS locking, and
> > coalescing the local lock list, after both have succeeded, or
> > reverting the local assertion, should the remote assertion fail.
> 
> Right. But my thought was that you'd be calling an NFS routine, so it
> could do the right thing.

The problem is that the local lock doesn't belong to NFS.  Even if it
did (I think this would be an error for a remotely mounted "whiteout"
in a "translucent" local FS), the problem is that in doing the local
assertion, you will intrinsically coalesce locks.

Now if the lock mode you are requesting overlaps a previous lock,
and the modes are not exactly the same, there's no way to back out
the local promotion or demotion without a coalesce.
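A toy illustration of why the back-out is lossy (hypothetical structures,
not the kernel's lockf code): once an overlapping request is coalesced
into the lock list, the pre-existing range boundaries are gone, so there
is nothing to restore if the remote assertion later fails.

```c
#include <stddef.h>

/* One owner's lock list, kept coalesced on insert (a simplification
 * of lf_setlock()-style behavior). */
struct range { long start, end; };

static struct range list[8];
static int nranges;

/* Insert [start,end) and coalesce with the first overlapping entry. */
void insert_coalesce(long start, long end)
{
    for (int i = 0; i < nranges; i++) {
        if (start <= list[i].end && end >= list[i].start) {
            if (start < list[i].start) list[i].start = start;
            if (end   > list[i].end)   list[i].end   = end;
            return;     /* the old boundaries are gone for good */
        }
    }
    list[nranges].start = start;
    list[nranges].end   = end;
    nranges++;
}
```

After inserting [0,10) and then [5,15), the list holds a single [0,15)
entry; the original [0,10) can no longer be recovered from the list alone.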

This doesn't resolve the most complex cases you could contrive, with
multiple stacking layers that don't support a distributed coherency
protocol for locks for two or more players, but it handles the local
vs. NFS issues acceptably.


> > > NetBSD actually needs this to get unionfs to work. Do you want to talk
> > > privately about it?
> > 
> > If you want.  FreeBSD needs it for unionfs and nullfs, so it's
> > something that would be worth airing.
> > 
> > I think you could say that no locking routine was an approval of
> > the upper level lock.  This lets you bail on all FS's except NFS,
> > where you have to deal with the approve/reject from the remote
> > host.  The problem with this on FreeBSD is the VFS_default stuff,
> > which puts a non-NULL interface on all FS's for all VOP's.
> 
> I'm not familiar with the VFS_default stuff. All the vop_default_desc
> routines in NetBSD point to error routines.

In FreeBSD, they now point to default routines that are *not* error
routines.  This is the problem.  I admit the change was very well
intentioned, since it made the code a hell of a lot more readable,
but choosing between readability and additional function, I take
function over form.  (I think the way I would have "fixed" the
readability is by making the operations that result in the descriptor
set for a mounted FS instance both discrete and named for their
specific function.)
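The distinction can be shown with a tiny dispatch table (names here are
hypothetical, not the real vnodeop vectors): when unimplemented slots
stay NULL, a caller or stacking layer can detect "this FS has no
handler" and apply upper-level policy; once every slot is pre-filled
with a non-error default, that information is no longer observable.

```c
#include <stddef.h>

typedef int (*vop_t)(void);

static int vop_stdlock(void) { return 0; }  /* hypothetical default */

struct vnodeops {
    vop_t vop_advlock;
};

/* NetBSD-style sparse table: the unimplemented slot stays NULL. */
static const struct vnodeops sparse_ops = { NULL };

/* FreeBSD VFS_default-style table: every slot gets a default routine,
 * so "no handler" can't be distinguished from "handler". */
static const struct vnodeops filled_ops = { vop_stdlock };

int fs_implements_advlock(const struct vnodeops *ops)
{
    return ops->vop_advlock != NULL;
}
```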


> > Yes, this NULL is the same NULL I suggested for advisory locks,
> > above.
> 
> I'm not sure. The struct lock * is only used by layered filesystems, so
> they can keep track both of the underlying vnode lock, and if needed their
> own vnode lock. For advisory locks, would we want to keep track both of
> locks on our layer and the layer below? Don't we want either one or the
> other? i.e. layers bypass to the one below, or deal with it all
> themselves.

I think you want the lock on the intermediate layer: basically, on
every vnode that has data associated with it that is unique to a
layer.  Let's not forget, also, that you can expose a layer into
the namespace in one place, and expose it covered under another
layer, at another.  If you locked down to the backing object, then
the only issue you would be left with is one or more intermediate
backing objects.

For a layer with an intermediate backing object, I'm prepared to
declare it "special", and proxy the operation down to any inferior
backing object (e.g. a union FS that adds files from two FS's
together, rather than just directory entry lists).  I think such
layers are the exception, not the rule.


> > > > 5.	The idea of "root" vs. "non-root" mounts is inherently bad.
> > > You forgot:
> > > 
> > > 	5)	Update export lists
> > > 
> > > 		If you call the mount routine with no device name
> > > 		(args.fspec == 0) and with MNT_UPDATE, you get
> > > 		routed to the vfs_export routine
> > 
> > This must be the job of the upper level code, so that there is
> > a single control point for export information, instead of spreading
> > it throughout each FS's mount entry point.
> 
> I agree it should be detangled, but think it should remain the fs's job to
> choose to call vfs_export. Otherwise an fs can't implement its own export
> policies. :-)

I think that export policies are the realm of /etc/exports.

The problem with each FS implementing its own policy is that this
is another place copyinstr() gets called when it shouldn't be.


> > > I thought it was? Admittedly the only reference code I have is the ntfs
> > > code in the NetBSD kernel. But given how full of #ifdef (__FreeBSD__)'s it
> > > is, I thought it'd be an ok reference.
> > 
> > No.
> 
> We've lost the context, but what I was trying to say was that I thought
> the marking-the-vnode-as-mounted-on bit was done in the mount syscall at
> present. At least that's what
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_syscalls.c?rev=1.130
> seems to be doing.
> 
> > Basically, what you would have is the equivalent of a variable
> > length "mounted volume" table, from which mappings (and exports,
> > based on the mappings) are externalized into the namespace.
> 
> Ahh, sounds like you're talking about a new formalism..

Right.  The "covering" operation is not the same as the "marking as
covered" operation.  Both need to be at the higher level.

If you wanted to get gross, you could say that it was a volume table,
and then use POSIX namespace escapes, such as "//DISK/2/..." to
access each disk as its own "/".

This sounds gross, but if you had 4M extents on a very large disk,
it would be nearly ideal for installing software: each package would
get its own "disk", and you could share "packages" instead of "FS"'s.
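POSIX leaves a pathname beginning with exactly two slashes
implementation-defined, which is what makes an escape like
"//DISK/2/..." legal.  A sketch of peeling the volume name off such a
path (the function name and interface are made up for illustration):

```c
#include <string.h>

/* Extract the volume name from a "//VOLUME/rest" escape into vol.
 * Returns the remainder of the path, or NULL if there is no escape.
 * Note "///..." is just "/" under POSIX, so it is not an escape. */
const char *namespace_escape(const char *path, char *vol, size_t vlen)
{
    if (path[0] != '/' || path[1] != '/' || path[2] == '/')
        return NULL;
    const char *p = path + 2;
    const char *slash = strchr(p, '/');
    size_t n = slash ? (size_t)(slash - p) : strlen(p);
    if (n == 0 || n >= vlen)
        return NULL;
    memcpy(vol, p, n);
    vol[n] = '\0';
    return slash ? slash : p + n;
}
```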

For something like mobile computing, consider a package as a shared
resource.  You have your presentation package which you mount off
your local net, you fly to New York to present, and you mount the
same presentation package from the network where you are a guest.
Forget paths, installation, and all that crap.


> > Right.  It should just have a "mount" entry point, and the rest
> > of the stuff moves to higher level code, called by the mount system
> > call, and the mountroot stuff during boot, to externalize the root
> > volume at the top of the hierarchy.
> > 
> > An ideal world would mount a / that had a /dev under it, and then
> > do transparent mounts over top of that.
> 
> That would be quite a different place than we have now. ;-)

Not really.  Julian Elisher had code that mounted a /devfs under
/ automatically, before the user was ever allowed to see /.  As a
result, the FS that you were left with was indistinguishable from
what I describe.

The only real difference is that, as a translucent mount over /devfs,
the one I describe would be capable of implementing persistent changes
to the /devfs, as whiteouts.  I don't think this is really that
desirable, but some people won't accept a devfs that doesn't have
traditional persistence semantics (e.g. "chmod" vs. modifying a
well known kernel data structure as an administrative operation).

I guess the other difference is that you don't have to worry about
large minor numbers when you are bringing up a new platform via
NFS from an old platform that can't support large minors in its FS
at all.  ;-).


> > > > 	The conversion of the root device into a vnode pointer, or
> > > > 	a path to a device into a vnode pointer, is the job of upper
> > > > 	level code -- specifically, the mount system call, and the
> > > > 	common code for booting.
> > > 
> > > My one concern about this is you've assumed that the user is mounting a
> > > device onto a filesystem.
> > 
> > No.  Vnode, not bdevvp.  The bdevvp stuff is for the boot time stuff
> > in the upper level code, and only applies to the root volume.
> 
> Maybe I mis-parsed. I thought you were talking about parsing the first
> mount option (in mount /dev/disk there, the /dev/disk option) into a
> vnode. The concern below is that different fs's have different ideas as to
> what that node should be. Some want it a device node which no one else is
> using (most leaf fs's), while some others want a directory (nullfs, etc),
> some want a file or device (the HSM system I'm working on) while others
> don't care (in mount -t kernfs /kern /kern , the first kern doesn't matter
> at all). But all is well with different support routines which the
> mount_foo() routine can call.

I would resolve this by passing a standard option to the mount code
in user space.  For root mounts, a vnode is passed down.  For other
mounts, the vnode is parsed and passed if the option is specified.

I think that you will only be able to find rare examples of FS's
that don't take device names as arguments.  But for those, you
don't specify the option, and it gets "NULL", and whatever local
options you specify.

The point is that, for FS's that can be both root and sub-root,
the mount code doesn't have to make the decision, it can be punted
to higher level code, in one place, where the code can be centrally
maintained and kept from getting "stale" when things change out
from under it.
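A user-space sketch of that single decision point (lookup() here stands
in for namei(), and every name is hypothetical): the upper-level mount
code resolves the standard device option once, and each FS's mount
entry point simply receives a vnode, or NULL when no device was named.

```c
#include <stddef.h>
#include <string.h>

struct vnode { const char *path; };

/* Simulated namei(): resolve a path to a vnode. */
static struct vnode dev_disk = { "/dev/disk" };

static struct vnode *lookup(const char *path)
{
    return (path && strcmp(path, dev_disk.path) == 0) ? &dev_disk : NULL;
}

/* Upper-level mount: resolve the standard device option in one,
 * centrally maintained place.  A kernfs-style mount passes NULL and
 * gets NULL; a leaf FS gets the resolved vnode. */
struct vnode *mount_resolve(const char *devopt)
{
    if (devopt == NULL)
        return NULL;
    return lookup(devopt);
}
```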


> > > Might I suggest a common library of routines which different mount
> > > routines can call? That way we'd get code sharing while letting the fs
> > > make decisions about what it expects of the input arguments.
> > 
> > This is the "footprint" problem, all over again.  Reject/accept (or 
> > "accept if no VOP") seems more elegant, and also reduces footprint.
> 
> Very true. The problem is that the current VFS system was designed as a
> black box. It gets handed all calls, and it gets to decide policy, and do
> everything on its own. We're now basically discussing ways of having the
> plethora of fs's we now have do things the same way. :-)

I don't think so.

I like to think in terms of "VFS consumer" and "VFS producer".  The
implied semantics are the province of the "VFS consumer".

A good example of this is to look at another VFS consumer, the NFS
server.  It really doesn't want implied semantics, and, in fact,
wants to have a set of semantics (server locking information) sent
in through a separate communications channel.  The way things are
right now, as a VFS consumer, the NFS server is a second class citizen.

One could imagine an AppleTalk or SMB server in the kernel, as well,
also VFS consumers.  And one could imagine doing VFS operations
against files _from within the kernel_ (say in a "quota" stacking
layer, or a resource fork/extended attributes stacking layer).  The
point is, you want to stop implying some semantics for these consumers.
Where you draw the line is where you imply semantics via call-down, or
via reject/accept.  If you don't want them implied all the time, for
all consumers, then they belong in the system call layer; otherwise,
they belong in the VFS layer doing the implementation.

There's an abstraction here: is the VFS stacking layer you are
talking about one that implements semantics?  For an ACL stacking
layer, your answer is yes.  But for an NFS server stacked on a
VFS?  Or a namespace hiding layer?



> > > > 7.	The struct nameidata (namei.h) is broken in conception.
> > 
> > Can you push a Unicode name down from an appropriate system call?
> > 
> > I don't see any way to deal with an NT FS for characters outside
> > ISO 8859-1, otherwise.  8-(.
> 
> Hmmm. I think the real problem is that the kernel(s) is(are) not at all
> designed well for different languages.

Well, if you make the path component descriptor into an opaque object,
you can pass it down to the point you get to someone who understands
the encapsulated data.  The interpretation is a rendezvous -- an
agreement -- between the source providing the data, and the target
interpreting it.
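One way to sketch such an opaque descriptor (the tags and names are
invented for illustration): the consumer stamps the encoding on the
component, nothing in between interprets the bytes, and only a producer
that recognizes the tag accepts it.

```c
#include <stddef.h>

/* Hypothetical encoding tags agreed on by source and target. */
enum pc_encoding { PC_ASCII, PC_ISO8859_1, PC_UTF8, PC_UCS2 };

/* An opaque path-component descriptor: a tag plus length-counted
 * bytes.  No NUL termination is assumed, so copyinstr()-style
 * string handling never touches the payload. */
struct pathcomp {
    enum pc_encoding enc;
    size_t           len;
    const void      *bytes;
};

/* A hypothetical NTFS producer accepts only encodings it can map
 * onto its on-disk Unicode names. */
int ntfs_accepts(const struct pathcomp *pc)
{
    return pc->enc == PC_UCS2 || pc->enc == PC_UTF8;
}
```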


> > > > 9.	The implementation of namei() is POSIX non-compliant
> > > > 
> > > > 	The implementation of namei() is by means of coroutine
> > > > 	"recursion"; this is similar to the only recursion you can
> > > > 	achieve in FORTRAN.

[ ... ]

> > > 
> > > I'm sorry. This point didn't parse. Could you give an example?
> > > 
> > > I don't see how the namei recursion method prevents catching // as a
> > > namespace escape.
> > 
> > 
> > //apple-resource-fork/intermediate_dir/some_other_dir/file_with_fork
> > 
> > You can't inherit the fact that you are looking at the resource fork
> > in the terminal component, ONLY.
> 
> Yep, there's no easy way to do that now.. The one thing which comes to
> mind is to have lookup() rip out the first component and save it in the
> namei struct.
> 
> Though the devil's advocate in me points out that this difficulty is not
> inherent in the recursion setup, but in how lookup() is designed. :-)

If it were a parameter, "namespace", to the function, it'd work, too.

The problem is that you really want to install "namespace handlers"
for these escapes, probably on a per FS basis.  The only way I can
see this working is to place the namespace into the path descriptor
_separately_ from the path components (however they get parsed out by
that namespace).

This shows the evils of "copyinstr()" in the full light of day:  I can't
have a "//unicode/..." name space escape, unless I assume ISO-8859-1,
like the NTFS currently does, or unless I engage in some unnatural act
with my "..." following the escape (e.g. UTF-8).


> > > > 	Quotas should be an abstract stacking layer that can be
> > > > 	applied to any FS, instead of an FFS specific monstrosity.
> > > 
> > > It should certainly be possible to add a quota layer on top of any leaf
> > > fs. That way you could de-couple quotas. :-)
> > 
> > Yes, assuming stacking works in the first place...
> 
> Except for a minor buglet with device nodes, stacking works in NetBSD at
> present. :-)

Have you tried Heidemann's student's stacking layers?  There is one
encryption, and one per-file compression with namespace hiding, that
I think it would be hard pressed to keep up with.  But I'll give it
the benefit of the doubt.  8-).


> > > One other suggestion I've heard is to split the 64 bits we have for time
> > > into 44 bits for seconds, and 20 bits for microseconds. That's more than
> > > enough modification resolution, and also pushes things to past year
> > > 500,000 AD. Versioning the inode would cover this easily.
> > 
> > Ugh.  But possible...
> 
> I agree it's ugly, but it has the advantage that it doesn't grow the
> on-disk inode. A lot of folks have designs on the remaining 64 bits free.
> :-)

Well, so long as we can resolve the issue for a long, long time;
I plan on being around to have to put up with the bugs, if I can
wrangle it... 8-).
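For what it's worth, the quoted 44/20 split holds up arithmetically --
2^44 seconds is roughly 557,000 years from the epoch, and 20 bits cover
0..1048575 microseconds -- and packs like this (the field layout here is
hypothetical, not any FS's on-disk format):

```c
#include <stdint.h>

/* High 44 bits = seconds, low 20 bits = microseconds. */
#define USEC_BITS 20
#define USEC_MASK ((1u << USEC_BITS) - 1)   /* 0xfffff */

static uint64_t ts_pack(uint64_t sec, uint32_t usec)
{
    return (sec << USEC_BITS) | (usec & USEC_MASK);
}

static uint64_t ts_sec(uint64_t ts)  { return ts >> USEC_BITS; }
static uint32_t ts_usec(uint64_t ts) { return (uint32_t)(ts & USEC_MASK); }
```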


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message



