Date: Sun, 13 Dec 1998 00:11:19 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: jkh@zippy.cdrom.com (Jordan K. Hubbard)
Cc: vmg@novator.com, hackers@FreeBSD.ORG
Subject: Re: Is it possible?
Message-ID: <199812130011.RAA12494@usr01.primenet.com>
In-Reply-To: <85152.913104877@zippy.cdrom.com> from "Jordan K. Hubbard" at Dec 8, 98 00:14:37 am
> > I have run into the proverbial brick wall. I am the administrator of
> > a fairly busy electronic commerce Web site, www.ftd.com. Because of
> > the demand placed on a single server, I implemented a load balancing
> > solution that utilizes NFS in the back end. The versions of FreeBSD
>
> Hmmm. Well, as you've already noted, NFS is not really sufficient to
> this task and never has been. There has never been any locking with
> our NFS and, as evidence would tend to suggest, never a degree of
> interest on anyone's part sufficient to actually motivate them to
> implement the functionality.
This isn't true.
Actually, Jordan was going to do this as a project in a class he
was taking, taught by Kirk McKusick...
> Even with working NFS locks, it's also probably an inferior solution
> to what many folks are doing and that's load balancing at the IP
> level. Something like the Coyote Point Systems Equalizer package
> (which is also based on FreeBSD, BTW) which takes n boxes and switches
> the traffic for them from one FreeBSD box using load metrics and other
> heuristics to determine the best match for a request would be a fine
> solution, as would any of the several other similar products on the
> market.
This is potentially true.
> Unless you're up for doing an NFS lock implementation, that is.
> Terry's patches only address some purported bugs in the general NFS
> code, they don't actually implement the lock daemon and other
> functionality you'd need to have truly working NFS locks. Evidently,
> this isn't something which has actually interested Terry enough to do
> either. :-)
Actually, my patches addressed all of the kernel locking issues not
related to implementation of the NFS client RPC code, and not related
to the requisite rpc.lockd code.
I didn't do the rpc.lockd code because you were going to. I didn't
do the NFS client RPC code because I didn't have a working rpc.lockd
on which to base an implementation.
The patches were *not* gratuitous reorganization, as I believe I can
prove; they addressed architectural issues only insofar as it was
required to address them for (1) binary compatibility with previous
fcntl(2) based non-proxy locking, (2) support of the concept of
proxy locking at all, and (3) dealing with the issue of a stacking
VFS consuming an NFS client VFS layer, the necessity of splitting
lock assertions across one or more inferior VFS's, and the
corresponding need to be able to abort a lock coalesce on a first
VFS if the operation could not be completed on the second.
Here is my architecture document, which should describe the patches
I've done (basically, all the generic kernel work), and the small
amount of work necessary to be done in user space, and in the NFS
client code. Hopefully, someone with commit privileges will
take up these ideas, since I've personally pushed them three
times without success in getting them committed.
PS: I'm pretty sure BSDI examined my code before engaging in their
own implementation, given the emails I exchanged with them over it.
Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
==========================================================================
NFS LOCKING
1.0.0.0 Introduction
NFS locking is generally desirable. BSDI has implemented
NFS locking, purportedly using some of my FreeBSD patches
as a starting point, to achieve the first implementation
of NFS locking not derived from Sun source code. What's
unfortunate about this is that they neglected to release
the code as open source (so far).
2.0.0.0 Server side locking
Server side locking is, by far, the easiest NFS locking
problem to solve.
Server side locking is support for allowing clients to
assert locks against files on an NFS server.
2.1.0.0 Theory of operation
Server side locking is implemented by allowing the client
to make RPC requests which are proxied to the server file
space via one or more processes (generally, two: rpc.lockd
and rpc.statd).
Operations are proxied into the local collision domain,
and enforced both against and by local locks, depending
on order of operation.
2.2.0.0 rpc.statd
The purpose of rpc.statd is to provide host status data
to machines that it is monitoring. This is generally used
to allow client machines to reassert locks (since the NFS
protocol is nominally stateless) following a server
restart. This means we can generally ignore rpc.statd
for the purposes of this discussion.
2.3.0.0 rpc.lockd
The purpose of rpc.lockd is to provide responses for file
and record locking requests to client machines.
Because NFS is nominally stateless, but locks themselves
are nominally stateful, there must be a container for the
lock state. In a UNIX system, containers for lock state
are called "processes". They provide an ownership context
for the locks, such that the locks can be discarded when
the NFS services are discontinued. As such, the rpc.lockd
is an essential part of the resource and state tracking
mechanism for NFS locks.
The current FreeBSD rpc.lockd unconditionally grants lock
requests; this is sufficient for Solaris interoperability,
since Solaris will complain bitterly if there is not a lockd
for a Solaris client to talk to, but is of rather limited
utility otherwise, since locks are not enforced, even in
the NFS collision domain, let alone between that domain
and other processes on the FreeBSD machine.
Note that it is possible to enforce the NFS locks within
the NFS collision domain solely in the rpc.lockd itself,
but this is generally not a sufficient answer, both because
of architectural issues having to do with the current
rpc.lockd implementation's handling of blocked requests
(it has none) and because such locks would still not be
enforced against local processes outside the NFS collision
domain.
2.3.1.0 Interface problems in FreeBSD
FreeBSD has a number of interface problems that prevent
implementation of a functional rpc.lockd that enforces
locks within both collision domains.
2.3.1.1 FreeBSD problem #1: Conversion of NFS handles to FD's
Historically, NFS locks have been asserted by converting
an NFS file handle into an open file descriptor, and
then asserting the proxy lock against the descriptor.
SOLUTION
FreeBSD must implement an F_CNVT interface,
to allow the rpc.lockd to convert an NFS
handle into an open file descriptor.
This is the first step in asserting a lock: get a file
descriptor for use as a handle to the local locking
mechanisms to perform operations on behalf of client
machines.
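As a concrete illustration, here is a minimal user-space sketch of
how an rpc.lockd might use such an F_CNVT command. The command value
and the calling convention (issuing the fcntl against any descriptor
on the exported filesystem and getting a new descriptor back) are
assumptions made for the sketch only; no such interface exists in
stock FreeBSD.

        #include <sys/types.h>
        #include <sys/mount.h>          /* fhandle_t */
        #include <fcntl.h>
        #include <stdio.h>

        #ifndef F_CNVT
        #define F_CNVT  12              /* hypothetical command value */
        #endif

        /*
         * Convert an NFS file handle into an open descriptor.  The
         * fcntl is assumed to be issued against any descriptor on the
         * exported filesystem and to return a new descriptor for the
         * handle's file.
         */
        int
        handle_to_fd(int anyfd, fhandle_t *fhp)
        {
                int fd;

                fd = fcntl(anyfd, F_CNVT, fhp);
                if (fd == -1)
                        perror("F_CNVT");
                return (fd);
        }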
2.3.1.2 FreeBSD problem #2: POSIX lock-release-on-close semantics
The second problem FreeBSD faces is that, under POSIX locking
semantics, a close by a process implicitly releases all of
that process's locks on the file.
This will not work in FreeBSD, since the same process
proxies locks for multiple remote processes, and the
semantic enforcement needs to occur on a per remote
process basis, not on a per rpc.lockd basis.
SOLUTION
FreeBSD must implement the fcntl option
F_NONPOSIX for flagging descriptors on which
POSIX unlock semantics must not be enforced.
This resolves the proxy dissolution problem: a lock release
by one remote client's process will not destroy the locks
held by all other remote clients' processes, as would happen
if POSIX semantics were enforced on that descriptor.
It also resolves the case where multiple locks are being
proxied using one descriptor ("descriptor caching"). The
rpc.lockd engages in descriptor caching by creating a hash
based on the device/inode pair for each fd that results
from a converted NFS file handle.
The purpose of this is twofold: First, it allows a single
descriptor to be resource counted for multiple clients
such that descriptors are conserved. Second, since the
file handle presented by one client may not match the file
handle presented by another, either because of intentional
NFS server drift to prevent session hijacking, or because
of local FS semantics, such as loopback mounts, union
mounts, etc., it provides a common rendezvous point for
the rpc.lockd.
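A sketch of the two caches this implies is below. Only the dev/ino
keying, the reference counting, and the use of F_NONPOSIX on a
freshly converted descriptor follow the text; the structure names,
the flat lists standing in for hash buckets, and the F_NONPOSIX
command value are assumptions of the sketch.

        #include <sys/types.h>
        #include <sys/queue.h>
        #include <fcntl.h>

        #ifndef F_NONPOSIX
        #define F_NONPOSIX      13      /* hypothetical command value */
        #endif

        struct openfile {               /* one per dev/ino pair */
                LIST_ENTRY(openfile) of_link;
                dev_t           of_dev;
                ino_t           of_ino;
                int             of_fd;          /* cached descriptor */
                int             of_refs;        /* handles referencing it */
        };

        struct clienthandle {           /* one per distinct NFS file handle */
                LIST_ENTRY(clienthandle) ch_link;
                unsigned char   ch_fh[64];      /* opaque handle bytes */
                struct openfile *ch_of;         /* shared open file */
        };

        /* a real rpc.lockd would hash these; flat lists keep the sketch short */
        static LIST_HEAD(, clienthandle) handlelist =
            LIST_HEAD_INITIALIZER(handlelist);
        static LIST_HEAD(, openfile) openlist =
            LIST_HEAD_INITIALIZER(openlist);

        /* disable POSIX release-on-close on a freshly converted descriptor */
        static int
        mark_nonposix(int fd)
        {
                return (fcntl(fd, F_NONPOSIX, 1));
        }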
2.3.1.3 FreeBSD problem #3: lack of support for proxy operations
The FreeBSD fcntl(2) interface lacks the ability to note
the use of a descriptor as a proxy, as well as the identity
of the proxied host ID and process ID.
In general, what this means is that there is no support
for proxying locks into the kernel.
SunOS 4.1.3 solved this problem once; since that is the
reference implementation for NFS locking, even today,
inside Sun Microsystems, there is no need to reinvent
the wheel (if someone feels the need, at least this
time, make it round).
SOLUTION
FreeBSD must implement F_RGETLK, F_RSETLK, and
F_RSETLKW. In addition, the flock structure
must be extended, as follows:
/* old flock structure -- required for binary compatibility */
struct oflock {
        off_t   l_start;        /* starting offset */
        off_t   l_len;          /* len = 0 means until end of file */
        pid_t   l_pid;          /* lock owner */
        short   l_type;         /* lock type: read/write, etc. */
        short   l_whence;       /* type of l_start */
};

/* new flock structure -- required for NFS/SAMBA */
struct flock {
        off_t   l_start;        /* starting offset */
        off_t   l_len;          /* len = 0 means until end of file */
        pid_t   l_pid;          /* lock owner */
        short   l_type;         /* lock type: read/write, etc. */
        short   l_whence;       /* type of l_start */
        short   l_version;      /* avoid future compat. problems */
        long    l_rsys;         /* remote system id */
        pid_t   l_rpid;         /* remote lock owner */
};
The use of an overlay structure solves the pending binary
compatibility problem easily and elegantly: the l_version, l_rpid,
and l_rsys fields are defaulted for the F_GETLK, F_SETLK, and
F_SETLKW commands. This means that these commands copy in the
same size structure as they previously used, and binary
compatibility is maintained.
For the F_RGETLK, F_RSETLK, and F_RSETLKW commands, since they
did not previously exist, binary compatibility is unnecessary,
and they can copy in the non-default l_version, l_rpid, and
l_rsys identifiers.
By fiat, the oflock l_version is 0, and the flock l_version is
1. Also by fiat, the value of l_rsys is -1 for local locks.
In particular, l_rsys is the IPv4 address of the requester;
-1 is not a legal address, and is therefore useful as a cookie
for "localhost".
This provides the framework whereby proxy operations can be
supported by FreeBSD.
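A user-level model of that overlay dispatch is sketched below. The
structures repeat the proposal above (the extended structure is named
nflock here only to avoid clashing with the system struct flock), and
the command constants and the memcpy() stand-in for the kernel
copyin() are illustrative.

        #include <sys/types.h>
        #include <string.h>

        struct oflock {                 /* old structure, as proposed above */
                off_t l_start; off_t l_len; pid_t l_pid;
                short l_type;  short l_whence;
        };
        struct nflock {                 /* proposed extended flock */
                off_t l_start; off_t l_len; pid_t l_pid;
                short l_type;  short l_whence;
                short l_version; long l_rsys; pid_t l_rpid;
        };

        enum { CMD_GETLK, CMD_SETLK, CMD_SETLKW,        /* old commands */
               CMD_RGETLK, CMD_RSETLK, CMD_RSETLKW };   /* new commands */

        static void
        copyin_flock(int cmd, const void *uaddr, struct nflock *fl)
        {
                switch (cmd) {
                case CMD_GETLK:
                case CMD_SETLK:
                case CMD_SETLKW:
                        /* old binaries pass the short structure */
                        memcpy(fl, uaddr, sizeof(struct oflock));
                        fl->l_version = 0;      /* oflock, by fiat */
                        fl->l_rsys = -1;        /* -1 == local ("localhost") */
                        fl->l_rpid = fl->l_pid;
                        break;
                default:        /* CMD_RGETLK, CMD_RSETLK, CMD_RSETLKW */
                        /* new commands pass the full structure */
                        memcpy(fl, uaddr, sizeof(struct nflock));
                        break;
                }
        }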
2.3.1.4 FreeBSD problem #4: No support for l_rsys and l_rpid.
Having an interface is only part of the battle. FreeBSD
also fails to support l_rsys and l_rpid internally.
These values must be used as uniquifiers; that is, the
value of l_pid alone is not sufficient. When l_rsys is not
-1 (localhost), the values of l_rsys and l_rpid must also
be considered in determining whether or not locks may be
coalesced.
SOLUTION
Add support to the FreeBSD locking subsystem
for these values, to be used in preventing
coalescence and in determining lock equality.
This work is rather trivial, but important.
As we shall see in section 3, "Client side locking", we will
want to defer our modifications until we have a complete
picture of the issues for *both* client and server requirements.
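The owner-equality test this implies might look like the sketch
below; the container structure and field names are illustrative,
and only the (l_pid, l_rsys, l_rpid) comparison follows the text.

        #include <sys/types.h>

        struct lockreq {
                pid_t   lr_pid;         /* local (proxy) process */
                long    lr_rsys;        /* remote system, -1 == local */
                pid_t   lr_rpid;        /* remote process */
        };

        /*
         * Two requests belong to the same owner, and so may be
         * coalesced, only if the local pid *and* the remote
         * uniquifiers all match.
         */
        static int
        same_lock_owner(const struct lockreq *a, const struct lockreq *b)
        {
                if (a->lr_pid != b->lr_pid)
                        return (0);
                if (a->lr_rsys != b->lr_rsys)
                        return (0);
                /* remote pids only matter for non-local owners */
                if (a->lr_rsys != -1 && a->lr_rpid != b->lr_rpid)
                        return (0);
                return (1);
        }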
2.3.1.5 FreeBSD problem #5: Not all local FS's support locking
We can say that any local FS that we may wish to mount
really wants to be NFS exportable.
Without getting into the issues of the FreeBSD VFS mount
code, mount handling, and mapping of mounted FS's into the
user visible hierarchy, it is very easy to see that one
requirement for supporting locking is that the underlying
FS's must also support locking.
SOLUTION
Make all underlying FS's support locking by
taking it out of the FS, and placing it at a
higher layer. Specifically, hang the lock
list off the generic vnode, not off the FS
specific inode.
This is an obvious simplification that reaps many benefits.
However, as we will discover in section 3, "Client side
locking", we will want to defer our modifications until we
have a complete picture of the issues for *both* client
and server requirements. Specifically, for VFS stacking
to function correctly where an inferior VFS happens to
be the NFS client VFS, we must preserve the VOP_ADVLOCK
interface as a veto-based mechanism, where local media
FS's never veto the operation (deferring to the upper level
code that manages the lock off the vnode), whereas the
NFS client code may, in fact, veto the operation (as could
any other VFS that proxies operations, e.g., an SMBFS).
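The shape of that veto-based interface is sketched below with a
deliberately simplified signature; the hook names and the return
convention (0 meaning "no veto") are choices made for the sketch,
not the actual VOP_ADVLOCK plumbing.

        #include <fcntl.h>              /* struct flock */

        struct vnode;                   /* opaque for this sketch */

        /* per-FS hook: return 0 for "no veto", an errno value to veto */
        typedef int vop_advlock_t(struct vnode *vp, struct flock *fl, int op);

        /* local media FS: nothing to proxy, so never veto */
        static int
        localfs_advlock(struct vnode *vp, struct flock *fl, int op)
        {
                (void)vp; (void)fl; (void)op;
                return (0);
        }

        /* generic upper-level code, called from the fcntl() path */
        static int
        vn_advlock(struct vnode *vp, vop_advlock_t *veto,
            struct flock *fl, int op)
        {
                int error;

                /* a proxying layer (e.g. the NFS client) may veto here */
                error = (*veto)(vp, fl, op);
                if (error != 0)
                        return (error);

                /*
                 * Not vetoed: commit the lock against the list hung off
                 * the vnode (lf_advlock-style common code, not shown).
                 */
                return (0);
        }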
2.3.2.0 Requirements for rpc.lockd
Once the FreeBSD interface issues have been addressed, it
is necessary to address the rpc.lockd itself. These
issues are primarily algorithmic in nature.
2.3.2.1 When any request is made
When a client makes a request, the first thing that the
rpc.lockd must do is check the client handle hash list
to determine if the rpc.lockd already has a descriptor
open on that file *for that handle*.
If a descriptor is not open for the handle, the rpc.lockd
must convert the NFS file handle into a file descriptor.
The rpc.lockd then fstats the descriptor to obtain the
dev_t and ino_t fields. This uniquely identifies the file
to the FreeBSD system in a way that, for security reasons,
the handle alone can not.
Note: If the FreeBSD system chose to avoid some of the
anti-hijack precautions it takes, this step could be avoided,
and the handle itself used as a unique identifier.
The POSIX lock-release-on-close semantics are disabled via
an fcntl using the F_NONPOSIX command.
Given the unique identifier, a hash is computed to determine
if some other client somewhere has the file open. If so,
the reference count of the structure referencing the already
open FD is incremented, and the newly converted FD is closed.
The client handle hash is updated so that subsequent operations
on the same handle do not require another conversion.
So there are two hash tables involved: the client handle
hash, and the open file hash.
Use of these hashes guarantees the minimum descriptor
footprint possible for the rpc.lockd. Since this is the
most scarce resource on the server, this is what we must
optimize.
We note at this point what we noted earlier: we must have
at least one descriptor per file in which locks are being
asserted, since we are the process container for the locks.
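A condensed sketch of this lookup flow follows. The helper functions
and the F_NONPOSIX value are placeholders carried over from the
earlier sketches; only the ordering (convert, fstat, hash on dev/ino,
share or keep the descriptor) follows the text.

        #include <sys/types.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <unistd.h>

        #ifndef F_NONPOSIX
        #define F_NONPOSIX      13                      /* hypothetical */
        #endif

        extern int convert_handle(const void *fh);      /* F_CNVT wrapper */
        extern int cache_lookup(dev_t dev, ino_t ino);  /* cached fd or -1 */
        extern void cache_insert(dev_t dev, ino_t ino, int fd);

        int
        fd_for_handle(const void *fh)
        {
                struct stat sb;
                int fd, cached;

                fd = convert_handle(fh);        /* NFS handle -> descriptor */
                if (fd == -1)
                        return (-1);
                if (fstat(fd, &sb) == -1) {     /* dev/ino names the file */
                        close(fd);
                        return (-1);
                }
                cached = cache_lookup(sb.st_dev, sb.st_ino);
                if (cached != -1) {             /* already open for another handle */
                        close(fd);              /* share the cached descriptor */
                        return (cached);
                }
                /* disable POSIX release-on-close for this descriptor */
                (void)fcntl(fd, F_NONPOSIX, 1);
                cache_insert(sb.st_dev, sb.st_ino, fd);
                return (fd);
        }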
2.3.2.2 F_RGETLK
This is a straight-forward request. The request is not
a blocking request, so it is made, and the result is
returned. The rpc.lockd fills out the l_rpid and l_rsys
as necessary to make the request.
2.3.2.3 F_RSETLK
This is likewise non-blocking, and therefore likewise
relatively trivial.
2.3.2.4 F_RSETLKW
This operation is the tough one. Because the operation
would block, we have an implementation decision.
To reduce overhead, we first try F_RSETLK; if it succeeds,
we return success. This is by far the most common outcome,
given most lock contention mechanisms in most well written
FS client software (note: FS, not NFS: programs are clients
of FS services, even for local FS's).
If this returns EAGAIN, then we must decide how to perform
the operation.
We can either fork, and have the forked process close all
its copies of the descriptors, except the one of interest,
and then implement F_RSETLKW as a blocking operation, or
we can implement F_RSETLKW as a queued operation. Finally,
we could set up a timer, and use F_RSETLK exclusively, until
it succeeds. This last is unacceptable, since it does not
guarantee that order of grant equals order of enqueueing, and
thus may break program expectations on semantics, resulting
in deadly embrace deadlocks between processes.
Given that FreeBSD supports the concept of sharing a
descriptor table between processes (via vfork(2)), the
fork option is by far the most attractive, with the
caveat that we use the vfork to get a copy of the
descriptor table shared so as to not double the fd
footprint, even for a short period of time.
We can likewise enqueue state, and process SIGCLD to ensure
that the parent rpc.lockd knows about all pending and
successful requests (necessary for proper operation of the
rpc.statd daemon).
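A control-flow sketch of the blocking case is below. The text prefers
vfork()-style sharing of the descriptor table to conserve descriptors;
plain fork() is used here only to keep the sketch short, and the
do_*() helpers stand in for the real proxy requests.

        #include <sys/types.h>
        #include <errno.h>
        #include <unistd.h>

        extern int do_rsetlk(int fd);               /* non-blocking attempt */
        extern int do_rsetlkw_blocking(int fd);     /* blocks until granted */
        extern void enqueue_blocked(pid_t child, int fd);

        int
        handle_rsetlkw(int fd)
        {
                pid_t pid;

                if (do_rsetlk(fd) == 0)     /* common case: no contention */
                        return (0);
                if (errno != EAGAIN)
                        return (-1);

                pid = fork();
                if (pid == -1)
                        return (-1);
                if (pid == 0) {
                        /*
                         * Child: close the other cached descriptors
                         * (omitted), then block until the lock is granted.
                         */
                        _exit(do_rsetlkw_blocking(fd) == 0 ? 0 : 1);
                }

                /*
                 * Parent: remember the pending request; the SIGCHLD
                 * handler reports the grant back to the client.
                 */
                enqueue_blocked(pid, fd);
                return (0);
        }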
2.3.2.5 Back to the general
Now we can go back to discussing the general implementation.
The rpc.lockd must decrement the reference count when the locks
held by a given process are removed. It can do this either by
maintaining a shadow copy of the lock state or, preferably, by
performing an F_RGETLK after a lock is released.
This is part of the resource tracking for open descriptors in
the rpc.lockd. If the request indicates that there are no
more locks held by that l_rsys/l_rpid pair, then the fd
reference count is decremented, and the per handle hash is
removed from the list. If the reference count goes to zero,
then the descriptor is closed.
DISCUSSION
It is useful to implement late-binding closes.
Specifically, it is useful to not actually delete
the reference immediately.
SOLUTION
The handle references, instead of being deleted, are
thrown onto a clock list. If the handles are
re-referenced within a tunable time frame, then they
are removed from the list and placed back into use;
otherwise, after sufficient time has elapsed, they
are inactivated as above.
This resolves the case of a single client generating a lot
of unnecessary rpc.lockd activity by issuing lock-unlock
pairs that would cause the references to bounce up and
down, requiring a lot of system calls. It preserves the
NFS handle hash for a time after the operation nominally
completes, in the expectation of future operations by that
client.
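A sketch of such a clock list follows; the structure names, the
grace period, and the sweep function are all illustrative.

        #include <sys/queue.h>
        #include <stdlib.h>
        #include <time.h>
        #include <unistd.h>

        #define CLOCK_GRACE     30      /* seconds, tunable */

        struct idleref {
                TAILQ_ENTRY(idleref) ir_link;
                int     ir_fd;
                time_t  ir_idle_since;
        };

        static TAILQ_HEAD(, idleref) clocklist =
            TAILQ_HEAD_INITIALIZER(clocklist);

        /* called instead of close() when the last lock for a handle goes away */
        static void
        park_reference(struct idleref *ir)
        {
                ir->ir_idle_since = time(NULL);
                TAILQ_INSERT_TAIL(&clocklist, ir, ir_link);
        }

        /* called when the same handle is referenced again */
        static void
        revive_reference(struct idleref *ir)
        {
                TAILQ_REMOVE(&clocklist, ir, ir_link);
        }

        /* periodic sweep: inactivate entries idle past the grace period */
        static void
        clock_sweep(void)
        {
                struct idleref *ir, *next;
                time_t now = time(NULL);

                for (ir = TAILQ_FIRST(&clocklist); ir != NULL; ir = next) {
                        next = TAILQ_NEXT(ir, ir_link);
                        if (now - ir->ir_idle_since >= CLOCK_GRACE) {
                                TAILQ_REMOVE(&clocklist, ir, ir_link);
                                close(ir->ir_fd);       /* inactivate as above */
                                free(ir);
                        }
                }
        }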
3.0.0.0 Client side locking
Client side locking is much harder than server side locking.
Client side locking allows clients to request locks from
remote NFS servers on behalf of local processes running on
the client machine.
3.1.0.0 Theory of operation
Client side locking is implemented by the client NFS code in
the kernel making RPC requests against the server, much in
the same way that NFS clients operate when making FS
operation requests against NFS servers.
It is simultaneously more difficult because of the code
being located in the kernel, and less difficult, since
there is a process context (the requesting process) to act
as a container for the operation until it is completed by
the server.
3.1.1.0 Interface problems in FreeBSD
FreeBSD has a number of interface problems that prevent
implementation of functional NFS client locking.
3.1.1.1 FreeBSD problem #1: VFS stacking and coalescence
Locks, when asserted, are coalesced by l_pid. If they
are asserted by a hosted OS API, e.g., an NFS, AppleTalk,
or SAMBA server, they are coalesced by l_rsys and l_rpid
as well; we can ignore all but l_pid in the general case,
since exporting an exported FS is foolish and dangerous.
When locks are asserted, then, the locks are coalesced if
the lock is successful. Thus, if a process had a file
[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
Protected by the locks:
[111111111] [2222222222]
[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
And asserted a third lock:
[333333333333333333]
[111111111] [2222222222]
[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
That lock would be coalesced:
[111111111111111111111111111111111]
[FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF]
For a local media FS, this is not a problem, since the
operation occurs locally, and is serialized by virtue
of that fact. But for an NFS client, the lock behaviour
is less serialized.
Consider the case of a VFS stacking layer that stacks
two filesystems, and makes the files within them appear
to be two extents of a single file. We can imagine that
this would be useful for combined log files for a cluster
of machines, and for other reasons (there are many other
examples; this is merely the simplest). So we have:
[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]
Let's perform the same locks:
[111111111] [2222222222]
[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]
So far, so good. Now the third lock:
[333333333333333333]
[111111111] [2222222222]
[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]
Coalesce, phase one:
[33333333]
[1111111111111111] [2222222222]
[ffffffffffffffff][FFFFFFFFFFFFFFFFFFFFFFFFFFF]
Oops! The second phase fails because some other client has
the lock:
[XX]
Now we need to back out the operation on the first FS:
[33333333]
[111111111]
[ffffffffffffffff]
Leaving:
[1111111]
[ffffffffffffffff]
Uh-oh: looks like we're screwed.
SOLUTION
Delayed coalescing. The locks are asserted, but
they are not committed (coalesced) until all the
operations have been deemed successful.
By dividing the phases of asserting vs. committing, we can
delay the coalescing until we know that all locks are
successfully asserted.
How do we do this? Very simply, we convert the VOP_ADVLOCK
to be a veto mechanism, instead of the mechanism by which
the lock code is actually called, and we move the locking
operations to upper level (common) code. At the same time,
we make the OS more robust, since there is only one place,
instead of many, where the code is called.
For stacking layers that stack on more than one VFS, and for
proxy layers, such as NFS, SMB, or AppleTalk client layers,
the operation is a veto, where the operation is proxied, and
if the proxy fails, then the operation is vetoed.
So in general, VOP_ADVLOCK becomes a "return(1);" for most
of the VFS layers, with specific exceptions for particular
layer types, which *may* veto the operation requested by the
upper level code.
If the operation is not vetoed, then the upper level code
commits the operation, and the lock ranges are coalesced.
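A sketch of the two-phase assert/commit for a layer stacked over
several inferior VFSs is below; the per-layer hooks and the commit
step are simplified placeholders, and only the ordering (assert
everywhere, back out on any veto, coalesce only after all succeed)
follows the text.

        struct vnode;                   /* opaque for this sketch */
        struct flock;

        struct inferior {
                struct vnode *i_vp;
                /* per-layer hooks: 0 = success/no veto, errno = veto */
                int (*i_assert)(struct vnode *, struct flock *);
                void (*i_backout)(struct vnode *, struct flock *);
        };

        extern void commit_and_coalesce(struct flock *);  /* upper level */

        int
        stacked_setlk(struct inferior *inf, int ninf, struct flock *fl)
        {
                int i, error;

                /* phase one: assert on every inferior VFS, commit nothing */
                for (i = 0; i < ninf; i++) {
                        error = (*inf[i].i_assert)(inf[i].i_vp, fl);
                        if (error != 0) {
                                /* vetoed: back out what was asserted so far */
                                while (--i >= 0)
                                        (*inf[i].i_backout)(inf[i].i_vp, fl);
                                return (error);
                        }
                }

                /* phase two: all layers agreed, coalesce in the upper level */
                commit_and_coalesce(fl);
                return (0);
        }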
3.1.1.2 FreeBSD problem #2: What if the NFS layer is first?
If the NFS layer is first, and the operation is subsequently
vetoed, how is the NFS coelesce backed out?
SOLUTION
The shadow graph. The NFS client, for each
given vnode (nfsnode), must separately maintain
the locks against the node on a per process basis.
What this means is that when a process asserts a lock on an
NFS accessed file, the NFS client locking code must maintain
an uncoalesced lock graph.
This is because the lock graph *will* be coalesced on the
server.
In order to back out the operation:
[33333333]
[111111111]
[ffffffffffffffff]
|
v
[1111111111111111]
[ffffffffffffffff]
The client must keep knowledge of the fact that these locks
are separate.
This implies that locks that result in type demotions are
not type demoted to the server (i.e., locks against the
server are only asserted in promote-only mode so that if
they are backed out, there will not have been a demotion,
for example, from write to read, on the server).
There is currently code in SAMBA which models this, since
SAMBA's consumption of the host FS is similar to an NFS
client's consumption of an NFS server's FS.
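A sketch of such a per-process shadow graph is below; the structure
and field names are illustrative, and only the ideas of keeping each
assertion uncoalesced and of backing out a single assertion exactly
follow the text.

        #include <sys/types.h>
        #include <sys/queue.h>

        struct shadowlock {
                LIST_ENTRY(shadowlock) sl_link;
                pid_t   sl_pid;         /* owning local process */
                off_t   sl_start;       /* range, as asserted */
                off_t   sl_len;
                short   sl_type;        /* read/write, as asserted */
        };

        /* hung off each nfsnode; one entry per assertion, never coalesced */
        LIST_HEAD(shadowhead, shadowlock);

        /* back out one assertion without disturbing the others */
        static void
        shadow_backout(struct shadowhead *head, struct shadowlock *sl)
        {
                (void)head;
                LIST_REMOVE(sl, sl_link);
                /*
                 * An unlock RPC for exactly [sl_start, sl_start + sl_len)
                 * would be sent here; because locks are only ever promoted
                 * on the server (see above), no demotion needs undoing.
                 */
        }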
3.2.0.0 The client NFS VFS layer's RPC calls
So far no one has implemented this. In general, it is more
important to be a server than it is to be a client, at this
time.
The amount of effort to implement this, if one has the ISO
documents, or, more obliquely and therefore more difficult,
the rpc.lockd code in the FreeBSD source tree, is pretty
small. This would make a good one quarter project for a
Bachelor of Science in Computer Science independent study
credit.
3.3.0.0 Discussion
In general, all of the issues for an NFS client in FreeBSD
apply equally to the idea of an AppleTalk or SMB client in
FreeBSD. It is likely that FreeBSD will want to support
the ability to operate as a desktop (and therefore client)
OS, even if this is not the primary niche into which it is
currently being driven by the developers.
4.0.0.0 End Of Document
==========================================================================
