Date:      Sat, 3 Aug 1996 17:14:32 +0100 (BST)
From:      Doug Rabson <dfr@nlsys.demon.co.uk>
To:        Terry Lambert <terry@lambert.org>
Cc:        Doug Rabson <dfr@render.com>, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-current@freebsd.org
Subject:   Re: NFS Diskless Dispare...
Message-ID:  <Pine.BSI.3.95.960803163842.1451A-100000@nlsys.demon.co.uk>
In-Reply-To: <199608022033.NAA06190@phaeton.artisoft.com>

On Fri, 2 Aug 1996, Terry Lambert wrote:

> > > The kernel is not currently reentrant, and I think any underlying
> > > FS from an export will cause the block on the server.
> > 
> > I am thinking of the VXLOCK case.  I have been trying to construct a
> > scenario where the nfs client code ends up with an nfsnode which has been
> > freed by vclean() due to the lack of node locks.  I haven't managed it yet
> > but I am sure there is one.
> 
> I think the client side locking is maybe broken.  There are a lot of
> evil things in the NFS client code, and you are right about the VXLOCK.
> 
> I'd like to see each vnode reference treated as a counting semaphore
> increment, including the directory name cache references for the things.

In lite2, there is a fallback implementation of VOP_LOCK which nfs uses.
This allows multiple locking processes but keeps a count of how many are
locking the vnode.  Vclean() blocks until this count drains back to zero
and then takes an exclusive lock on the vnode.  I think this should
improve the robustness somewhat.
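
For illustration, a rough user-space sketch of that kind of counted lock
with an exclusive drain path (the names are mine, not the actual lite2
code):

    #include <pthread.h>

    struct vnlock {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        int             shared;     /* number of shared holders */
        int             draining;   /* set while a vclean-style caller holds it */
    };

    void
    vnlock_shared(struct vnlock *lp)
    {
        pthread_mutex_lock(&lp->mtx);
        while (lp->draining)                /* wait out an exclusive holder */
            pthread_cond_wait(&lp->cv, &lp->mtx);
        lp->shared++;
        pthread_mutex_unlock(&lp->mtx);
    }

    void
    vnlock_shared_done(struct vnlock *lp)
    {
        pthread_mutex_lock(&lp->mtx);
        if (--lp->shared == 0)
            pthread_cond_broadcast(&lp->cv);    /* count drained back to zero */
        pthread_mutex_unlock(&lp->mtx);
    }

    void
    vnlock_exclusive(struct vnlock *lp)         /* what vclean() would take */
    {
        pthread_mutex_lock(&lp->mtx);
        while (lp->shared > 0 || lp->draining)
            pthread_cond_wait(&lp->cv, &lp->mtx);
        lp->draining = 1;
        pthread_mutex_unlock(&lp->mtx);
    }

    void
    vnlock_exclusive_done(struct vnlock *lp)
    {
        pthread_mutex_lock(&lp->mtx);
        lp->draining = 0;
        pthread_cond_broadcast(&lp->cv);
        pthread_mutex_unlock(&lp->mtx);
    }

The point is just that shared holders are counted rather than excluded
from each other, and the exclusive path sleeps until the count has
drained back to zero.  (The mutex and condition variable would need the
usual pthread_*_init() calls before use.)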

I am not quite convinced that this is the main cause of the current nfs
instabilities, as I can't come up with a killer scenario that doesn't
involve forcibly unmounting an active NFS client filesystem, which
doesn't normally happen in real life.

I just grepped for uses of VOP_LOCK in the kernel and there are a few
places in the vm system which appear to be using the vnode lock to
protect critical sections of code.  Have a look at vm_object_terminate()
and vm_object_page_clean() and tell me what would happen if the VOP_LOCK
is not exclusive.
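
To make the point concrete, here is a toy user-space demonstration of
why a merely counted lock does not protect a critical section; it has
nothing to do with the real vm code, it just shows the locking pattern:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static int holders;             /* the counted "shared lock" */
    static long state;              /* stands in for the object's page lists */

    static void
    shared_lock(void)
    {
        pthread_mutex_lock(&mtx);
        holders++;                  /* counted, but nothing is excluded */
        pthread_mutex_unlock(&mtx);
    }

    static void
    shared_unlock(void)
    {
        pthread_mutex_lock(&mtx);
        holders--;
        pthread_mutex_unlock(&mtx);
    }

    static void *
    cleaner(void *arg)
    {
        long i;

        (void)arg;
        shared_lock();              /* both threads get in at once */
        for (i = 0; i < 1000000; i++)
            state++;                /* the unprotected "critical section" */
        shared_unlock();
        return (NULL);
    }

    int
    main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, cleaner, NULL);
        pthread_create(&t2, NULL, cleaner, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("expected 2000000, got %ld\n", state);
        return (0);
    }

Both threads pass shared_lock() at the same time, so the final count
usually comes out short; the same sort of thing would happen to whatever
state those vm routines are walking if their VOP_LOCK is only shared.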

> 
> The vclean code is evil and redundant and redundant, but without moving
> the vnode allocation to per FS vnodes (I've mentioned this before), there
> is very little you can do.  It's a buffer cache lose to say "no" to a
> page that's in core, but which does not have a vnode referencing it,
> so you have to reload it from disk even though a perfectly good copy
> is already in memory.  8-(.

You have to start reusing vnodes sometime.  Whether it means reusing them
within a filesystem or across a global pool, it has to happen.  Even
reusing a vnode within a filesystem would involve something similar to
vclean() surely.  I don't understand the VM system well enough to judge
whether dropping a few valid pages from old vnodes is a real problem in
performance terms.

> 
> 
> > There is definitely a problem with multiple processes appending to a file
> > over NFS due to nfs_write not being serialised by the lock.
> 
> This is a non-problem.  The file offset is maintained on the client
> and enforced in the uio... the offsets are part of the uio, so it is
> roughly equivalent to write( channel, buf, size, channel_offset).  They
> should be atomic.  I think the problem is that a previous append is
> not forcing an update to the file length for multiple appends.  It's a
> protocol problem that's really only resolvable with advisory locking:
> 
> client 1:	get file size		->
> 					<-	size is xxx	:server
> client 2:	get file size		->
> 					<-	size is xxx	:server
> client 1:	<append write>
> 		write at offset xxx	->
> client 2:	<append write>			[ERROR case]
> 		write at offset xxx	->

Consider the following timeline:

client 1				client 2

enter nfs_write for an append write
set uio->uio_offset to np->n_size
prepare to write first block
update np->n_size for first partial
write
sleep waiting for a buf
					enter nfs_write for an append
					set uio->uio_offset to np->n_size
					prepare to write first block
					update np->n_size
					get buf
					write block and wait for reply
wake up, finish writing first buf
write second buf overlapping client2's
first buf.

The problem is that an append write which stretches across more than
one buf is not atomic, because the calls to nfs_write are not
serialised as they normally would be by the vnode locks.

This is a real world problem.  Karl Denninger had this problem last year
with an http server updating log files across NFS.  He eventually moved
the logs to a local disk.
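
As a hypothetical way to reproduce it from user space (this is not
Karl's setup, just an illustrative test): run two copies of something
like the following on the same NFS client, appending to one file, then
check for missing or overwritten records:

    /*
     * Append-race tester (illustrative only).  Each copy tags its
     * records, so overwritten or torn records are easy to spot.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define RECLEN  64

    int
    main(int argc, char **argv)
    {
        char rec[RECLEN + 1];
        int fd, i;

        if (argc != 3) {
            fprintf(stderr, "usage: %s file tag\n", argv[0]);
            exit(1);
        }
        fd = open(argv[1], O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            exit(1);
        }
        for (i = 0; i < 1000; i++) {
            /* One fixed-size record: tag, sequence number, padding. */
            snprintf(rec, sizeof(rec), "%-10.10s %06d %045d\n", argv[2], i, 0);
            if (write(fd, rec, RECLEN) != RECLEN) {
                perror("write");
                exit(1);
            }
        }
        close(fd);
        return (0);
    }

On a local filesystem you end up with 2000 intact, fixed-size records.
Over NFS, when both processes derive the same offset from np->n_size,
one process's record can overwrite the other's.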

> 
> I hesitate to implement in-band reliance on a working locking
> implementation; I'd prefer an out-of-band contention resolution of
> some kind by the clients.
> 
> Technically, you could argue the same potential problem exists in a
> local machine for writev() interleaving by two processes.  The length
> of the file being appended will monotonically increase, but the write
> order is not guaranteed to not be mixed.  I think the same solution
> is applicable: out of band semaphoring between the contending processes.

The write order for writev can be mixed, but having data written by
two normal write syscalls from two different processes mixed up is
something which can and should be fixed IMHO.

--
Doug Rabson				Mail:  dfr@nlsys.demon.co.uk
					Phone: +44 181 951 1891




