Date:      Thu, 21 Mar 2013 19:24:24 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
Subject:   Re: Deadlock in the NFS client
Message-ID:  <1881768576.34111.1363908264039.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201303211600.46884.jhb@freebsd.org>

John Baldwin wrote:
> On Monday, March 18, 2013 9:00:54 pm Rick Macklem wrote:
> > John Baldwin wrote:
> > > On Friday, March 15, 2013 10:03:39 pm Rick Macklem wrote:
> > > > John Baldwin wrote:
> > > > > On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote:
> > > > > > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote:
> > > > > > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> > > > > > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > > > > > > > > John Baldwin wrote:
> > > > > > > > > > I ran into a machine that had a deadlock among
> > > > > > > > > > certain files on a given NFS mount today. I'm not
> > > > > > > > > > sure how best to resolve it, though it seems like
> > > > > > > > > > perhaps there is a bug with how the pool of nfsiod
> > > > > > > > > > threads is managed. Anyway, more details on the
> > > > > > > > > > actual hang below. This was on 8.x with the old
> > > > > > > > > > NFS client, but I don't see anything in HEAD that
> > > > > > > > > > would fix this.
> > > > > > > > > >
> > > > > > > > > > First note that the system was idle so it had
> > > > > > > > > > dropped down to only one nfsiod thread.
> > > > > > > > > >
> > > > > > > > > Hmm, I see the problem and I'm a bit surprised it
> > > > > > > > > doesn't bite more often. It seems to me that this
> > > > > > > > > snippet of code from nfs_asyncio() makes too weak an
> > > > > > > > > assumption:
> > > > > > > > > 	/*
> > > > > > > > > 	 * If none are free, we may already have an iod
> > > > > > > > > 	 * working on this mount point. If so, it will
> > > > > > > > > 	 * process our request.
> > > > > > > > > 	 */
> > > > > > > > > 	if (!gotiod) {
> > > > > > > > > 		if (nmp->nm_bufqiods > 0) {
> > > > > > > > > 			NFS_DPF(ASYNCIO,
> > > > > > > > > 	("nfs_asyncio: %d iods are already processing mount %p\n",
> > > > > > > > > 			    nmp->nm_bufqiods, nmp));
> > > > > > > > > 			gotiod = TRUE;
> > > > > > > > > 		}
> > > > > > > > > 	}
> > > > > > > > > It assumes that, since an nfsiod thread is
> > > > > > > > > processing some buffer for the mount, it will become
> > > > > > > > > available to do this one, which isn't true for your
> > > > > > > > > deadlock.
> > > > > > > > >
> > > > > > > > > I think the simple fix would be to recode
> > > > > > > > > nfs_asyncio() so that it only returns 0 if it finds
> > > > > > > > > an AVAILABLE nfsiod thread that it has assigned to
> > > > > > > > > do the I/O, getting rid of the above. The problem
> > > > > > > > > with doing this is that it may result in a lot more
> > > > > > > > > synchronous I/O (nfs_asyncio() returns EIO, so the
> > > > > > > > > caller does the I/O). Maybe more synchronous I/O
> > > > > > > > > could be avoided by allowing nfs_asyncio() to create
> > > > > > > > > a new thread even if the total is above nfs_iodmax.
> > > > > > > > > (I think this would require the fixed array to be
> > > > > > > > > replaced with a linked list and might result in a
> > > > > > > > > large number of nfsiod threads.) Maybe just having a
> > > > > > > > > large nfs_iodmax would be an adequate compromise?
> > > > > > > > >
> > > > > > > > > Does having a large # of nfsiod threads cause any
> > > > > > > > > serious problem for most systems these days?
> > > > > > > > >
> > > > > > > > > I'd be tempted to recode nfs_asyncio() as above and
> > > > > > > > > then, instead of nfs_iodmin and nfs_iodmax, I'd
> > > > > > > > > simply have:
> > > > > > > > > - a fixed number of nfsiod threads (this could be a
> > > > > > > > >   tunable, with the understanding that it should be
> > > > > > > > >   large for good performance)
> > > > > > > > >
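[A rough, untested sketch of the recoding I was suggesting above: drop
the nm_bufqiods fallback entirely, so nfs_asyncio() only returns 0
when an idle nfsiod has actually been assigned the buffer and returns
EIO otherwise, making the caller do the I/O synchronously. This is an
illustration only, not a patch:]

	/*
	 * Sketch: if no idle nfsiod was found above, do not assume a
	 * busy iod on this mount will get to our buffer; hand the I/O
	 * back to the caller instead.
	 */
	if (!gotiod) {
		NFS_DPF(ASYNCIO,
		    ("nfs_asyncio: no idle iods for mount %p, doing sync I/O\n",
		    nmp));
		return (EIO);
	}
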
> > > > > > > >
> > > > > > > > I do not see how this would solve the deadlock
> > > > > > > > itself. The proposal would only allow the system to
> > > > > > > > survive slightly longer after the deadlock appeared.
> > > > > > > > And, I think that allowing an unbounded number of
> > > > > > > > nfsiod threads is also fatal.
> > > > > > > >
> > > > > > > > The issue there is the LOR between the buffer lock
> > > > > > > > and the vnode lock. The buffer lock must always come
> > > > > > > > after the vnode lock. The problematic nfsiod thread,
> > > > > > > > which locks the vnode, violates this rule, because
> > > > > > > > despite the LK_KERNPROC ownership of the buffer lock,
> > > > > > > > it is the thread which de facto owns the buffer (only
> > > > > > > > that thread can unlock it).
> > > > > > > >
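[Spelling out the cycle Kostik describes; a sketch of the two threads
involved, not code from the tree:]

	/*
	 * thread T1 (say, another read of the directory):
	 *	vn_lock(dvp)           - holds the vnode lock
	 *	bread() -> BUF_LOCK()  - sleeps, the read-ahead bp is busy
	 *
	 * nfsiod thread doing the readdirplus read-ahead of that bp:
	 *	de facto owns bp (LK_KERNPROC, only it will release it)
	 *	nfs_nget() -> vn_lock(dvp) - sleeps behind T1
	 *
	 * T1 waits on the buffer the nfsiod owns and the nfsiod waits
	 * on the vnode lock T1 holds, i.e. the nfsiod takes the locks
	 * in the reverse of the required vnode-then-buffer order.
	 */
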
> > > > > > > > A possible solution would be to pass LK_NOWAIT to
> > > > > > > > nfs_nget() from nfs_readdirplusrpc(). From my reading
> > > > > > > > of the code, nfs_nget() should be capable of correctly
> > > > > > > > handling the lock failure. And EBUSY would result in
> > > > > > > > doit = 0, which should be fine too.
> > > > > > > >
> > > > > > > > It is possible that EBUSY should be reset to 0, though.
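[A rough sketch of what that might look like in nfs_readdirplusrpc(),
assuming nfs_nget() passes its flags argument through to the vnode
locking code; variable names are from memory, so treat this as an
illustration rather than the actual diff:]

		/*
		 * Sketch: ask for the vnode non-blocking so an nfsiod
		 * that owns the directory buffer never sleeps on a
		 * vnode lock here.
		 */
		error = nfs_nget(vp->v_mount, fhp, fhsize, &np,
		    LK_EXCLUSIVE | LK_NOWAIT);
		if (error == EBUSY) {
			/* Lock was busy; just skip the attr update. */
			doit = 0;
			error = 0;
		} else if (error == 0)
			newvp = NFSTOV(np);
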
> > > > > > >
> > > > > > > Yes, thinking about this more, I do think the right
> > > > > > > answer is for readdirplus to do this. The only question
> > > > > > > I have is if it should do this always, or if it should
> > > > > > > do this only from the nfsiod thread. I believe you can't
> > > > > > > get this in the non-nfsiod case.
> > > > > >
> > > > > > I agree that it looks as if the workaround is only needed
> > > > > > for the nfsiod thread. On the other hand, it is not
> > > > > > immediately obvious how to detect that the current thread
> > > > > > is an nfsiod daemon. Probably a thread flag should be set.
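[If we did want to restrict the workaround to the nfsiods, a thread
flag would be simple enough. A hypothetical sketch (TDP_NFSIOD does
not exist today; it would be a new td_pflags bit):]

	/* In nfssvc_iod(), when the daemon thread starts up: */
	curthread->td_pflags |= TDP_NFSIOD;	/* hypothetical flag */

	/* In the readdirplus code, when choosing the lock flags: */
	lkflags = LK_EXCLUSIVE;
	if (curthread->td_pflags & TDP_NFSIOD)
		lkflags |= LK_NOWAIT;
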
> > > > >
> > > > > OTOH, updating the attributes from readdir+ is only an
> > > > > optimization anyway, so just having it always do LK_NOWAIT
> > > > > is probably ok (and simple). Currently I'm trying to develop
> > > > > a test case to provoke this so I can test the fix, but no
> > > > > luck on that yet.
> > > > >
> > > > > --
> > > > > John Baldwin
> > > > Just fyi, ignore my comment that the second version of the
> > > > patch (the one that keeps the nfsiod threads from doing
> > > > readdirplus) runs faster. It was just that when I tested the
> > > > 2nd patch, the server's caches were primed. Oops.
> > > >
> > > > However, so far the minimal testing I've done has been
> > > > essentially performance neutral between the unpatched and
> > > > patched versions.
> > > >
> > > > Hopefully John has a convenient way to do some performance
> > > > testing, since I won't be able to do much until the end of
> > > > April.
> > >
> > > Performance testing I don't really have available.
> > All I've been doing are things like (assuming /mnt is an NFSv3 mount
> > point):
> > # cd /mnt
> > # time ls -lR > /dev/null
> > # time ls -R > /dev/null
> > - for both a patched and unpatched kernel
> > (Oh, and you need to keep the server's caches pretty consistent.
> >  For me, once I've run the test once, the server caches end up
> >  primed and then the times seem to be pretty consistent, but I am
> >  only using old laptops.)
> >
> > Maybe you could do something like the above? (I'll try some finds
> > too.)
> > (I don't really have any clever ideas for other tests.)
> 
> I've been running find across trees on different servers (one
> mounted with rdirplus and one without). I've compared the current
> behavior (blocking lock in rdirplus + readahead) to disabling
> readahead for dirs as well as just using non-blocking locks in the
> nfsiod case in readdir+ processing. I also ran my tests with various
> numbers of concurrent jobs (up to 8 since this is an 8-core
> machine). The three cases were all basically the same, and the two
> possible fixes were no different, so I think you can fix this
> however you want.
> 
> --
> John Baldwin
Yep, same here. (Basically performance neutral for what I've tried. At
most 0.5% slower without using the nfsiods and that might have been
within the statistical variance of the test runs.)

I'll commit the one that disables read-ahead for rdirplus in April, if
that's ok with everyone. That way if there is a big performance hit
for some situation, it can be avoided by taking the "rdirplus" option
off the mount.

Thanks for your help with this, rick


