From: Rick Macklem
To: John Baldwin
Cc: Rick Macklem, fs@freebsd.org
Date: Thu, 21 Mar 2013 19:24:24 -0400 (EDT)
Subject: Re: Deadlock in the NFS client

John Baldwin wrote:
> On Monday, March 18, 2013 9:00:54 pm Rick Macklem wrote:
> > John Baldwin wrote:
> > > On Friday, March 15, 2013 10:03:39 pm Rick Macklem wrote:
> > > > John Baldwin wrote:
> > > > > On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote:
> > > > > > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote:
> > > > > > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> > > > > > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > > > > > > > > John Baldwin wrote:
> > > > > > > > > > I ran into a machine that had a deadlock among certain files on a given NFS mount today. I'm not sure how best to resolve it, though it seems like perhaps there is a bug with how the pool of nfsiod threads is managed. Anyway, more details on the actual hang below. This was on 8.x with the old NFS client, but I don't see anything in HEAD that would fix this.
> > > > > > > > > >
> > > > > > > > > > First note that the system was idle, so it had dropped down to only one nfsiod thread.
> > > > > > > > >
> > > > > > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often.
> > > > > > > > > It seems to me that this snippet of code from nfs_asyncio() makes too weak an assumption:
> > > > > > > > >
> > > > > > > > > 	/*
> > > > > > > > > 	 * If none are free, we may already have an iod working on this mount
> > > > > > > > > 	 * point. If so, it will process our request.
> > > > > > > > > 	 */
> > > > > > > > > 	if (!gotiod) {
> > > > > > > > > 		if (nmp->nm_bufqiods > 0) {
> > > > > > > > > 			NFS_DPF(ASYNCIO,
> > > > > > > > > 			    ("nfs_asyncio: %d iods are already processing mount %p\n",
> > > > > > > > > 			    nmp->nm_bufqiods, nmp));
> > > > > > > > > 			gotiod = TRUE;
> > > > > > > > > 		}
> > > > > > > > > 	}
> > > > > > > > >
> > > > > > > > > It assumes that, since an nfsiod thread is processing some buffer for the mount, it will become available to do this one, which isn't true for your deadlock.
> > > > > > > > >
> > > > > > > > > I think the simple fix would be to recode nfs_asyncio() so that it only returns 0 if it finds an AVAILABLE nfsiod thread that it has assigned to do the I/O, getting rid of the above. The problem with doing this is that it may result in a lot more synchronous I/O (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more synchronous I/O could be avoided by allowing nfs_asyncio() to create a new thread even if the total is above nfs_iodmax. (I think this would require the fixed array to be replaced with a linked list and might result in a large number of nfsiod threads.) Maybe just having a large nfs_iodmax would be an adequate compromise?
> > > > > > > > >
> > > > > > > > > Does having a large # of nfsiod threads cause any serious problem for most systems these days?
> > > > > > > > >
> > > > > > > > > I'd be tempted to recode nfs_asyncio() as above and then, instead of nfs_iodmin and nfs_iodmax, I'd simply have:
> > > > > > > > > - a fixed number of nfsiod threads (this could be a tunable, with the understanding that it should be large for good performance)
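
[A minimal sketch of the recode described just above -- have nfs_asyncio() return EIO unless an idle nfsiod was actually found and assigned -- reusing the gotiod/nmp/NFS_DPF names from the quoted snippet; this is only an illustration of the idea, not an actual patch:]

	/*
	 * Sketch: drop the "an nfsiod is already working on this mount,
	 * so it will eventually get to our buffer" assumption.  If no
	 * idle nfsiod was found and assigned above, fail the async
	 * request so the caller falls back to synchronous I/O instead
	 * of queueing a buffer that no thread may ever pick up.
	 */
	if (!gotiod) {
		NFS_DPF(ASYNCIO,
		    ("nfs_asyncio: no idle nfsiods for mount %p, "
		    "falling back to synchronous I/O\n", nmp));
		return (EIO);
	}

[As the rest of the thread notes, the trade-off is more synchronous I/O whenever every nfsiod is busy, which is why a larger nfs_iodmax or more nfsiod threads is also discussed.]
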
> > > > > > > >
> > > > > > > > I do not see how this would solve the deadlock itself. The proposal would only allow the system to survive slightly longer after the deadlock appeared. And I think that allowing an unbounded number of nfsiod threads is also fatal.
> > > > > > > >
> > > > > > > > The issue there is the LOR between buffer lock and vnode lock. The buffer lock must always come after the vnode lock. The problematic nfsiod thread, which locks the vnode, violates this rule, because despite the LK_KERNPROC ownership of the buffer lock, it is the thread which de facto owns the buffer (only that thread can unlock it).
> > > > > > > >
> > > > > > > > A possible solution would be to pass LK_NOWAIT to nfs_nget() from nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should be capable of correctly handling the lock failure. And EBUSY would result in doit = 0, which should be fine too.
> > > > > > > >
> > > > > > > > It is possible that EBUSY should be reset to 0, though.
> > > > > > >
> > > > > > > Yes, thinking about this more, I do think the right answer is for readdirplus to do this. The only question I have is if it should do this always, or if it should do this only from the nfsiod thread. I believe you can't get this in the non-nfsiod case.
> > > > > >
> > > > > > I agree that it looks as if the workaround is only needed for the nfsiod thread. On the other hand, it is not immediately obvious how to detect that the current thread is an nfsiod. Probably a thread flag should be set.
> > > > >
> > > > > OTOH, updating the attributes from readdir+ is only an optimization anyway, so just having it always do LK_NOWAIT is probably ok (and simple). Currently I'm trying to develop a test case to provoke this so I can test the fix, but no luck on that yet.
> > > > >
> > > > > --
> > > > > John Baldwin
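
[A rough sketch of the LK_NOWAIT approach discussed above, as it might look in the readdirplus attribute-loading path; the surrounding names (vp, fhp, fhsize, np) are assumed for illustration -- only doit, nfs_nget(), LK_NOWAIT and EBUSY come from the discussion itself -- and this is not the committed change:]

	/*
	 * Sketch: try to get the entry's nfsnode without sleeping on the
	 * vnode lock.  If another thread already holds it (possibly one
	 * waiting on a buffer that this nfsiod is responsible for), skip
	 * the attribute update for this entry rather than risk the
	 * buffer/vnode LOR described above.
	 */
	error = nfs_nget(vp->v_mount, fhp, fhsize, &np,
	    LK_EXCLUSIVE | LK_NOWAIT);
	if (error == EBUSY) {
		doit = 0;	/* the cached attributes are only an optimization */
		error = 0;	/* reset EBUSY so the readdir itself still succeeds */
	}

[Whether this should be done unconditionally or only when the calling thread is an nfsiod is the open question in the exchange above.]
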
> > > >
> > > > Just fyi, ignore my comment about the second version of the patch (the one that disables readdirplus in the nfsiod threads) running faster. It was just that when I tested the 2nd patch, the server's caches were primed. Oops.
> > > >
> > > > However, so far the minimal testing I've done has been essentially performance neutral between the unpatched and patched versions.
> > > >
> > > > Hopefully John has a convenient way to do some performance testing, since I won't be able to do much until the end of April.
> > >
> > > Performance testing I don't really have available.
> >
> > All I've been doing are things like (assuming /mnt is an NFSv3 mount point):
> > # cd /mnt
> > # time ls -lR > /dev/null
> > # time ls -R > /dev/null
> > - for both a patched and unpatched kernel
> > (Oh, and you need to keep the server's caches pretty consistent. For me, once I run the test once, the server caches end up primed and then the times seem to be pretty consistent, but I am only using old laptops.)
> >
> > Maybe you could do something like the above? (I'll try some finds too.)
> > (I don't really have any clever ideas for other tests.)
>
> I've been doing finds across trees on different servers (one mounted with rdirplus and one without). I've compared the current behavior (blocking lock in rdirplus + readahead) to disabling readahead for dirs as well as just using non-blocking locks in the nfsiod case in readdir+ processing. I also ran my tests with various numbers of concurrent jobs (up to 8, since this is an 8-core machine). The three cases were all basically the same, and the two possible fixes were no different, so I think you can fix this however you want.
>
> --
> John Baldwin

Yep, same here. (Basically performance neutral for what I've tried. At most 0.5% slower without using the nfsiods, and that might have been within the statistical variance of the test runs.)

I'll commit the one that disables read-ahead for rdirplus in April, if that's ok with everyone. That way, if there is a big performance hit for some situation, it can be avoided by taking the "rdirplus" option off the mount.

Thanks for your help with this, rick