From: Rick Macklem
To: John Baldwin
Cc: Rick Macklem, fs@freebsd.org
Date: Thu, 14 Mar 2013 18:38:48 -0400 (EDT)
Subject: Re: Deadlock in the NFS client
In-Reply-To: <201303141444.35740.jhb@freebsd.org>

John Baldwin wrote:
> On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote:
> > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote:
> > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > > > > John Baldwin wrote:
> > > > > > I ran into a machine that had a deadlock among certain files
> > > > > > on a given NFS mount today. I'm not sure how best to resolve
> > > > > > it, though it seems like perhaps there is a bug with how the
> > > > > > pool of nfsiod threads is managed. Anyway, more details on
> > > > > > the actual hang below. This was on 8.x with the old NFS
> > > > > > client, but I don't see anything in HEAD that would fix this.
> > > > > >
> > > > > > First note that the system was idle so it had dropped down
> > > > > > to only one nfsiod thread.
> > > > > >
> > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite
> > > > > more often. It seems to me that this snippet of code from
> > > > > nfs_asyncio() makes too weak an assumption:
> > > > >
> > > > > 	/*
> > > > > 	 * If none are free, we may already have an iod working on
> > > > > 	 * this mount point. If so, it will process our request.
> > > > > 	 */
> > > > > 	if (!gotiod) {
> > > > > 		if (nmp->nm_bufqiods > 0) {
> > > > > 			NFS_DPF(ASYNCIO,
> > > > > 			    ("nfs_asyncio: %d iods are already processing mount %p\n",
> > > > > 			     nmp->nm_bufqiods, nmp));
> > > > > 			gotiod = TRUE;
> > > > > 		}
> > > > > 	}
> > > > >
> > > > > It assumes that, since an nfsiod thread is processing some
> > > > > buffer for the mount, it will become available to do this one,
> > > > > which isn't true for your deadlock.
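(To make that concrete: the cycle is "a thread holds the directory vnode
lock and sleeps waiting for its queued buffer, while the only nfsiod is
doing a readdirplus that needs that same vnode lock". Below is a trivial
userland model of that lock/wait pattern. All of the names are made up
and it is obviously not the kernel code, just the shape of the hang; it
uses timed waits so it reports the stall instead of hanging forever.)

/*
 * Userland model of the deadlock, for illustration only (invented names,
 * not the kernel code).  The "reader" thread holds the directory's lock
 * and waits for the lone "nfsiod" thread to finish the buffer it queued;
 * the nfsiod's current job is a readdirplus that needs that same
 * directory lock, so neither side can make progress.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t dir_vnode_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t iod_queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  buf_done_cv = PTHREAD_COND_INITIALIZER;
static int buf_done = 0;

/* The lone nfsiod: its readdirplus work needs the directory lock. */
static void *
nfsiod(void *arg)
{
	struct timespec to;

	(void)arg;
	clock_gettime(CLOCK_REALTIME, &to);
	to.tv_sec += 2;
	if (pthread_mutex_timedlock(&dir_vnode_lock, &to) == ETIMEDOUT) {
		printf("nfsiod: stuck waiting for the directory lock, so"
		    " the queued buffer is never serviced\n");
		return (NULL);
	}
	/* Not reached in this scenario: service the buffer and signal. */
	pthread_mutex_unlock(&dir_vnode_lock);
	pthread_mutex_lock(&iod_queue_lock);
	buf_done = 1;
	pthread_cond_signal(&buf_done_cv);
	pthread_mutex_unlock(&iod_queue_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t iod;
	struct timespec to;

	/* Reader: lock the directory, then wait for the async buffer. */
	pthread_mutex_lock(&dir_vnode_lock);
	pthread_create(&iod, NULL, nfsiod, NULL);

	clock_gettime(CLOCK_REALTIME, &to);
	to.tv_sec += 3;
	pthread_mutex_lock(&iod_queue_lock);
	while (buf_done == 0) {
		if (pthread_cond_timedwait(&buf_done_cv, &iod_queue_lock,
		    &to) == ETIMEDOUT) {
			printf("reader: still holding the directory lock,"
			    " still waiting on the nfsiod -> deadlock\n");
			break;
		}
	}
	pthread_mutex_unlock(&iod_queue_lock);
	pthread_mutex_unlock(&dir_vnode_lock);
	pthread_join(iod, NULL);
	return (0);
}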
> > > > > I think the simple fix would be to recode nfs_asyncio() so
> > > > > that it only returns 0 if it finds an AVAILABLE nfsiod thread
> > > > > that it has assigned to do the I/O, getting rid of the above.
> > > > > The problem with doing this is that it may result in a lot
> > > > > more synchronous I/O (nfs_asyncio() returns EIO, so the caller
> > > > > does the I/O). Maybe more synchronous I/O could be avoided by
> > > > > allowing nfs_asyncio() to create a new thread even if the
> > > > > total is above nfs_iodmax. (I think this would require the
> > > > > fixed array to be replaced with a linked list and might result
> > > > > in a large number of nfsiod threads.) Maybe just having a
> > > > > large nfs_iodmax would be an adequate compromise?
> > > > >
> > > > > Does having a large # of nfsiod threads cause any serious
> > > > > problem for most systems these days?
> > > > >
> > > > > I'd be tempted to recode nfs_asyncio() as above and then,
> > > > > instead of nfs_iodmin and nfs_iodmax, I'd simply have:
> > > > > - a fixed number of nfsiod threads (this could be a tunable,
> > > > >   with the understanding that it should be large for good
> > > > >   performance)
> > > > >
> > > > I do not see how this would solve the deadlock itself. The
> > > > proposal would only allow the system to survive slightly longer
> > > > after the deadlock appeared. And, I think that allowing an
> > > > unbounded number of nfsiod threads is also fatal.
> > > >
I should mention that what I was thinking of above was more than just
getting rid of the snippet of code. It would have involved handing a
buffer directly to an available nfsiod thread (no queuing on the mount
point). That way there would never be a thread blocked waiting for a
queued buffer. However, when I was thinking about it a little after
posting, I came to a similar (but even simpler) conclusion than what
you've proposed. (See below and attached patch.)

> > > > The issue there is the LOR between the buffer lock and the vnode
> > > > lock. The buffer lock must always come after the vnode lock. The
> > > > problematic nfsiod thread, which locks the vnode, violates this
> > > > rule, because despite the LK_KERNPROC ownership of the buffer
> > > > lock, it is the thread which de facto owns the buffer (only that
> > > > thread can unlock it).
> > > >
> > > > A possible solution would be to pass LK_NOWAIT to nfs_nget()
> > > > from nfs_readdirplusrpc(). From my reading of the code,
> > > > nfs_nget() should be capable of correctly handling the lock
> > > > failure. And EBUSY would result in doit = 0, which should be
> > > > fine too.
> > > >
> > > > It is possible that EBUSY should be reset to 0, though.
> > > >
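(Fwiw, the shape of that change is just a trylock: if the vnode lock
can't be had without sleeping, skip the attribute update for that entry
and move on, since it is only an optimization anyway. A little userland
sketch of the idea below -- the names (fake_vnode, nget_nowait, and so
on) are made up and this is not the actual nfs_nget()/vget() code, just
the "EBUSY -> doit = 0" pattern:)

/*
 * Userland sketch of the LK_NOWAIT idea (invented names, not kernel
 * code).  readdirplus only wants the node lock in order to cache
 * attributes, which is an optimization, so a failed trylock just means
 * "skip this entry for now".
 */
#include <pthread.h>
#include <stdio.h>

struct fake_vnode {
	pthread_mutex_t	v_lock;
	int		v_attr;		/* stands in for cached attributes */
};

/*
 * Returns 0 with the node locked on success, or nonzero (EBUSY) if the
 * lock could not be taken without blocking.
 */
static int
nget_nowait(struct fake_vnode *vp)
{
	return (pthread_mutex_trylock(&vp->v_lock));
}

static void
readdirplus_entry(struct fake_vnode *vp, int new_attr)
{
	int doit = 1;

	if (nget_nowait(vp) != 0) {
		/* Equivalent of EBUSY -> doit = 0: lose the optimization. */
		doit = 0;
		printf("entry busy, skipping attribute update\n");
	}
	if (doit) {
		vp->v_attr = new_attr;	/* the "loaded attributes" */
		pthread_mutex_unlock(&vp->v_lock);
		printf("attributes cached\n");
	}
}

int
main(void)
{
	struct fake_vnode vn;

	pthread_mutex_init(&vn.v_lock, NULL);
	vn.v_attr = 0;

	readdirplus_entry(&vn, 42);	/* lock is free: attributes cached */

	pthread_mutex_lock(&vn.v_lock);	/* someone else holds the lock... */
	readdirplus_entry(&vn, 43);	/* ...so this entry is skipped */
	pthread_mutex_unlock(&vn.v_lock);

	pthread_mutex_destroy(&vn.v_lock);
	return (0);
}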
> > > Yes, thinking about this more, I do think the right answer is for
> > > readdirplus to do this. The only question I have is if it should
> > > do this always, or if it should do this only from the nfsiod
> > > thread. I believe you can't get this in the non-nfsiod case.
> >
> > I agree that it looks as if the workaround is only needed for the
> > nfsiod thread. On the other hand, it is not immediately obvious how
> > to detect that the current thread is an nfsiod. Probably a thread
> > flag should be set.
>
> OTOH, updating the attributes from readdir+ is only an optimization
> anyway, so just having it always do LK_NOWAIT is probably ok (and
> simple). Currently I'm trying to develop a test case to provoke this
> so I can test the fix, but no luck on that yet.
>
Well, when I was thinking about it a bit after the last email, I was
thinking "why bother having the nfsiod threads do readdirplus at all?".
The only reason to use the nfsiod threads is read-ahead. However, for
readdir this is problematic, since the read-ahead block can only be
requested once the correct directory offset cookie is known, and that
cookie comes from the reply for the previous directory block. In other
words, the read-ahead usually can't start until the previous block has
been read anyway. As such, why bother having the nfsiod threads do it?
They might be better used for reads and writes. (OpenBSD doesn't do
read-aheads for directories. In fact, they don't even keep directory
blocks in the buffer cache, although I'm not sure I'd suggest the
latter.)

It would be nice to compare the performance with and without the
attached patch. It might turn out that not using the nfsiod threads for
readdir performs just as well or better?

Anyhow, the attached trivial patch (which stops readdirplus from being
done by the nfsiod threads) might be worth trying. However, I don't see
a problem with modifying readdirplus to do a non-blocking vget(), except
that it might make an already convoluted function even more convoluted.
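(To spell out that cookie dependency: each READDIR/READDIRPLUS reply
carries the cookie needed to request the following block, so the RPCs
for one directory are inherently serial. A toy client loop below -- the
names (readdir_rpc, readdir_reply) are invented and nothing here is the
real client code; it just shows why there is little read-ahead to win:)

/*
 * Toy illustration of the directory offset cookie dependency (invented
 * names, not the NFS client code): the request for block N+1 cannot be
 * built until the reply for block N supplies its cookie, so an nfsiod
 * doing "read-ahead" would just be doing the next sequential read.
 */
#include <stdint.h>
#include <stdio.h>

struct readdir_reply {
	uint64_t next_cookie;	/* cookie for the following block */
	int	 eof;		/* no more blocks after this one */
};

/* Stand-in for one READDIRPLUS RPC starting at "cookie". */
static struct readdir_reply
readdir_rpc(uint64_t cookie)
{
	struct readdir_reply r;

	r.next_cookie = cookie + 1;	/* pretend server-chosen cookie */
	r.eof = (cookie >= 3);
	return (r);
}

int
main(void)
{
	struct readdir_reply r;
	uint64_t cookie = 0;

	do {
		/* The next RPC cannot be issued until this reply arrives. */
		r = readdir_rpc(cookie);
		printf("block at cookie %ju read, next cookie %ju\n",
		    (uintmax_t)cookie, (uintmax_t)r.next_cookie);
		cookie = r.next_cookie;
	} while (r.eof == 0);
	return (0);
}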
rick

> --
> John Baldwin

[attachment: nfsiod.patch]

--- fs/nfsclient/nfs_clbio.c.savy	2013-03-14 17:49:32.000000000 -0400
+++ fs/nfsclient/nfs_clbio.c	2013-03-14 18:02:53.000000000 -0400
@@ -1415,10 +1415,18 @@ ncl_asyncio(struct nfsmount *nmp, struct
 	 * Commits are usually short and sweet so lets save some cpu and
 	 * leave the async daemons for more important rpc's (such as reads
 	 * and writes).
+	 *
+	 * Readdirplus RPCs do vget()s to acquire the vnodes for entries
+	 * in the directory in order to update attributes. This can deadlock
+	 * with another thread that is waiting for async I/O to be done by
+	 * an nfsiod thread while holding a lock on one of these vnodes.
+	 * To avoid this deadlock, don't allow the async nfsiod threads to
+	 * perform Readdirplus RPCs.
 	 */
 	mtx_lock(&ncl_iod_mutex);
-	if (bp->b_iocmd == BIO_WRITE && (bp->b_flags & B_NEEDCOMMIT) &&
-	    (nmp->nm_bufqiods > ncl_numasync / 2)) {
+	if ((bp->b_iocmd == BIO_WRITE && (bp->b_flags & B_NEEDCOMMIT) &&
+	     (nmp->nm_bufqiods > ncl_numasync / 2)) ||
+	    (bp->b_vp->v_type == VDIR && (nmp->nm_flag & NFSMNT_RDIRPLUS))) {
 		mtx_unlock(&ncl_iod_mutex);
 		return(EIO);
 	}