From: Rick Macklem
To: John Baldwin
Cc: Rick Macklem, fs@freebsd.org
Date: Thu, 21 Mar 2013 19:24:24 -0400 (EDT)
Subject: Re: Deadlock in the NFS client

John Baldwin wrote:
> On Monday, March 18, 2013 9:00:54 pm Rick Macklem wrote:
> > John Baldwin wrote:
> > > On Friday, March 15, 2013 10:03:39 pm Rick Macklem wrote:
> > > > John Baldwin wrote:
> > > > > On Thursday, March 14, 2013 1:22:39 pm Konstantin Belousov wrote:
> > > > > > On Thu, Mar 14, 2013 at 10:57:13AM -0400, John Baldwin wrote:
> > > > > > > On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> > > > > > > > On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > > > > > > > > John Baldwin wrote:
> > > > > > > > > > I ran into a machine that had a deadlock among certain files on a given NFS mount today. I'm not sure how best to resolve it, though it seems like perhaps there is a bug with how the pool of nfsiod threads is managed. Anyway, more details on the actual hang below. This was on 8.x with the old NFS client, but I don't see anything in HEAD that would fix this.
> > > > > > > > > >
> > > > > > > > > > First note that the system was idle, so it had dropped down to only one nfsiod thread.
> > > > > > > > >
> > > > > > > > > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often.
> > > > > > > > > It seems to me that this snippet of code from nfs_asyncio() makes too weak an assumption:
> > > > > > > > >
> > > > > > > > > 	/*
> > > > > > > > > 	 * If none are free, we may already have an iod working on this mount
> > > > > > > > > 	 * point. If so, it will process our request.
> > > > > > > > > 	 */
> > > > > > > > > 	if (!gotiod) {
> > > > > > > > > 		if (nmp->nm_bufqiods > 0) {
> > > > > > > > > 			NFS_DPF(ASYNCIO,
> > > > > > > > > 			    ("nfs_asyncio: %d iods are already processing mount %p\n",
> > > > > > > > > 			    nmp->nm_bufqiods, nmp));
> > > > > > > > > 			gotiod = TRUE;
> > > > > > > > > 		}
> > > > > > > > > 	}
> > > > > > > > >
> > > > > > > > > It assumes that, since an nfsiod thread is processing some buffer for the mount, it will become available to do this one, which isn't true for your deadlock.
> > > > > > > > >
> > > > > > > > > I think the simple fix would be to recode nfs_asyncio() so that it only returns 0 if it finds an AVAILABLE nfsiod thread that it has assigned to do the I/O, getting rid of the above. The problem with doing this is that it may result in a lot more synchronous I/O (nfs_asyncio() returns EIO, so the caller does the I/O). Maybe more synchronous I/O could be avoided by allowing nfs_asyncio() to create a new thread even if the total is above nfs_iodmax. (I think this would require the fixed array to be replaced with a linked list and might result in a large number of nfsiod threads.) Maybe just having a large nfs_iodmax would be an adequate compromise?
> > > > > > > > >
> > > > > > > > > Does having a large # of nfsiod threads cause any serious problem for most systems these days?
> > > > > > > > >
> > > > > > > > > I'd be tempted to recode nfs_asyncio() as above and then, instead of nfs_iodmin and nfs_iodmax, I'd simply have:
> > > > > > > > > - a fixed number of nfsiod threads (this could be a tunable, with the understanding that it should be large for good performance)
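
[A minimal sketch of the recode described just above -- have nfs_asyncio() return EIO unless an idle nfsiod was actually found and assigned -- reusing the gotiod/nmp/NFS_DPF names from the quoted snippet; this is only an illustration of the idea, not an actual patch:]

	/*
	 * Sketch: drop the "an nfsiod is already working on this mount,
	 * so it will eventually get to our buffer" assumption.  If no
	 * idle nfsiod was found and assigned above, fail the async
	 * request so the caller falls back to synchronous I/O instead
	 * of queueing a buffer that no thread may ever pick up.
	 */
	if (!gotiod) {
		NFS_DPF(ASYNCIO,
		    ("nfs_asyncio: no idle nfsiods for mount %p, "
		    "falling back to synchronous I/O\n", nmp));
		return (EIO);
	}

[As the rest of the thread notes, the trade-off is more synchronous I/O whenever every nfsiod is busy, which is why a larger nfs_iodmax or more nfsiod threads is also discussed.]
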
> > > > > > > >
> > > > > > > > I do not see how this would solve the deadlock itself. The proposal would only allow the system to survive slightly longer after the deadlock appeared. And I think that allowing an unbounded number of nfsiod threads is also fatal.
> > > > > > > >
> > > > > > > > The issue there is the LOR between buffer lock and vnode lock. The buffer lock must always come after the vnode lock. The problematic nfsiod thread, which locks the vnode, violates this rule, because despite the LK_KERNPROC ownership of the buffer lock, it is the thread which de facto owns the buffer (only that thread can unlock it).
> > > > > > > >
> > > > > > > > A possible solution would be to pass LK_NOWAIT to nfs_nget() from nfs_readdirplusrpc(). From my reading of the code, nfs_nget() should be capable of correctly handling the lock failure. And EBUSY would result in doit = 0, which should be fine too.
> > > > > > > >
> > > > > > > > It is possible that EBUSY should be reset to 0, though.
> > > > > > >
> > > > > > > Yes, thinking about this more, I do think the right answer is for readdirplus to do this. The only question I have is if it should do this always, or if it should do this only from the nfsiod thread. I believe you can't get this in the non-nfsiod case.
> > > > > >
> > > > > > I agree that it looks as if the workaround is only needed for the nfsiod thread. On the other hand, it is not immediately obvious how to detect that the current thread is an nfsiod. Probably a thread flag should be set.
> > > > >
> > > > > OTOH, updating the attributes from readdir+ is only an optimization anyway, so just having it always do LK_NOWAIT is probably ok (and simple). Currently I'm trying to develop a test case to provoke this so I can test the fix, but no luck on that yet.
> > > > >
> > > > > --
> > > > > John Baldwin
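
[A rough sketch of the LK_NOWAIT approach discussed above, as it might look in the readdirplus attribute-loading path; the surrounding names (vp, fhp, fhsize, np) are assumed for illustration -- only doit, nfs_nget(), LK_NOWAIT and EBUSY come from the discussion itself -- and this is not the committed change:]

	/*
	 * Sketch: try to get the entry's nfsnode without sleeping on the
	 * vnode lock.  If another thread already holds it (possibly one
	 * waiting on a buffer that this nfsiod is responsible for), skip
	 * the attribute update for this entry rather than risk the
	 * buffer/vnode LOR described above.
	 */
	error = nfs_nget(vp->v_mount, fhp, fhsize, &np,
	    LK_EXCLUSIVE | LK_NOWAIT);
	if (error == EBUSY) {
		doit = 0;	/* the cached attributes are only an optimization */
		error = 0;	/* reset EBUSY so the readdir itself still succeeds */
	}

[Whether this should be done unconditionally or only when the calling thread is an nfsiod is the open question in the exchange above.]
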
> > > >
> > > > Just fyi, ignore my comment about the second version of the patch (the one that disables readdirplus in the nfsiod threads) running faster. It was just that when I tested the 2nd patch, the server's caches were primed. Oops.
> > > >
> > > > However, so far the minimal testing I've done has been essentially performance neutral between the unpatched and patched versions.
> > > >
> > > > Hopefully John has a convenient way to do some performance testing, since I won't be able to do much until the end of April.
> > >
> > > Performance testing I don't really have available.
> >
> > All I've been doing are things like (assuming /mnt is an NFSv3 mount point):
> > # cd /mnt
> > # time ls -lR > /dev/null
> > # time ls -R > /dev/null
> > - for both a patched and unpatched kernel
> > (Oh, and you need to keep the server's caches pretty consistent. For me, once I run the test once, the server caches end up primed and then the times seem to be pretty consistent, but I am only using old laptops.)
> >
> > Maybe you could do something like the above? (I'll try some finds too.)
> > (I don't really have any clever ideas for other tests.)
>
> I've been doing finds across trees on different servers (one mounted with rdirplus and one without). I've compared the current behavior (blocking lock in rdirplus + readahead) to disabling readahead for dirs as well as just using non-blocking locks in the nfsiod case in readdir+ processing. I also ran my tests with various numbers of concurrent jobs (up to 8, since this is an 8-core machine). The three cases were all basically the same, and the two possible fixes were no different, so I think you can fix this however you want.
>
> --
> John Baldwin

Yep, same here. (Basically performance neutral for what I've tried. At most 0.5% slower without using the nfsiods, and that might have been within the statistical variance of the test runs.)

I'll commit the one that disables read-ahead for rdirplus in April, if that's ok with everyone. That way, if there is a big performance hit for some situation, it can be avoided by taking the "rdirplus" option off the mount.

Thanks for your help with this, rick