From owner-freebsd-fs@FreeBSD.ORG Thu Mar 14 14:57:29 2013
From: John Baldwin <jhb@freebsd.org>
To: Konstantin Belousov
Cc: Rick Macklem, fs@freebsd.org
Subject: Re: Deadlock in the NFS client
Date: Thu, 14 Mar 2013 10:57:13 -0400
Message-Id: <201303141057.13609.jhb@freebsd.org>
In-Reply-To: <20130314092728.GI3794@kib.kiev.ua>
References: <201303131356.37919.jhb@freebsd.org> <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca> <20130314092728.GI3794@kib.kiev.ua>

On Thursday, March 14, 2013 5:27:28 am Konstantin Belousov wrote:
> On Wed, Mar 13, 2013 at 07:33:35PM -0400, Rick Macklem wrote:
> > John Baldwin wrote:
> > > I ran into a machine that had a deadlock among certain files on a
> > > given NFS mount today.  I'm not sure how best to resolve it, though
> > > it seems like perhaps there is a bug with how the pool of nfsiod
> > > threads is managed.  Anyway, more details on the actual hang below.
> > > This was on 8.x with the old NFS client, but I don't see anything
> > > in HEAD that would fix this.
> > >
> > > First note that the system was idle so it had dropped down to only
> > > one nfsiod thread.
> > >
> > Hmm, I see the problem and I'm a bit surprised it doesn't bite more often.
> > It seems to me that this snippet of code from nfs_asyncio() makes too
> > weak an assumption:
> >
> >         /*
> >          * If none are free, we may already have an iod working on this mount
> >          * point.  If so, it will process our request.
> >          */
> >         if (!gotiod) {
> >                 if (nmp->nm_bufqiods > 0) {
> >                         NFS_DPF(ASYNCIO,
> >                             ("nfs_asyncio: %d iods are already processing mount %p\n",
> >                             nmp->nm_bufqiods, nmp));
> >                         gotiod = TRUE;
> >                 }
> >         }
> >
> > It assumes that, since an nfsiod thread is processing some buffer for the
> > mount, it will become available to do this one, which isn't true for your
> > deadlock.
> >
> > I think the simple fix would be to recode nfs_asyncio() so that
> > it only returns 0 if it finds an AVAILABLE nfsiod thread that it
> > has assigned to do the I/O, getting rid of the above.  The problem
> > with doing this is that it may result in a lot more synchronous I/O
> > (nfs_asyncio() returns EIO, so the caller does the I/O).  Maybe more
> > synchronous I/O could be avoided by allowing nfs_asyncio() to create a
> > new thread even if the total is above nfs_iodmax.
> > (I think this would require the fixed array to be replaced with a
> > linked list and might result in a large number of nfsiod threads.)
> > Maybe just having a large nfs_iodmax would be an adequate compromise?
> >
> > Does having a large number of nfsiod threads cause any serious problem
> > for most systems these days?
> >
> > I'd be tempted to recode nfs_asyncio() as above and then, instead
> > of nfs_iodmin and nfs_iodmax, I'd simply have:
> > - a fixed number of nfsiod threads (this could be a tunable, with the
> >   understanding that it should be large for good performance)
> >
> I do not see how this would solve the deadlock itself.  The proposal would
> only allow the system to survive slightly longer after the deadlock
> appeared.  And I think that allowing an unbounded number of nfsiod
> threads is also fatal.
>
> The issue there is the LOR between the buffer lock and the vnode lock.
> The buffer lock must always come after the vnode lock.  The problematic
> nfsiod thread, which locks the vnode, violates this rule, because despite
> the LK_KERNPROC ownership of the buffer lock, it is the thread which de
> facto owns the buffer (only that thread can unlock it).
>
> A possible solution would be to pass LK_NOWAIT to nfs_nget() from
> nfs_readdirplusrpc().  From my reading of the code, nfs_nget() should
> be capable of correctly handling the lock failure.  And EBUSY would
> result in doit = 0, which should be fine too.
>
> It is possible that EBUSY should be reset to 0, though.

Yes, thinking about this more, I do think the right answer is for
readdirplus to do this.  The only question I have is whether it should do
this always, or only from an nfsiod thread.  I believe you can't get this
deadlock in the non-nfsiod case.

--
John Baldwin
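
For readers following along in the code, a rough sketch of the nfs_asyncio()
tightening Rick describes above might look like the fragment below: success
is reported only when an idle nfsiod has actually been handed the mount, and
the "an iod is already working on this mount" shortcut is dropped so the
caller falls back to synchronous I/O instead.  The nfs_iodwant[] and
nfs_iodmount[] arrays, nfs_numasync, and the wakeup() hand-off are recalled
from the 8.x-era client and should be treated as assumptions, not as an
actual patch from this thread.

        /*
         * Sketch only: look for an idle nfsiod and assign it this mount.
         * nfs_iodwant[], nfs_iodmount[] and nfs_numasync are assumed from
         * the 8.x-era client; the real function has more bookkeeping.
         */
        gotiod = FALSE;
        for (i = 0; i < nfs_numasync; i++) {
                if (nfs_iodwant[i] != NULL) {
                        /* Hand the idle thread this mount and wake it up. */
                        nfs_iodwant[i] = NULL;
                        nfs_iodmount[i] = nmp;
                        nmp->nm_bufqiods++;
                        wakeup(&nfs_iodwant[i]);
                        gotiod = TRUE;
                        break;
                }
        }

        /*
         * No "some iod already serves this mount, it will get to our
         * buffer eventually" fallback: if no idle thread was found,
         * return EIO and let the caller do the I/O synchronously.
         */
        if (!gotiod)
                return (EIO);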
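
A similarly rough sketch of the readdirplus change Kostik suggests above
(and that the reply agrees is the right answer): pass LK_NOWAIT to
nfs_nget() and treat EBUSY as "skip caching this entry" rather than as an
error.  The nfs_nget() arguments, the doit flag, and the surrounding decode
step are paraphrased from memory of the old client's nfs_readdirplusrpc()
and are assumptions; whether the flag should be added unconditionally or
only when running in an nfsiod is the open question above.

        /*
         * Sketch only: try to get the entry's vnode without sleeping on
         * its lock.  Sleeping here from an nfsiod can deadlock against a
         * thread that holds the vnode lock while it waits for the buffer
         * this nfsiod effectively owns (LK_KERNPROC).
         */
        if (doit) {
                error = nfs_nget(vp->v_mount, fhp, fhsize, &np,
                    LK_EXCLUSIVE | LK_NOWAIT);
                if (error == EBUSY) {
                        /*
                         * Vnode lock is busy right now: skip filling in
                         * this entry's attributes instead of sleeping,
                         * and do not report the failure to the caller.
                         */
                        doit = 0;
                        error = 0;
                }
        }
        if (doit && error == 0) {
                nvp = NFSTOV(np);
                /* ... decode the readdirplus attributes into the nfsnode ... */
        }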