From owner-freebsd-fs@FreeBSD.ORG Wed Mar 13 23:33:37 2013
Date: Wed, 13 Mar 2013 19:33:35 -0400 (EDT)
From: Rick Macklem
To: John Baldwin
Cc: Rick Macklem, fs@freebsd.org
Subject: Re: Deadlock in the NFS client
Message-ID: <492562517.3880600.1363217615412.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <201303131356.37919.jhb@freebsd.org>

John Baldwin wrote:
> I ran into a machine that had a deadlock among certain files on a
> given NFS mount today. I'm not sure how best to resolve it, though it
> seems like perhaps there is a bug in how the pool of nfsiod threads is
> managed. Anyway, more details on the actual hang below. This was on
> 8.x with the old NFS client, but I don't see anything in HEAD that
> would fix this.
>
> First, note that the system was idle, so it had dropped down to only
> one nfsiod thread.
>
Hmm, I see the problem and I'm a bit surprised it doesn't bite more
often. It seems to me that this snippet of code from nfs_asyncio()
makes too weak an assumption:

	/*
	 * If none are free, we may already have an iod working on this mount
	 * point.  If so, it will process our request.
	 */
	if (!gotiod) {
		if (nmp->nm_bufqiods > 0) {
			NFS_DPF(ASYNCIO,
			    ("nfs_asyncio: %d iods are already processing mount %p\n",
			    nmp->nm_bufqiods, nmp));
			gotiod = TRUE;
		}
	}

It assumes that, because an nfsiod thread is already processing some
buffer for the mount, that thread will eventually become available to
handle this request as well, which isn't true for your deadlock.

I think the simple fix would be to recode nfs_asyncio() so that it only
returns 0 if it finds an AVAILABLE nfsiod thread that it has assigned to
do the I/O, getting rid of the snippet above. The problem with doing
this is that it may result in a lot more synchronous I/O (nfs_asyncio()
returns EIO, so the caller does the I/O itself). Maybe more synchronous
I/O could be avoided by allowing nfs_asyncio() to create a new thread
even if the total is above nfs_iodmax. (I think this would require the
fixed array to be replaced with a linked list and might result in a
large number of nfsiod threads.) Maybe just having a large nfs_iodmax
would be an adequate compromise?
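To make that concrete, here is a rough, untested sketch of how the tail
of nfs_asyncio() might look once that shortcut is removed. The details
(nfs_iodwant[], nfs_iodmount[], nfs_numasync, NFSIOD_AVAILABLE, and an
nfs_nfsiodnew() assumed to return the new thread's index or -1) are used
loosely for illustration only, not as a patch against the tree:

	int iod, gotiod = FALSE;

	/*
	 * Only an idle nfsiod counts.  A busy one may be the very thread
	 * that is blocked behind our caller (the readdirplus case in the
	 * trace below), so "some iod is already working on this mount" is
	 * no longer treated as success.
	 */
	for (iod = 0; iod < nfs_numasync; iod++) {
		if (nfs_iodwant[iod] == NFSIOD_AVAILABLE) {
			gotiod = TRUE;
			break;
		}
	}

	/*
	 * None idle: try to create another nfsiod (assumed here to return
	 * the new thread's index, or -1 on failure).
	 */
	if (!gotiod && (iod = nfs_nfsiodnew()) >= 0)
		gotiod = TRUE;

	/*
	 * Still nothing: return EIO so the caller does the I/O itself
	 * instead of queueing the buffer behind a possibly wedged nfsiod.
	 */
	if (!gotiod)
		return (EIO);

	/* Assign the mount to the chosen nfsiod, queue the buffer, wake it. */
	nfs_iodmount[iod] = nmp;
	nmp->nm_bufqiods++;
	TAILQ_INSERT_TAIL(&nmp->nm_bufq, bp, b_freelist);
	nmp->nm_bufqlen++;
	wakeup(&nfs_iodwant[iod]);
	return (0);

The important property is that nfs_asyncio() never reports success
unless a specific idle thread has actually been handed the buffer.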
Does having a large # of nfsiod threads cause any serious problem for
most systems these days? I'd be tempted to recode nfs_asyncio() as
above and then, instead of nfs_iodmin and nfs_iodmax, I'd simply have:
- a fixed number of nfsiod threads (this could be a tunable, with the
  understanding that it should be large for good performance)
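Purely as an illustration (the vfs.nfs.iodthreads name and the default
below are invented for this sketch, not an existing knob), such a fixed
pool size could be exposed as a loader tunable plus a read-only sysctl:

	#include <sys/param.h>
	#include <sys/kernel.h>
	#include <sys/sysctl.h>

	SYSCTL_DECL(_vfs_nfs);		/* existing vfs.nfs sysctl node */

	/* Invented name and default, shown only to illustrate the idea. */
	static int nfs_iodthreads = 32;
	TUNABLE_INT("vfs.nfs.iodthreads", &nfs_iodthreads);
	SYSCTL_INT(_vfs_nfs, OID_AUTO, iodthreads, CTLFLAG_RD,
	    &nfs_iodthreads, 0, "Fixed number of nfsiod kthreads");

The pool would then be created once at that size and never grown or
shrunk, which would also do away with the idle shrinking that left you
with a single nfsiod in the first place.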
rick

> The nfsiod thread is hung on a vnode lock:
>
> (kgdb) proc 36927
> [Switching to thread 150 (Thread 100679)]
> #0  sched_switch (td=0xffffff0320de88c0, newtd=0xffffff0003521460,
>     flags=Variable "flags" is not available.)
>     at /usr/src/sys/kern/sched_ule.c:1898
> 1898            cpuid = PCPU_GET(cpuid);
> (kgdb) where
> #0  sched_switch (td=0xffffff0320de88c0, newtd=0xffffff0003521460,
>     flags=Variable "flags" is not available.)
>     at /usr/src/sys/kern/sched_ule.c:1898
> #1  0xffffffff80407953 in mi_switch (flags=260, newtd=0x0)
>     at /usr/src/sys/kern/kern_synch.c:449
> #2  0xffffffff8043e342 in sleepq_wait (wchan=0xffffff0358bbb7f8, pri=96)
>     at /usr/src/sys/kern/subr_sleepqueue.c:629
> #3  0xffffffff803e5755 in __lockmgr_args (lk=0xffffff0358bbb7f8,
>     flags=524544, ilk=0xffffff0358bbb820,
>     wmesg=Variable "wmesg" is not available.)
>     at /usr/src/sys/kern/kern_lock.c:220
> #4  0xffffffff80489219 in vop_stdlock (ap=Variable "ap" is not available.)
>     at lockmgr.h:94
> #5  0xffffffff80697322 in VOP_LOCK1_APV (vop=0xffffffff80892b00,
>     a=0xffffff847ac10600) at vnode_if.c:1988
> #6  0xffffffff804a8bb7 in _vn_lock (vp=0xffffff0358bbb760, flags=524288,
>     file=0xffffffff806fa421 "/usr/src/sys/kern/vfs_subr.c", line=2138)
>     at vnode_if.h:859
> #7  0xffffffff8049b680 in vget (vp=0xffffff0358bbb760, flags=524544,
>     td=0xffffff0320de88c0) at /usr/src/sys/kern/vfs_subr.c:2138
> #8  0xffffffff8048d4aa in vfs_hash_get (mp=0xffffff004a3a0000,
>     hash=227722108, flags=Variable "flags" is not available.)
>     at /usr/src/sys/kern/vfs_hash.c:81
> #9  0xffffffff805631f6 in nfs_nget (mntp=0xffffff004a3a0000,
>     fhp=0xffffff03771eed56, fhsize=32, npp=0xffffff847ac10a40,
>     flags=524288) at /usr/src/sys/nfsclient/nfs_node.c:120
> #10 0xffffffff80570229 in nfs_readdirplusrpc (vp=0xffffff0179902760,
>     uiop=0xffffff847ac10ad0, cred=0xffffff005587c300)
>     at /usr/src/sys/nfsclient/nfs_vnops.c:2636
> #11 0xffffffff8055f144 in nfs_doio (vp=0xffffff0179902760,
>     bp=0xffffff83e05c5860, cr=0xffffff005587c300,
>     td=Variable "td" is not available.)
>     at /usr/src/sys/nfsclient/nfs_bio.c:1600
> #12 0xffffffff8056770a in nfssvc_iod (instance=Variable "instance" is
>     not available.) at /usr/src/sys/nfsclient/nfs_nfsiod.c:303
> #13 0xffffffff803d0c2f in fork_exit (callout=0xffffffff805674b0 <nfssvc_iod>,
>     arg=0xffffffff809266e0, frame=0xffffff847ac10c40)
>     at /usr/src/sys/kern/kern_fork.c:861
>
> Thread stuck in getblk for that vnode (holds shared lock on this vnode):
>
> (kgdb) proc 36902
> [Switching to thread 149 (Thread 101543)]
> #0  sched_switch (td=0xffffff0378d8a8c0, newtd=0xffffff0003521460,
>     flags=Variable "flags" is not available.)
>     at /usr/src/sys/kern/sched_ule.c:1898
> 1898            cpuid = PCPU_GET(cpuid);
> (kgdb) where
> #0  sched_switch (td=0xffffff0378d8a8c0, newtd=0xffffff0003521460,
>     flags=Variable "flags" is not available.)
>     at /usr/src/sys/kern/sched_ule.c:1898
> #1  0xffffffff80407953 in mi_switch (flags=260, newtd=0x0)
>     at /usr/src/sys/kern/kern_synch.c:449
> #2  0xffffffff8043e342 in sleepq_wait (wchan=0xffffff83e11bc1c0, pri=96)
>     at /usr/src/sys/kern/subr_sleepqueue.c:629
> #3  0xffffffff803e5755 in __lockmgr_args (lk=0xffffff83e11bc1c0,
>     flags=530688, ilk=0xffffff0358bbb878,
>     wmesg=Variable "wmesg" is not available.)
>     at /usr/src/sys/kern/kern_lock.c:220
> #4  0xffffffff80483aeb in getblk (vp=0xffffff0358bbb760, blkno=3,
>     size=32768, slpflag=0, slptimeo=0, flags=0) at lockmgr.h:94
> #5  0xffffffff8055e963 in nfs_getcacheblk (vp=0xffffff0358bbb760, bn=3,
>     size=32768, td=0xffffff0378d8a8c0)
>     at /usr/src/sys/nfsclient/nfs_bio.c:1259
> #6  0xffffffff805627d9 in nfs_bioread (vp=0xffffff0358bbb760,
>     uio=0xffffff847bcf0ad0, ioflag=Variable "ioflag" is not available.)
>     at /usr/src/sys/nfsclient/nfs_bio.c:530
> #7  0xffffffff806956a4 in VOP_READ_APV (vop=0xffffffff808a29a0,
>     a=0xffffff847bcf09c0) at vnode_if.c:887
> #8  0xffffffff804a9e27 in vn_read (fp=0xffffff03b6a506e0,
>     uio=0xffffff847bcf0ad0,
>     active_cred=Variable "active_cred" is not available.)
>     at vnode_if.h:384
> #9  0xffffffff80444fb1 in dofileread (td=0xffffff0378d8a8c0, fd=3,
>     fp=0xffffff03b6a506e0, auio=0xffffff847bcf0ad0,
>     offset=Variable "offset" is not available.) at file.h:242
>
> The buffer is locked by LK_KERNPROC:
>
> (kgdb) bprint bp
> 0xffffff83e11bc128: BIO_READ flags (ASYNC|VMIO)
>     error = 0, bufsize = 32768, bcount = 32768, b_resid = 0
>     bufobj = 0xffffff0358bbb878, data = 0xffffff8435a05000, blkno = d,
>     dep = 0x0
>     lock type bufwait: EXCL by LK_KERNPROC with exclusive waiters pending
>
> And this buffer is queued as the first pending buffer on the mount,
> waiting for service by an nfsiod:
>
> (kgdb) set $nmp = (struct nfsmount *)vp->v_mount->mnt_data
> (kgdb) p $nmp->nm_bufq.tqh_first
> $24 = (struct buf *) 0xffffff83e11bc128
> (kgdb) p bp
> $25 = (struct buf *) 0xffffff83e11bc128
>
> So, the first process is waiting for a block from an NFS directory.
> That block is in queue to be completed as an async I/O by the nfsiod
> thread pool. However, the lone nfsiod thread in the pool is waiting to
> exclusively lock the original NFS directory to update its attributes,
> so it cannot service the async I/O request.
>
> --
> John Baldwin