Subject: Re: NFS deadlock on 9.2-Beta1
From: Michael Tratz <michael@esosoft.com>
Date: Thu, 25 Jul 2013 20:05:59 -0700
To: Rick Macklem
Cc: Steven Hartland, freebsd-stable@freebsd.org
Message-Id: <780BC2DB-3BBA-4396-852B-0EBDF30BF985@esosoft.com>
In-Reply-To: <960930050.1702791.1374711910151.JavaMail.root@uoguelph.ca>
List-Id: Production branch of FreeBSD source code

On Jul 24, 2013, at 5:25 PM, Rick Macklem wrote:

> Michael Tratz wrote:
>> Two machines (NFS server: running ZFS / client: disk-less), both running
>> FreeBSD r253506. The NFS client starts to deadlock processes within a few
>> hours, and it usually gets worse from there on. The processes stay in "D"
>> state. I haven't been able to reproduce it on demand; I only have to wait
>> a few hours until the deadlocks occur, once traffic to the client machine
>> starts to pick up. The only way to clear the deadlocks is to reboot the
>> client. Even an ls of the deadlocked path will deadlock ls itself. It's
>> totally random which part of the file system gets deadlocked. The NFS
>> server itself has no problem at all accessing the files/paths while
>> something is deadlocked on the client.
>>
>> Last night I decided to put an older kernel on the system, r252025
>> (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
>> the client machine (it should have deadlocked by now). FreeBSD is
>> working hard like it always does. :-) There are a few changes to the
>> NFS code between the revision that seems to work and Beta1. I
>> haven't tried to narrow down whether one of those commits is causing
>> the problem. Maybe someone has an idea what could be wrong and I can
>> test a patch, or maybe it's something else, because I'm not a kernel
>> expert. :-)
>>
> Well, the only NFS client change committed between r252025 and r253506
> is r253124. It fixes a file corruption problem caused by a previous
> commit that delayed the vnode_pager_setsize() call until after the
> nfs node mutex lock was unlocked.
>
> If you can test with only r253124 reverted to see if that gets rid of
> the hangs, it would be useful, although from the procstats, I doubt it.
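
For anyone following along, testing with only that commit reverted would
be roughly the following, assuming /usr/src is an svn checkout of the tree
being tested and a GENERIC-based kernel config (adjust KERNCONF and paths
for your setup):

    cd /usr/src
    svn merge -c -253124 .               # reverse-merge (revert) just r253124
    make buildkernel KERNCONF=GENERIC
    make installkernel KERNCONF=GENERIC
    shutdown -r now
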
>
>> I have run several procstat -kk on the processes, including the ls
>> which deadlocked. You can see them here:
>>
>> http://pastebin.com/1RPnFT6r
>
> All the processes you show seem to be stuck waiting for a vnode lock
> or in __umtx_op_wait. (I'm not sure what the latter means.)
>
> What is missing is which processes are holding the vnode locks and
> what they are stuck on.
>
> A starting point might be "ps axhl", to see what all the threads
> are doing (particularly the WCHAN for them all). If you can drop into
> the debugger when the NFS mounts are hung and do a "show alllocks",
> that could help. See:
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>
> I'll admit I'd be surprised if r253124 caused this, but who knows.
>
> If there have been changes to your network device driver between
> r252025 and r253506, I'd try reverting those. (If an RPC gets stuck
> waiting for a reply while holding a vnode lock, that would do it.)
>
> Good luck with it, and maybe someone else can think of a commit
> between r252025 and r253506 that could cause vnode locking or network
> problems.
>
> rick
>
>>
>> I have tried to mount the file system with and without nolockd. It
>> didn't make a difference. Other than that it is mounted with:
>>
>> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
>>
>> Let me know if you need me to do something else or if some other
>> output is required. I would have to go back to the problem kernel
>> and wait until the deadlock occurs to get that information.
>>

Thanks Rick and Steven for your quick replies.

I spoke too soon about r252025 fixing the problem. The same issue started
to show up after about 1 day and a few hours of uptime. "ps axhl" shows
all those stuck processes sleeping in newnfs.

I recompiled the GENERIC kernel for Beta1 with the debugging options from:

http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

ps and debugging output: http://pastebin.com/1v482Dfw

(I only listed processes matching newnfs; if you need the whole list,
please let me know.)

The first PID showing the problem is 14001, and the "show alllocks"
command certainly has interesting information for that PID. I looked
through the commit history for the files mentioned in that output, but
nothing jumped out at me. :-)

I hope that information helps you dig deeper into what might be causing
those deadlocks.

I did include the pciconf -lv output, because you mentioned network device
drivers. The NIC is an Intel igb. The same hardware is running a kernel
from January 19th, 2013, also as an NFS client, and that machine is rock
solid. No problems at all.

I also went back to r251611, which is before r251641 (the NFS FHA changes).
Same problem. Here is another debugging output from that kernel:

http://pastebin.com/ryv8BYc4

If I should test something else or provide some other output, please let
me know.

Again, thank you!

Michael
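
P.S. For completeness, the debugging options I added to the kernel config
are roughly the set from the handbook page above (a sketch, not a verbatim
copy of my config):

    options DDB
    options KDB
    options INVARIANTS
    options INVARIANT_SUPPORT
    options WITNESS
    options WITNESS_SKIPSPIN
    options DEBUG_LOCKS
    options DEBUG_VFS_LOCKS
    options DIAGNOSTIC

and collecting the lock information once the hang shows up amounts to
something like:

    sysctl debug.kdb.enter=1         # drop into DDB from the console
    db> show alllocks
    db> show lockedvnods
    db> alltrace
    db> reset                        # reboot when done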