Date: Thu, 25 Jul 2013 20:05:59 -0700
From: Michael Tratz <michael@esosoft.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: Steven Hartland <killing@multiplay.co.uk>, freebsd-stable@freebsd.org
Subject: Re: NFS deadlock on 9.2-Beta1
Message-ID: <780BC2DB-3BBA-4396-852B-0EBDF30BF985@esosoft.com>
In-Reply-To: <960930050.1702791.1374711910151.JavaMail.root@uoguelph.ca>
References: <960930050.1702791.1374711910151.JavaMail.root@uoguelph.ca>
On Jul 24, 2013, at 5:25 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Michael Tratz wrote:
>> Two machines (NFS Server: running ZFS / Client: disk-less), both are
>> running FreeBSD r253506. The NFS client starts to deadlock processes
>> within a few hours. It usually gets worse from there on. The
>> processes stay in "D" state. I haven't been able to reproduce it
>> when I want it to happen. I only have to wait a few hours until the
>> deadlocks occur when traffic to the client machine starts to pick
>> up. The only way to fix the deadlocks is to reboot the client. Even
>> an ls to the path which is deadlocked will deadlock ls itself. It's
>> totally random what part of the file system gets deadlocked. The NFS
>> server itself has no problem at all accessing the files/path when
>> something is deadlocked on the client.
>>
>> Last night I decided to put an older kernel on the system, r252025
>> (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
>> the client machine (it should have deadlocked by now). FreeBSD is
>> working hard like it always does. :-) There are a few changes to the
>> NFS code between the revision which seems to work and Beta1. I
>> haven't tried to narrow down whether one of those commits is causing
>> the problem. Maybe someone has an idea what could be wrong and I can
>> test a patch, or maybe it's something else, because I'm not a kernel
>> expert. :-)
>>
> Well, the only NFS client change committed between r252025 and r253506
> is r253124. It fixes a file corruption problem caused by a previous
> commit that delayed the vnode_pager_setsize() call until after the
> nfs node mutex lock was unlocked.
>
> If you can test with only r253124 reverted to see if that gets rid of
> the hangs, it would be useful, although from the procstats, I doubt it.
>
>> I have run several procstat -kk on the processes including the ls
>> which deadlocked. You can see them here:
>>
>> http://pastebin.com/1RPnFT6r
>
> All the processes you show seem to be stuck waiting for a vnode lock
> or in __umtx_op_wait. (I'm not sure what the latter means.)
>
> What is missing is which processes are holding the vnode locks and
> what they are stuck on.
>
> A starting point might be "ps axhl", to see what all the threads
> are doing (particularly the WCHAN for them all). If you can drop into
> the debugger when the NFS mounts are hung and do a "show alllocks",
> that could help. See:
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>
> I'll admit I'd be surprised if r253124 caused this, but who knows.
>
> If there have been changes to your network device driver between
> r252025 and r253506, I'd try reverting those. (If an RPC gets stuck
> waiting for a reply while holding a vnode lock, that would do it.)
>
> Good luck with it and maybe someone else can think of a commit
> between r252025 and r253506 that could cause vnode locking or network
> problems.
>
> rick
>
>> I have tried to mount the file system with and without nolockd. It
>> didn't make a difference. Other than that it is mounted with:
>>
>> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
>>
>> Let me know if you need me to do something else or if some other
>> output is required. I would have to go back to the problem kernel
>> and wait until the deadlock occurs to get that information.
>>

Thanks Rick and Steven for your quick replies.

I spoke too soon regarding r252025 fixing the problem.
The same issue started to show up after about 1 day and a few hours of
uptime. "ps axhl" shows all those stuck processes in newnfs.

I recompiled the GENERIC kernel for Beta1 with the debugging options from:

http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html

ps and debugging output: http://pastebin.com/1v482Dfw

(I only listed processes matching newnfs; if you need the whole list,
please let me know.)

The first PID showing up with that problem is 14001. Certainly the
"show alllocks" command shows interesting information for that PID. I
looked through the commit history for the files mentioned in that
output to see if there is something obvious to me, but I don't know. :-)

I hope that information helps you dig deeper into what might be causing
those deadlocks.

I did include the pciconf -lv output, because you mentioned network
device drivers. It's Intel igb. The same hardware is running a kernel
from January 19th, 2013, also as an NFS client. That machine is rock
solid. No problems at all.

I also went to r251611, which is before r251641 (the NFS FHA changes).
Same problem. Here is another debugging output from that kernel:

http://pastebin.com/ryv8BYc4

If I should test something else or provide some other output, please
let me know.

Again thank you!

Michael
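
[A rough sketch of the debugging setup and data-gathering steps referred
to in this thread. The kernel options follow the handbook page linked
above, and the commands (ps axhl, procstat -kk, the ddb "show" commands)
are the ones mentioned in the messages; the exact option list and flags
here are illustrative and should be checked against the handbook rather
than taken as verified on 9.2-Beta1.]

    # Kernel config additions for lock/deadlock debugging (per the handbook):
    makeoptions     DEBUG=-g
    options         KDB
    options         DDB
    options         WITNESS            # needed for "show alllocks"
    options         INVARIANTS
    options         INVARIANT_SUPPORT
    options         DEBUG_LOCKS
    options         DEBUG_VFS_LOCKS

    # Once the NFS mount hangs:
    ps axhl                     # WCHAN shows what each "D"-state process sleeps on
    procstat -kk <pid>          # kernel stack of a stuck process
    sysctl debug.kdb.enter=1    # drop into ddb on the console
    db> show alllocks           # which threads hold which locks
    db> show lockedvnods        # locked vnodes and their lock holders
    db> ps

The first two commands can be captured over ssh while the machine is
still partly responsive; the ddb part needs console or serial access.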