Date: Sat, 27 Jul 2013 23:58:15 +0300 From: Konstantin Belousov <kostikbel@gmail.com> To: Rick Macklem <rmacklem@uoguelph.ca> Cc: freebsd-stable@freebsd.org, Steven Hartland <killing@multiplay.co.uk>, Michael Tratz <michael@esosoft.com> Subject: Re: NFS deadlock on 9.2-Beta1 Message-ID: <20130727205815.GC4972@kib.kiev.ua> In-Reply-To: <806421474.2797338.1374956449542.JavaMail.root@uoguelph.ca> References: <780BC2DB-3BBA-4396-852B-0EBDF30BF985@esosoft.com> <806421474.2797338.1374956449542.JavaMail.root@uoguelph.ca>
next in thread | previous in thread | raw e-mail | index | archive | help
--5QAgd0e35j3NYeGe Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Jul 27, 2013 at 04:20:49PM -0400, Rick Macklem wrote: > Michael Tratz wrote: > >=20 > > On Jul 24, 2013, at 5:25 PM, Rick Macklem <rmacklem@uoguelph.ca> > > wrote: > >=20 > > > Michael Tratz wrote: > > >> Two machines (NFS Server: running ZFS / Client: disk-less), both > > >> are > > >> running FreeBSD r253506. The NFS client starts to deadlock > > >> processes > > >> within a few hours. It usually gets worse from there on. The > > >> processes stay in "D" state. I haven't been able to reproduce it > > >> when I want it to happen. I only have to wait a few hours until > > >> the > > >> deadlocks occur when traffic to the client machine starts to pick > > >> up. The only way to fix the deadlocks is to reboot the client. > > >> Even > > >> an ls to the path which is deadlocked, will deadlock ls itself. > > >> It's > > >> totally random what part of the file system gets deadlocked. The > > >> NFS > > >> server itself has no problem at all to access the files/path when > > >> something is deadlocked on the client. > > >>=20 > > >> Last night I decided to put an older kernel on the system r252025 > > >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks > > >> on > > >> the client machine (it should have deadlocked by now). FreeBSD is > > >> working hard like it always does. :-) There are a few changes to > > >> the > > >> NFS code from the revision which seems to work until Beta1. I > > >> haven't tried to narrow it down if one of those commits are > > >> causing > > >> the problem. Maybe someone has an idea what could be wrong and I > > >> can > > >> test a patch or if it's something else, because I'm not a kernel > > >> expert. :-) > > >>=20 > > > Well, the only NFS client change committed between r252025 and > > > r253506 > > > is r253124. It fixes a file corruption problem caused by a previous > > > commit that delayed the vnode_pager_setsize() call until after the > > > nfs node mutex lock was unlocked. > > >=20 > > > If you can test with only r253124 reverted to see if that gets rid > > > of > > > the hangs, it would be useful, although from the procstats, I doubt > > > it. > > >=20 > > >> I have run several procstat -kk on the processes including the ls > > >> which deadlocked. You can see them here: > > >>=20 > > >> http://pastebin.com/1RPnFT6r > > >=20 > > > All the processes you show seem to be stuck waiting for a vnode > > > lock > > > or in __utmx_op_wait. (I`m not sure what the latter means.) > > >=20 > > > What is missing is what processes are holding the vnode locks and > > > what they are stuck on. > > >=20 > > > A starting point might be ``ps axhl``, to see what all the threads > > > are doing (particularily the WCHAN for them all). If you can drop > > > into > > > the debugger when the NFS mounts are hung and do a ```show > > > alllocks`` > > > that could help. See: > > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/= kerneldebug-deadlocks.html > > >=20 > > > I`ll admit I`d be surprised if r253124 caused this, but who knows. > > >=20 > > > If there have been changes to your network device driver between > > > r252025 and r253506, I`d try reverting those. (If an RPC gets stuck > > > waiting for a reply while holding a vnode lock, that would do it.) > > >=20 > > > Good luck with it and maybe someone else can think of a commit > > > between r252025 and r253506 that could cause vnode locking or > > > network > > > problems. > > >=20 > > > rick > > >=20 > > >>=20 > > >> I have tried to mount the file system with and without nolockd. It > > >> didn't make a difference. Other than that it is mounted with: > > >>=20 > > >> rw,nfsv3,tcp,noatime,rsize=3D32768,wsize=3D32768 > > >>=20 > > >> Let me know if you need me to do something else or if some other > > >> output is required. I would have to go back to the problem kernel > > >> and wait until the deadlock occurs to get that information. > > >>=20 > >=20 > > Thanks Rick and Steven for your quick replies. > >=20 > > I spoke too soon regarding r252025 fixing the problem. The same issue > > started to show up after about 1 day and a few hours of uptime. > >=20 > > "ps axhl" shows all those stuck processes in newnfs > >=20 > > I recompiled the GENERIC kernel for Beta1 with the debugging options: > >=20 > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/ke= rneldebug-deadlocks.html > >=20 > > ps and debugging output: > >=20 > > http://pastebin.com/1v482Dfw > >=20 > > (I only listed processes matching newnfs, if you need the whole list, > > please let me know) > >=20 > Is your "show alllocks" complete? If not, a complete list of locks > would definitely help. As for "ps axhl", a complete list of processes/thr= eads > might be useful, but not as much, I think. >=20 > > The first PID showing up having that problem is 14001. Certainly the > > "show alllocks" command shows interesting information for that PID. > > I looked through the commit history for those files mentioned in the > > output to see if there is something obvious to me. But I don't know. > > :-) > > I hope that information helps you to dig deeper into the issue what > > might be causing those deadlocks. > >=20 > Well, pid 14001 is interesting in that it holds both the sleep lock > acquired by sblock() and an NFS vnode lock, but is waiting for another > NFS vnode lock, if I read the pastebin stuff correctly. >=20 > I suspect that this process is somewhere in kern_sendfile(), since that > seems to be where sblock() gets called before vn_lock(). 14001 is multithreaded, two its threads own shared vnode lock, and third thread sleeps for a lock. >=20 > It's just a "shot in the dark", but if you can revert > r250907 (dated May 22), it might be worth a try. It adds a bunch of > stuff to kern_sendfile(), including vn_lock() calls and the date sounds > about right. The output provided is not useful. I need the data listed in the referenced 'debugging deadlocks' page. >=20 > rick > ps: I've added kib@ to the cc list, in case he can help with this. He > appears to have been involved in the commits MFC'd by r250907. >=20 > > I did include the pciconf -lv, because you mentioned network device > > drivers. It's Intel igb. The same hardware is running a kernel from > > January 19th, 2013 also as an NFS client. That machine is rock > > solid. No problems at all. > >=20 > Ok, so it sounds like we are looking for a post-January 19 commit. >=20 > > I also went to r251611. That's before r251641 (The NFS FHA changes). > The FHA changes are server related and my understanding is that the > problem is showing up in an NFS client. >=20 > > Same problem. Here is another debugging output from that kernel: > >=20 > > http://pastebin.com/ryv8BYc4 > >=20 > > If I should test something else or provide some other output, please > > let me know. > >=20 > > Again thank you! > >=20 > > Michael > >=20 > >=20 > >=20 > _______________________________________________ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org" --5QAgd0e35j3NYeGe Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR9DRmAAoJEJDCuSvBvK1BUeMQAIyH669RB0mjpEVe+b63WOwg aM/bMaIOPinum4uLEuqHgGkhBx5pBtjyLNAswbQ9hfwJtoguDt91tGRHDqaGgZID mJn4xEDv9H5mR+W9+2orkRQ/JS42eub5Wo1+IBGvaSxMvZqqKmyIPClGbC7ldfHL Uto2SsPWoX89cdGOhKA1doji6ixtJ3QEEMY+9Y1OLdRT9wN9XCOEM7ggiEQg2jqV /p/nvMgZ7qA8tqR6roRDPBtMkOeOFYwzDp8/XphNiUx20TgtQpLAazse5NIKHGrb AT6mDAtCjbggCMqypHcavkpCGbLJDoBWLT5cVgph3EvDqT9CPqvJipztlky0XhGz pHygUSl7BzeaIkvoZmD+b7/dmeTW1Ivm9uKN1QJMQhnMincuLmcjDtVuHq9vlrO6 lCGzRBuHLjlngN74OI3GS6rWNOzabz/ls71+WX2rKOQ9BvLo8u6n1QGkpHJbsMgB N2/eq4JYIz+6rTUDMkGqU//ASFHsq1RZvmJDRgkt/NLSKF3VKZQjkfQ7+XhQT17D 5ZZOz6AgIy98MRzEMSlVxx5iYFPuxsvtINkdAPYVL4cQs5dem3aUcnE077fucABS U82vqROseE+s5+D8nYU5FzjZQtgN8Zt4fVoftoE91dIOhlZbXPFH6SFAp2VycD1t IMsQyDw1ggyozOOGDG/O =mF3a -----END PGP SIGNATURE----- --5QAgd0e35j3NYeGe--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130727205815.GC4972>