Date: Thu, 4 Nov 2010 22:35:15 -0700
From: Josh Carroll <josh.carroll@gmail.com>
To: freebsd-stable@freebsd.org
Cc: rmacklem@uoguelph.ca
Subject: NFS deadlock (unkillable nfsd and no mounts work)
Message-ID: <AANLkTikHKAL4m_fHjnoJBwFkD7xwKpa92uHLkMzzvm2p@mail.gmail.com>
Greetings!

I'm having a problem with nfsd hanging and not serving mount points, during which time it cannot be killed. This problem started happening sometime after November 2nd, since a kernel built from 11/2 sources does not exhibit it.

The kernel I'm currently running is built from SVN sources I grabbed this evening (around 5pm PDT on November 4th), but I was having the same problem yesterday around 9pm PDT after a csup. I switched to SVN today to rule out a stale /usr/src from an out-of-sync cvsup mirror. Here are the svn details:

    Path: /usr/src
    URL: svn://svn.freebsd.org/base/stable/8
    Repository Root: svn://svn.freebsd.org/base
    Repository UUID: ccf9f872-aa2e-dd11-9fc8-001c23d0bc1f
    Revision: 214807
    Node Kind: directory
    Schedule: normal
    Last Changed Author: jhb
    Last Changed Rev: 214791
    Last Changed Date: 2010-11-04 10:25:31 -0700 (Thu, 04 Nov 2010)

uname -a:

    FreeBSD 8.1-STABLE FreeBSD 8.1-STABLE #0 r214807: Thu Nov 4 17:13:05 PDT 2010 root@pflog.net:/usr/obj/usr/src/sys/PFLOG amd64

I have a Popcorn Hour, and as soon as I try to connect to my NFS mount with it, the Popcorn Hour hangs and eventually pops up a message that says "Request cannot be processed". Likewise, if I try to mount the export from my MacBook, the mount hangs for quite a while and then fails with "operation timed out" or something like that.

During this hang, there is nothing in /var/log indicating a problem, nor any other indication that something is wrong, except that none of my NFS mounts work and the nfsd process will not die. When I try to reboot the server, I wind up having to fsck all my drives (except the ZFS one), since nfsd will not die. Even kill -9 doesn't kill it (it shows as being in the D state):

    root 444 0.0 0.0 5812 1384 ?? D 9:30PM 0:00.00 nfsd: server (nfsd)

And if I try /etc/rc.d/nfsd stop, it just says:

    Stopping nfsd.
    Waiting for PIDS: 444

and hangs there indefinitely.
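For anyone wanting to check for the same symptom, here is a minimal sketch that picks processes in the D (uninterruptible) state out of ps output. It assumes the eleven-column layout of the ps line quoted above; the function name `stuck_processes` and its `name` parameter are made up for illustration and are not part of any FreeBSD tool.

```python
def stuck_processes(ps_output, name="nfsd"):
    """Return (pid, stat, command) for matching processes whose STAT starts with 'D'.

    Assumed column order (as in the line quoted in this message):
    USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
    """
    stuck = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header line and anything malformed
        pid, stat, command = int(fields[1]), fields[7], fields[10]
        if stat.startswith("D") and name in command:
            stuck.append((pid, stat, command))
    return stuck


# The ps line from this message, used as sample input.
sample = "root 444 0.0 0.0 5812 1384 ?? D 9:30PM 0:00.00 nfsd: server (nfsd)"
print(stuck_processes(sample))
```

In practice the input would be the captured text of a ps run on the server; the sample string here is only the single line shown above.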
I tried to run ktrace on both the "nfsd: server" and "nfsd: master" processes (ktrace -i -d -f nfsd_server.ktrace and ktrace -i -d -f nfsd_master.ktrace), but when I try to connect to the NFS mount, ktrace doesn't capture anything, the "nfsd: server" process goes into the "D" state, and then I can't kill it.

If I try to kill the nfsd process BEFORE I attempt to mount anything, it stops properly with /etc/rc.d/nfsd stop or with a kill -TERM. Once I've tried to connect once, however, it can't be killed.

Hoping it was perhaps related to ZFS, I commented out the one ZFS mount point in /etc/exports, but the nfsd process still deadlocks. I even went as far as commenting out everything in /etc/exports and creating a new export on a different disk, which did not help; I get the same nfsd hang.

Another strange thing: if I run truss on the "nfsd: server" process (the child) before I try to mount anything, the process exits immediately along with truss. If I look at what truss captured for it, I see:

    411: sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0)
    411: sigprocmask(SIG_SETMASK,0x0,0x0) = 0 (0x0)
    411: process exit, rval = 0

My kernel built from sources on 11/2 runs fine and does not have this nfsd lockup problem, so whatever broke it changed sometime after November 2nd.
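Since there is a known-good point (11/2 sources) and a known-bad one (r214807), one way to narrow the window is a binary search over revisions. The sketch below is only an illustration: the good revision 214600 and the culprit 214700 are made-up placeholders (only the date of the good sources is known, not a revision number), and in real use `is_bad` would mean "build and boot a kernel at that revision, attempt an NFS mount, and see whether nfsd hangs".

```python
def first_bad_revision(good, bad, is_bad):
    """Binary-search for the earliest bad revision in the range (good, bad].

    Assumes `good` is known to work, `bad` is known to fail, and the
    behavior flips exactly once somewhere in between.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid    # failure is already present at mid
        else:
            good = mid   # mid still works, so look later
    return bad


# Hypothetical stand-in for the manual build-and-test step.
KNOWN_GOOD = 214600        # placeholder, not a revision from this message
KNOWN_BAD = 214807         # the revision reported in this message
PRETEND_CULPRIT = 214700   # placeholder used only to exercise the search

result = first_bad_revision(KNOWN_GOOD, KNOWN_BAD, lambda r: r >= PRETEND_CULPRIT)
print(result)
```

A range of about 200 revisions needs at most eight kernel builds this way, and fewer in practice, since repository revision numbers that did not touch stable/8 yield identical trees and need not be rebuilt.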
My kernel is just GENERIC with a few additions:

    include GENERIC
    device pf
    device pflog
    device coretemp
    device uchcom
    device sound
    device snd_hda
    option NETATALK
    option ALTQ
    option ALTQ_CBQ
    option ALTQ_HFSC
    option ALTQ_NOPCC
    option ALTQ_PRIQ
    option ALTQ_RED
    option ALTQ_RIO
    option COMPAT_LINUX32
    option GEOM_MIRROR
    option LIBICONV
    option LIBMCHAIN
    option NETSMB
    option NULLFS
    option SMBFS
    option UDF
    nooption INET6

If any other information is needed, please let me know. What are the next things I should be doing to diagnose the problem? It seems specific to nfsd, but I'm not sure how to prove it's that and not something related or complementary to nfsd. For what it's worth, rpcbind and mountd both stop fine; it's just the nfsd process that is locking up.

Thanks in advance. Any advice on troubleshooting or root-causing the issue would be appreciated.

Regards,
Josh
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?AANLkTikHKAL4m_fHjnoJBwFkD7xwKpa92uHLkMzzvm2p>