Date: Thu, 18 Nov 2010 14:54:23 +0200
From: Kostik Belousov <kostikbel@gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: freebsd-fs@freebsd.org, Oliver Fromme <olli@lurza.secnetix.de>
Subject: Re: NFS hangs (7.3)
Message-ID: <20101118125423.GD2392@deviant.kiev.zoral.com.ua>
In-Reply-To: <230979963.266261.1290084581845.JavaMail.root@erie.cs.uoguelph.ca>
References: <201011171705.oAHH5age003849@lurza.secnetix.de> <230979963.266261.1290084581845.JavaMail.root@erie.cs.uoguelph.ca>

On Thu, Nov 18, 2010 at 07:49:41AM -0500, Rick Macklem wrote:
> > I've got a problem on a server farm. Every now and then,
> > some NFS mounts hang. This happens after a few days or
> > after a few weeks. All processes trying to access files
> > from the hanging mount go to state "D" and freeze. The
> > only way to resolve the problem is to reboot the server.
> >
> > "umount -f" also hangs and does not remove the hanging
> > mount (even though it disappears from the output of the
> > mount(8) command).
> >
> > Here's one example from an attempt to run df(1) which
> > also hangs:
> >
> > ps -uww:
> > USER    PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
> > root  61930  0.0  0.0  5728  1280  p4- D     5:15PM   0:00.01 /bin/df
> >
> > ps -lww:
> >   UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME COMMAND
> >     0 61930     1   0  -4  0  5728  1280 nfs    D     p4-   0:00.01 /bin/df
> >
> It would appear that the root vnode for the client mount
> point is locked for some reason. Here are a couple of possible
> explanations:
> 1 - An infrequently executed code path doesn't VOP_UNLOCK()/vput()
>     as it should. This seems relatively unlikely, since others are
>     using the client without difficulties, but it might be an error
>     case that only shows up for your environment.
> 2 - Another thread is holding the lock while stuck waiting for something
>     else. The most obvious "something else" would be an RPC reply from
>     the server. (A locking deadlock as mentioned below w.r.t. the spawning
>     of new nfsiod threads, could be another?)
>
> I'd suggest a "ps axHl" when this happens, and then look for a thread that
> is waiting for an RPC reply. I'd also suggest "nfsstat -c" done several
> times over a few minutes, to see if any of the counts is changing.
> Also, you can do "tcpdump -w xxx -s 0 host <nfs-server>" on the client
> for a while and then look at "xxx" in wireshark (it knows NFS packets)
> and see if there is any net traffic to/from the server. (This will tell
> you if it is a problem related to an RPC that is in progress vs something
> else.) It will also tell you if it is using TCP (or you can "netstat -a"
> to see if TCP connections are there for the NFS mounts).
>
> >
> > The machine is quite busy. The hangs seem to always occur
> > in the night when lots of cron jobs are running. The machine
> > has 221 NFS mounts and 26 nullfs mounts, and it has 26 jails,
> > if that matters. All NFS shares are mounted from a virtual
> > filer running on a NetApp filer. The mounts use the default
> > settings, so they should be v3 TCP (this is the default,
> > right?). The only extra option we use is -L in order to
> > "fake" locking locally.
> >
> > The machine is running FreeBSD 7.3-PRERELEASE-20100311 amd64.
> > Updating is somewhat complicated in that server farm, so I
> > haven't tried that so far because I'm not sure if it would
> > help.
> >
> I've only been working with 8/current, so I can't recall if
> there have been any client fixes for 7 since then, except there
> was a very recent change w.r.t. spawning of nfsiod threads to
> avoid lor (potential deadlocks) related to creating new kernel
> threads. I have no idea if one of these deadlocks might be involved.
> (Someone familiar with that might be able to comment?)

The changes for nfsiod creation are definitely not in 7.3-prerelease.

To diagnose the issue, we could start with the output of ps axlHww
(already suggested by Rick) and procstat -ka.
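
The diagnostics named in this thread (ps axlHww, procstat -ka, netstat -a,
and several nfsstat -c runs spread over a few minutes) could be gathered in
one pass with a small sh script. What follows is only a sketch: the function
names capture and collect_nfs_diag, the default output directory, and the
default sample count and interval are assumptions made for illustration,
not anything posted in the thread.

```shell
#!/bin/sh
# Sketch only: capture(), collect_nfs_diag() and the defaults below are
# hypothetical names/values, not something taken from the thread.

# Run a command and save stdout+stderr to a file. A missing tool or a
# non-zero exit is recorded in the file instead of aborting the run.
capture() {
    out="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$out" 2>&1 || echo "exit status: $?" >>"$out"
    else
        echo "skipped: $1 not found" >"$out"
    fi
}

# usage: collect_nfs_diag [outdir] [nfsstat samples] [seconds between samples]
collect_nfs_diag() {
    outdir="${1:-/var/tmp/nfs-hang}"   # assumed default; keep it on local disk
    samples="${2:-5}"                  # assumed default
    interval="${3:-60}"                # assumed default
    mkdir -p "$outdir" || return 1

    capture "$outdir/ps.txt"       ps axlHww     # per-thread state and wait channels
    capture "$outdir/procstat.txt" procstat -ka  # kernel stacks of all threads
    capture "$outdir/netstat.txt"  netstat -a    # are the NFS TCP connections there?

    # Repeated client-side counters, to see whether any of them still change.
    i=1
    while [ "$i" -le "$samples" ]; do
        capture "$outdir/nfsstat.$i.txt" nfsstat -c
        [ "$i" -lt "$samples" ] && sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, "collect_nfs_diag /var/tmp/nfs-hang 5 60" would take five
nfsstat samples one minute apart. The output directory should be on a local
filesystem, since a write to the hung mount would block as well. The tcpdump
capture Rick describes is left out of the sketch because it runs until
interrupted and usually needs root; it can be started separately while the
script runs.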