Date:      Thu, 18 Nov 2010 14:54:23 +0200
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        freebsd-fs@freebsd.org, Oliver Fromme <olli@lurza.secnetix.de>
Subject:   Re: NFS hangs (7.3)
Message-ID:  <20101118125423.GD2392@deviant.kiev.zoral.com.ua>
In-Reply-To: <230979963.266261.1290084581845.JavaMail.root@erie.cs.uoguelph.ca>
References:  <201011171705.oAHH5age003849@lurza.secnetix.de> <230979963.266261.1290084581845.JavaMail.root@erie.cs.uoguelph.ca>

On Thu, Nov 18, 2010 at 07:49:41AM -0500, Rick Macklem wrote:
> > I've got a problem on a server farm. Every now and then,
> > some NFS mounts hang. This happens after a few days or
> > after a few weeks. All processes trying to access files
> > from the hanging mount go to state "D" and freeze. The
> > only way to resolve the problem is to reboot the server.
> >
> > "umount -f" also hangs and does not remove the hanging
> > mount (even though it disappears from the output of the
> > mount(8) command).
> >
> > Here's one example from an attempt to run df(1) which
> > also hangs:
> >
> > ps -uww:
> > USER   PID %CPU %MEM  VSZ  RSS TT  STAT STARTED    TIME COMMAND
> > root 61930  0.0  0.0 5728 1280 p4- D     5:15PM 0:00.01 /bin/df
> >
> > ps -lww:
> > UID   PID PPID CPU PRI NI  VSZ  RSS MWCHAN STAT TT     TIME COMMAND
> >   0 61930    1   0  -4  0 5728 1280 nfs    D    p4- 0:00.01 /bin/df
> >
> It would appear that the root vnode for the client mount
> point is locked for some reason. Here are a couple of possible
> explanations:
> 1 - An infrequently executed code path doesn't VOP_UNLOCK()/vput()
>     as it should. This seems relatively unlikely, since others are
>     using the client without difficulties, but it might be an error
>     case that only shows up for your environment.
> 2 - Another thread is holding the lock while stuck waiting for something
>     else. The most obvious "something else" would be an RPC reply from
>     the server. (A locking deadlock as mentioned below w.r.t. the spawning
>     of new nfsiod threads, could be another?)
>
> I'd suggest a "ps axHl" when this happens, and then look for a thread that
> is waiting for an RPC reply. I'd also suggest "nfsstat -c" done several
> times over a few minutes, to see if any of the counts is changing.
> Also, you can do "tcpdump -w xxx -s 0 host <nfs-server>" on the client
> for a while and then look at "xxx" in wireshark (it knows NFS packets)
> and see if there is any net traffic to/from the server. (This will tell
> you if it is a problem related to an RPC that is in progress vs something
> else.) It will also tell you if it is using TCP (or you can "netstat -a"
> to see if TCP connections are there for the NFS mounts).
>
> >
> > The machine is quite busy. The hangs seem to always occur
> > in the night when lots of cron jobs are running. The machine
> > has 221 NFS mounts and 26 nullfs mounts, and it has 26 jails,
> > if that matters. All NFS shares are mounted from a virtual
> > filer running on a NetApp filer. The mounts use the default
> > settings, so they should be v3 TCP (this is the default,
> > right?). The only extra option we use is -L in order to
> > "fake" locking locally.
> >
> > The machine is running FreeBSD 7.3-PRERELEASE-20100311 amd64.
> > Updating is somewhat complicated in that server farm, so I
> > haven't tried that so far because I'm not sure if it would
> > help.
> >
> I've only been working with 8/current, so I can't recall if
> there have been any client fixes for 7 since then, except there
> was a very recent change w.r.t. spawning of nfsiod threads to
> avoid LORs (lock order reversals, i.e. potential deadlocks) related to creating new kernel
> threads. I have no idea if one of these deadlocks might be involved.
> (Someone familiar with that might be able to comment?)

The changes for nfsiod creation are definitely not in 7.3-prerelease.

To diagnose the issue, we could start with the output of ps axlHww
(already suggested by Rick) and procstat -ka.
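
A rough sketch of how that output could be gathered in one go while a
mount is hung. This is only an illustration: the function names, file
names and defaults below are made up, and the target directory is
assumed to live on a local file system (writing to a hung NFS mount
would just block as well). It also samples "nfsstat -c" repeatedly, as
Rick suggested, so the counters can be compared afterwards.

```shell
#!/bin/sh
# Sketch only: function names, file names and defaults are illustrative.

# save_cmd DIR FILE CMD [ARGS...]
# Store CMD's stdout and stderr in DIR/FILE, or a short note if CMD is missing.
save_cmd() {
    sc_file="$1/$2"; shift 2
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$sc_file" 2>&1 || true
    else
        echo "$1: not found" >"$sc_file"
    fi
    return 0
}

# collect_nfs_hang DIR [SAMPLES] [INTERVAL_SECONDS]
collect_nfs_hang() {
    cn_dir="$1"; cn_samples="${2:-5}"; cn_interval="${3:-30}"
    mkdir -p "$cn_dir" || return 1

    save_cmd "$cn_dir" ps.txt       ps axlHww      # per-thread state and wait channel
    save_cmd "$cn_dir" procstat.txt procstat -ka   # kernel stack of every thread
    save_cmd "$cn_dir" netstat.txt  netstat -an    # sockets; -n skips name lookups

    # Repeat nfsstat -c so the client RPC counters can be compared between samples.
    cn_i=1
    while [ "$cn_i" -le "$cn_samples" ]; do
        save_cmd "$cn_dir" "nfsstat.$cn_i.txt" nfsstat -c
        if [ "$cn_i" -lt "$cn_samples" ]; then sleep "$cn_interval"; fi
        cn_i=$((cn_i + 1))
    done
    echo "diagnostics saved in $cn_dir"
}
```

For example, "collect_nfs_hang /var/tmp/nfs-hang 5 30" would take five
nfsstat samples 30 seconds apart; diffing nfsstat.1.txt against
nfsstat.5.txt then shows whether any counter is still moving.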



