Date:      Sat, 28 May 2022 00:12:41 +0200
From:      Andreas Kempe <kempe@lysator.liu.se>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: FreeBSD 12.3/13.1 NFS client hang
Message-ID:  <YpFM2bSMscG4ekc9@shipon.lysator.liu.se>
In-Reply-To: <YQBPR0101MB9742280313FC17543132A61CDDD89@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
References:  <YpEwxdGCouUUFHiE@shipon.lysator.liu.se> <YQBPR0101MB9742280313FC17543132A61CDDD89@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>

On Fri, May 27, 2022 at 08:59:57PM +0000, Rick Macklem wrote:
> Andreas Kempe <kempe@lysator.liu.se> wrote:
> > Hello everyone!
> >
> > I'm having issues with the NFS clients on FreeBSD 12.3 and 13.1
> > systems hanging when using a CentOS 7 server.
> First, make sure you are using hard mounts. "soft" or "intr" mounts won't
> work and will mess up the session sooner or later. (A messed up session could
> result in no free slots on the session and that will wedge threads in
> nfsv4_sequencelookup() as you describe.)
> (This is briefly described in the BUGS section of "man mount_nfs".)
> 

I had totally missed that soft and interruptible mounts have these
issues. I switched the FreeBSD machines to soft and intr on purpose
so that hung mounts could be recovered without restarting the machine.
Since they are shared machines, a hang caused by one user is an
inconvenience for all the other users.

Switching our test machine back to hard mounts did stop a recursive
grep from immediately causing the slot-type hang again.
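
To confirm that every client really is back on hard mounts, the output of nfsstat -m can be scanned for soft or intr. This is a minimal sketch, not a FreeBSD tool: it assumes that nfsstat -m prints one "server:/path on /mountpoint" line followed by one comma-separated option line per mount, and the function names risky_mounts and read_nfsstat are hypothetical.

```python
import subprocess

# Options that mount_nfs(8) warns can corrupt NFSv4.1 session state.
RISKY = {"soft", "intr"}


def risky_mounts(nfsstat_output: str) -> dict[str, list[str]]:
    """Return {mount line: problems} for mounts that are not plain hard mounts.

    Assumed input layout (hypothetical, per mount):
        server:/export on /mountpoint
        nfsv4,minorversion=1,tcp,hard,...
    """
    result: dict[str, list[str]] = {}
    header = None
    for raw in nfsstat_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " on " in line:
            header = line  # start of a new mount entry
        elif header is not None:
            opts = set(line.split(","))
            bad = sorted(opts & RISKY)
            if bad or "hard" not in opts:
                result[header] = bad or ["no 'hard' option listed"]
            header = None  # each mount has a single option line
    return result


def read_nfsstat() -> str:
    """Capture `nfsstat -m` on the local client (FreeBSD assumed)."""
    return subprocess.run(
        ["nfsstat", "-m"], capture_output=True, text=True, check=True
    ).stdout
```

On a client, risky_mounts(read_nfsstat()) should return an empty dict when every NFS mount lists hard and neither soft nor intr.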

> Do a:
> # nfsstat -m
> on the clients and look for "hard".
> 
> Next, is there anything logged on the console for the 13.1 client(s)?
> (13.1 has some diagnostics for things like a server replying with the
>  wrong session slot#.)
> 

The one thing we have seen logged are messages along the lines of:
kernel: newnfs: server 'mail' error: fileid changed. fsid 4240eca6003a052a:0: expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)

> Also, maybe I'm old fashioned, but I find "ps axHl" useful, since it shows
> where all the processes are sleeping.
> And "procstat -kk" covers all of the locks.
> 

I don't know if it is a matter of being old fashioned as much as one
of taste. :) In future dumps, I can provide both ps axHl and procstat -kk.

> > Below are procstat kstack $PID invocations showing where the processes
> > have hung. In the nfsv4_sequencelookup it seems hung waiting for
> > nfsess_slots to have an available slot. In the second nfs_lock case,
> > it seems the processes are stuck waiting on vnode locks.
> > 
> > These issues appear seemingly at random, but they can also be
> > provoked by operations that open a lot of files or create a lot of
> > file locks. An example that often triggers a hang is a recursive
> > grep through a large file hierarchy like the FreeBSD codebase.
> >
> > The NFS code is large and complicated so any advice is appreciated!
> Yea. I'm the author and I don't know exactly what it all does;-)
> 
> > Cordially,
> > Andreas Kempe
> >
>
> [...]
>
> Not very useful unless you have all the processes and their locks to try and figure out what is holding
> the vnode locks.
> 

Yes, I sent this mostly in the hope that it might be something that
someone has seen before. I understand that more verbose information is
needed to track down the lock contention.

I'll switch our machines back to using hard mounts and try to get as
much diagnostic information as possible when the next lockup happens.
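
To capture what was asked for at the moment of a lockup, the three commands mentioned in this thread (ps axHl, procstat -kk and nfsstat -m) can be run together and saved side by side. This is a minimal sketch assuming a FreeBSD client: -a is procstat's flag for selecting all processes, while the output file names, the nfs-hang-<timestamp> directory layout and the 60-second per-command timeout are arbitrary, hypothetical choices.

```python
import subprocess
import time
from pathlib import Path

# Output file name -> command. procstat -kk prints kernel stacks with offsets.
COMMANDS = {
    "ps_axHl.txt": ["ps", "axHl"],
    "procstat_kk.txt": ["procstat", "-kk", "-a"],
    "nfsstat_m.txt": ["nfsstat", "-m"],
}


def snapshot(outdir: str) -> Path:
    """Run each command and save its output under a timestamped directory."""
    dest = Path(outdir) / time.strftime("nfs-hang-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    for filename, cmd in COMMANDS.items():
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
            text = proc.stdout + proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the whole snapshot.
            text = f"could not run {' '.join(cmd)}: {exc}\n"
        (dest / filename).write_text(text)
    return dest
```

The output directory should be on a local filesystem such as /var/tmp rather than on an NFS mount, since writing to a wedged mount would hang the collector too.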

Do you have any good suggestions for tracking down the issue? I've
been contemplating enabling WITNESS or building with debug information
so that I can attach the kernel debugger.

Thank you very much for your reply!
Cordially,
Andreas Kempe

> rick
> 
> 


