Date:      Sun, 30 Nov 2025 18:19:57 -0500
From:      J David <j.david.lists@gmail.com>
To:        Rick Macklem <rick.macklem@gmail.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: NFSv4.2 hangs on 14.3
Message-ID:  <CABXB=RSX0sxD=vAGis156PZzMEu-m4Kd5nQZv-FbogkctkHddQ@mail.gmail.com>
In-Reply-To: <CAM5tNy5b7Eda2gwH-H9tzftqRcEsb07to1GD99ZPak4RQ9wYiA@mail.gmail.com>
References:  <CABXB=RQL0tqnE34G6PGLn6AmcwSpapm0-forQZ5vLBQBwcA12Q@mail.gmail.com> <CAM5tNy7eHH7qmTXLRQ9enDAwUzjUXtjugi093eUoRkDbGDCYVQ@mail.gmail.com> <CABXB=RQ6qSNp==Qa_m-=S8cKzxJU2pbuEDjeGfdr7L8Z0=dmGA@mail.gmail.com> <CABXB=RRHz20XwLDCz7qss1=0hXZK-SXz8X7pm4w8o8r2byxH2A@mail.gmail.com> <CAM5tNy6kQMtxe1Sdt_3yQv00ud-xMUsW1m52V2Gn6zy4tnka6Q@mail.gmail.com> <CABXB=RRDABxmgZMadGManyEO3ecy2x-myBZ8bbyjx7UePn%2BcLw@mail.gmail.com> <CAM5tNy65A7QzAS7Ww-dk9Eqx0_xvJAQDPnqEA4D8fWAyB%2BMU2Q@mail.gmail.com> <CABXB=RRH2QkkDiurNWZH8ZeJtCQHBz8XsKg9QjJ7Eg%2BoGSZguA@mail.gmail.com> <CAM5tNy5b7Eda2gwH-H9tzftqRcEsb07to1GD99ZPak4RQ9wYiA@mail.gmail.com>


On Sun, Nov 30, 2025 at 4:09 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> Well "Initiate recovery.." means that the server replied with
> a NFSERR_BADSESSION. This does not normally happen
> unless the server reboots or the client does something that
> is normally done upon dismount.

Without knowing which server it's coming from, I don't know how we
could check that further.

> You don't use something like the automounter, do you?

Technically, yes. The automounter is used for one small server that
has a couple of directories of rarely consulted shared config
information. However, I'm not aware of any problems with it in many
years. So even if the automounter is somehow causing these messages,
it's a red herring for the server hang issue.

> As for "expired locks lost", that means the client has
> received a NFSERR_EXPIRED reply from a server.
> (This normally only happens when the client is network
> partitioned from the server for > 1 minute and, for the
> FreeBSD server, another client makes a conflicting
> lock request.)

There's just no evidence that such a thing happened. If the client
were unable to reach the server for a full minute, there would be all
kinds of warnings and errors from the client code. (And I would
probably expect to see the good old "NFS server blahblah not
responding still trying" message somewhere.)

> All I can say is you shouldn't be seeing what you are seeing
> from what I know and I can only conjecture that some sort
> of network partitioning (or maybe repeated mounts/dismounts
> if you are using an automounter) is causing this?

We do have other repeated mounts/dismounts that aren't caused by the
automounter.

Some of our NFS servers have code and others have data. The ones that
hang are the "code" servers, which are continuously mounted.

Mounts are done against the "data" servers as needed. I.e., a job
comes in, the relevant directory from the data server is mounted, the
job runs, the directory is unmounted. I won't say we never have any
problems with that, but it's way less frequent and only hangs the one
job, whereas these "code" server hangs pretty much take down the whole
client node.
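
To make the per-job pattern concrete, here is a rough sketch of the
mount / run / unmount cycle. The server name, export path, mount point,
and the injectable runner are hypothetical placeholders. The options
string uses FreeBSD mount_nfs options (nfsv4, minorversion=2,
oneopenown), but it is an assumption about a plausible configuration,
not the exact command line in use.

```python
import subprocess
from contextlib import contextmanager


def run_command(argv):
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(argv, check=True)


@contextmanager
def job_data_mount(server, export, mountpoint, runner=run_command):
    """Mount server:export at mountpoint for the duration of one job.

    server, export and mountpoint are hypothetical placeholders.
    """
    # Assumed options; adjust to whatever the site actually uses.
    options = "nfsv4,minorversion=2,oneopenown"
    runner(["mount", "-t", "nfs", "-o", options,
            f"{server}:{export}", mountpoint])
    try:
        yield mountpoint
    finally:
        # Always unmount, even if the job raised an exception.
        runner(["umount", mountpoint])
```

Passing a different runner (for example, one that only records the
commands) allows a dry run without touching any real mounts.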

It might be important to restate that there is currently *no*
correlation established between the "Initiate recovery" messages and
our hanging mounts. They may very well be harmless.

It's only the "Wrong session" message that is demonstrably highly
correlated with incidents of hanging mounts.
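
One low-cost way to keep checking that correlation is to tally the
messages per client and compare the counts against the hang incidents.
A minimal sketch follows. The substrings are taken from the messages as
quoted in this thread and may not match the full kernel log wording, and
the function name is hypothetical.

```python
from collections import Counter

# Substrings taken from the messages quoted in this thread; the exact
# kernel log wording may differ, so treat these as assumptions.
PATTERNS = ("Wrong session", "Initiate recovery", "expired locks lost")


def tally_messages(lines):
    """Count how many log lines contain each pattern of interest."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts
```

It could be fed the lines of a client's system log (for example, an open
file object), once per node, and the per-node totals lined up against
the dates of known hangs.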

> # tcpdump -s 0 -w out.pcap host <nfs-server>

This is probably not feasible because of the number of servers
involved and the relative rarity of hangs. For us to get a hang every
week or two means the individual nodes may go months between hangs.

> Do you have nullfs mounts sitting on top of the NFS mounts?

Yes, the "code" mounts use nullfs mounts, one per job.

> There is a known issue that occurs when nullfs mounts are on
> top of NFSv4 mounts,

Yes, that came up for us and killed my initial attempt to deploy NFSv4
in 2020. I thought it was fixed around FreeBSD 13? The OpenOwners
issue is definitely still there; it requires us to use oneopenown,
which in turn prohibits us from using delegations, but that isn't
specific to nullfs. Other than that... and this... NFS 4.2 has been
pretty good to us.

Thanks!


