Date:      Sun, 30 Nov 2025 16:41:17 -0800
From:      Rick Macklem <rick.macklem@gmail.com>
To:        J David <j.david.lists@gmail.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: NFSv4.2 hangs on 14.3
Message-ID:  <CAM5tNy5rqJcfHuZiUh9Qy2k-O4n7wrm2dv4jRDSCfPGe3F0iQQ@mail.gmail.com>
In-Reply-To: <CABXB=RSX0sxD=vAGis156PZzMEu-m4Kd5nQZv-FbogkctkHddQ@mail.gmail.com>
References:  <CABXB=RQL0tqnE34G6PGLn6AmcwSpapm0-forQZ5vLBQBwcA12Q@mail.gmail.com> <CAM5tNy7eHH7qmTXLRQ9enDAwUzjUXtjugi093eUoRkDbGDCYVQ@mail.gmail.com> <CABXB=RQ6qSNp==Qa_m-=S8cKzxJU2pbuEDjeGfdr7L8Z0=dmGA@mail.gmail.com> <CABXB=RRHz20XwLDCz7qss1=0hXZK-SXz8X7pm4w8o8r2byxH2A@mail.gmail.com> <CAM5tNy6kQMtxe1Sdt_3yQv00ud-xMUsW1m52V2Gn6zy4tnka6Q@mail.gmail.com> <CABXB=RRDABxmgZMadGManyEO3ecy2x-myBZ8bbyjx7UePn%2BcLw@mail.gmail.com> <CAM5tNy65A7QzAS7Ww-dk9Eqx0_xvJAQDPnqEA4D8fWAyB%2BMU2Q@mail.gmail.com> <CABXB=RRH2QkkDiurNWZH8ZeJtCQHBz8XsKg9QjJ7Eg%2BoGSZguA@mail.gmail.com> <CAM5tNy5b7Eda2gwH-H9tzftqRcEsb07to1GD99ZPak4RQ9wYiA@mail.gmail.com> <CABXB=RSX0sxD=vAGis156PZzMEu-m4Kd5nQZv-FbogkctkHddQ@mail.gmail.com>


On Sun, Nov 30, 2025 at 3:20 PM J David <j.david.lists@gmail.com> wrote:
>
> On Sun, Nov 30, 2025 at 4:09 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > Well "Initiate recovery.." means that the server replied with
> > a NFSERR_BADSESSION. This does not normally happen
> > unless the server reboots or the client does something that
> > is normally done upon dismount.
>
> Without knowing which server it's coming from, I don't know how we
> could check that further.
>
> > You don't use something like the automounter, do you?
>
> Technically, yes. There is one small server that has a couple of
> directories with shared config information that is rarely consulted
> that the automounter is used for. However, there have not been any
> problems with that that I'm aware of in many years. So if the
> automounter is somehow causing these messages, it's a red herring for
> the server hang issue.
>
> > As for "expired locks lost", that means the client has
> > received a NFSERR_EXPIRED reply from a server.
> > (This normally only happens when the client is network
> > partitioned from the server for > 1 minute and, for the
> > FreeBSD server, another client makes a conflicting
> > lock request.)
>
> There's just no evidence that such a thing happened. If the client
> were unable to reach the server for a full minute, there would be all
> kinds of warnings and errors from the client code. (And I would
> probably expect to see the good old "NFS server blahblah not
> responding still trying" message somewhere.)
>
> > All I can say is you shouldn't be seeing what you are seeing
> > from what I know and I can only conjecture that some sort
> > of network partitioning (or maybe repeated mounts/dismounts
> > if you are using an automounter) is causing this?
>
> We do have other repeated mounts/dismounts that aren't caused by the
> automounter.
>
> Some of our NFS servers have code and others have data. The ones that
> hang are the "code" servers, which are continuously mounted.
>
> Mounts are done against the "data" servers as needed. I.e., a job
> comes in, the relevant directory from the data server is mounted, the
> job runs, the directory is unmounted. I won't say we never have any
> problems with that, but it's way less frequent and only hangs the one
> job, whereas these "code" server hangs pretty much take down the whole
> client node.
>
> It might be important to restate that there is currently *no*
> correlation established between the "Initiate recovery" messages and
> our hanging mounts. They may very well be harmless.
I now realize that the message "Initiate recovery. If server has not
rebooted, check NFS clients for unique /etc/hostid's" is incomplete.
Two NFSv4 mounts of the same server file system from one client also
give the server two mounts with the same /etc/hostid for that file
system, and that causes the same issue.

I suppose the message should be something like:
"Initiate recovery. If server has not rebooted, check that two NFS
clients do not have the same /etc/hostid or that one NFS client
does not have two mounts that access the file system"
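
As a rough illustration of the second check (one client with two mounts of
the same file system), here is a minimal Python sketch. It is not part of
the NFS client. It assumes mount(8) output lines of the form
"server:/path on /mountpoint (nfs, options)", and the function name
duplicate_nfs_mounts and the sample host names are hypothetical. It only
catches mounts whose "server:/path" strings are identical, so two different
paths inside the same server file system would not be reported.

```python
import re
from collections import defaultdict

# Assumed mount(8) line format: "server:/path on /mountpoint (nfs, opts...)"
MOUNT_LINE = re.compile(
    r"^(?P<source>\S+:\S+) on (?P<target>.+) \((?P<opts>[^)]*)\)$"
)


def duplicate_nfs_mounts(mount_output):
    """Return {source: [mount points]} for NFS sources mounted more than once.

    Only exact "server:/path" matches are grouped; this is a heuristic,
    not a test of whether two mounts reach the same server file system.
    """
    by_source = defaultdict(list)
    for line in mount_output.splitlines():
        m = MOUNT_LINE.match(line.strip())
        if m is None:
            continue  # not a "host:/path on ..." line (e.g. a nullfs mount)
        fstype = m.group("opts").split(",")[0].strip()
        if fstype != "nfs":
            continue
        by_source[m.group("source")].append(m.group("target"))
    return {src: pts for src, pts in by_source.items() if len(pts) > 1}


# Hypothetical sample resembling "mount" output on a client.
sample = """\
code1:/export/code on /code (nfs, nfsv4acls)
data1:/export/job42 on /jobs/42 (nfs, nfsv4acls)
/code on /jails/job42/code (nullfs, local, read-only)
code1:/export/code on /mnt/code2 (nfs, nfsv4acls)
"""

for source, points in duplicate_nfs_mounts(sample).items():
    print(source, "is mounted at", ", ".join(points))
```

On a real client the text passed in would be the output of "mount -t nfs"
captured on that node.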

rick

>
> It's only the "Wrong session" message that is demonstrably highly
> correlated with incidents of hanging mounts.
>
> > # tcpdump -s 0 -w out.pcap host <nfs-server>
>
> This is probably not feasible because of the number of servers
> involved and the relative rarity of hangs. For us to get a hang every
> week or two means the individual nodes may go months between hangs.
>
> > Do you have nullfs mounts sitting on top of the NFS mounts?
>
> Yes, the "code" mounts use nullfs mounts, one per job.
>
> > There is a known issue that occurs when nullfs mounts are on
> > top of NFSv4 mounts,
>
> Yes, that came up for us and killed my initial attempt to deploy NFSv4
> in 2020. I thought it was fixed around FreeBSD 13? The OpenOwners
> issue is definitely still there, which requires us to use oneopenown
> which prohibits us from using delegations, but that isn't specific to
> nullfs. Other than that... and this... NFS 4.2 has been pretty good to
> us.
>
> Thanks!


