Date: Sun, 30 Nov 2025 16:41:17 -0800
From: Rick Macklem <rick.macklem@gmail.com>
To: J David <j.david.lists@gmail.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: NFSv4.2 hangs on 14.3
Message-ID: <CAM5tNy5rqJcfHuZiUh9Qy2k-O4n7wrm2dv4jRDSCfPGe3F0iQQ@mail.gmail.com>
In-Reply-To: <CABXB=RSX0sxD=vAGis156PZzMEu-m4Kd5nQZv-FbogkctkHddQ@mail.gmail.com>
References: <CABXB=RQL0tqnE34G6PGLn6AmcwSpapm0-forQZ5vLBQBwcA12Q@mail.gmail.com>
 <CAM5tNy7eHH7qmTXLRQ9enDAwUzjUXtjugi093eUoRkDbGDCYVQ@mail.gmail.com>
 <CABXB=RQ6qSNp==Qa_m-=S8cKzxJU2pbuEDjeGfdr7L8Z0=dmGA@mail.gmail.com>
 <CABXB=RRHz20XwLDCz7qss1=0hXZK-SXz8X7pm4w8o8r2byxH2A@mail.gmail.com>
 <CAM5tNy6kQMtxe1Sdt_3yQv00ud-xMUsW1m52V2Gn6zy4tnka6Q@mail.gmail.com>
 <CABXB=RRDABxmgZMadGManyEO3ecy2x-myBZ8bbyjx7UePn%2BcLw@mail.gmail.com>
 <CAM5tNy65A7QzAS7Ww-dk9Eqx0_xvJAQDPnqEA4D8fWAyB%2BMU2Q@mail.gmail.com>
 <CABXB=RRH2QkkDiurNWZH8ZeJtCQHBz8XsKg9QjJ7Eg%2BoGSZguA@mail.gmail.com>
 <CAM5tNy5b7Eda2gwH-H9tzftqRcEsb07to1GD99ZPak4RQ9wYiA@mail.gmail.com>
 <CABXB=RSX0sxD=vAGis156PZzMEu-m4Kd5nQZv-FbogkctkHddQ@mail.gmail.com>

On Sun, Nov 30, 2025 at 3:20 PM J David <j.david.lists@gmail.com> wrote:
>
> On Sun, Nov 30, 2025 at 4:09 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > Well "Initiate recovery.." means that the server replied with
> > a NFSERR_BADSESSION. This does not normally happen
> > unless the server reboots or the client does something that
> > is normally done upon dismount.
>
> Without knowing which server it's coming from, I don't know how we
> could check that further.
>
> > You don't use something like the automounter, do you?
>
> Technically, yes. There is one small server that has a couple of
> directories with shared config information that is rarely consulted
> that the automounter is used for. However, there have not been any
> problems with that that I'm aware of in many years. So if the
> automounter is somehow causing these messages, it's a red herring for
> the server hang issue.
>
> > As for "expired locks lost", that means the client has
> > received a NFSERR_EXPIRED reply from a server.
> > (This normally only happens when the client is network
> > partitioned from the server for > 1 minute and, for the
> > FreeBSD server, another client makes a conflicting
> > lock request.)
>
> There's just no evidence that such a thing happened. If the client
> were unable to reach the server for a full minute, there would be all
> kinds of warnings and errors from the client code. (And I would
> probably expect to see the good old "NFS server blahblah not
> responding still trying" message somewhere.)
>
> > All I can say is you shouldn't be seeing what you are seeing
> > from what I know and I can only conjecture that some sort
> > of network partitioning (or maybe repeated mounts/dismounts
> > if you are using an automounter) is causing this?
>
> We do have other repeated mounts/dismounts that aren't caused by the
> automounter.
>
> Some of our NFS servers have code and others have data. The ones that
> hang are the "code" servers, which are continuously mounted.
>
> Mounts are done against the "data" servers as needed. I.e., a job
> comes in, the relevant directory from the data server is mounted, the
> job runs, the directory is unmounted. I won't say we never have any
> problems with that, but it's way less frequent and only hangs the one
> job, whereas these "code" server hangs pretty much take down the whole
> client node.
>
> It might be important to restate that there is currently *no*
> correlation established between the "Initiate recovery" messages and
> our hanging mounts. They may very well be harmless.
I now realize that the message
"Initiate recovery. If server has not rebooted, check NFS clients for
unique /etc/hostid's"
is not complete, because it misses the case where one client has two
NFSv4 mounts of the same server file system. That results in two mounts
with the same /etc/hostid for the same file system and causes the same
issue.

I suppose the message should be something like:
"Initiate recovery. If server has not rebooted, check that two NFS clients
do not have the same /etc/hostid or that one NFS client does not have two
mounts that access the file system"

rick

>
> It's only the "Wrong session" message that is demonstrably highly
> correlated with incidents of hanging mounts.
>
> > # tcpdump -s 0 -w out.pcap host <nfs-server>
>
> This is probably not feasible because of the number of servers
> involved and the relative rarity of hangs. For us to get a hang every
> week or two means the individual nodes may go months between hangs.
>
> > Do you have nullfs mounts sitting on top of the NFS mounts?
>
> Yes, the "code" mounts use nullfs mounts, one per job.
>
> > There is a known issue that occurs when nullfs mounts are on
> > top of NFSv4 mounts,
>
> Yes, that came up for us and killed my initial attempt to deploy NFSv4
> in 2020. I thought it was fixed around FreeBSD 13?
> The OpenOwners issue is definitely still there, which requires us to
> use oneopenown which prohibits us from using delegations, but that
> isn't specific to nullfs. Other than that... and this... NFS 4.2 has
> been pretty good to us.
>
> Thanks!
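
The suggested wording above adds a second thing to check: whether one client has two mounts of the same server file system. A minimal sketch of that check follows, assuming the mount table is available as fstab-style text (device, mount point, type, options), which is roughly what "mount -p" prints on FreeBSD. The function name, server names, and paths are hypothetical, and the match is on the exact "server:/path" string only, so it would miss two different exported paths that live on the same server file system, or the same server reached under two names.

```python
from collections import defaultdict


def find_duplicate_nfs_mounts(fstab_text):
    """Return {remote: [mount points]} for NFS remotes mounted more than once.

    fstab_text is assumed to hold fstab-style lines:
        device  mountpoint  fstype  options  dump  pass
    """
    mounts = defaultdict(list)
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a usable mount entry
        device, mountpoint, fstype = fields[0], fields[1], fields[2]
        if fstype != "nfs":
            continue  # ignore ufs, nullfs, and other local types
        mounts[device].append(mountpoint)
    # Keep only the remotes that appear at more than one mount point.
    return {dev: pts for dev, pts in mounts.items() if len(pts) > 1}


# Hypothetical sample input; the server names and paths are made up.
SAMPLE = """\
/dev/ada0p2         /              ufs     rw  1 1
code1:/export/code  /code          nfs     rw,nfsv4,minorversion=2,oneopenown  0 0
data1:/export/job7  /jobs/7        nfs     rw,nfsv4,minorversion=2  0 0
code1:/export/code  /mnt/code2     nfs     rw,nfsv4,minorversion=2,oneopenown  0 0
/code               /jails/a/code  nullfs  ro  0 0
"""

if __name__ == "__main__":
    for remote, points in find_duplicate_nfs_mounts(SAMPLE).items():
        print(f"{remote} is mounted {len(points)} times: {', '.join(points)}")
```

The nullfs entry in the sample is deliberately ignored: a nullfs mount layered on an NFS mount is a separate topic in this thread and is not a second NFS mount of the server file system.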
