Date: Sun, 30 Nov 2025 19:02:44 -0800
From: Rick Macklem <rick.macklem@gmail.com>
To: J David <j.david.lists@gmail.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: NFSv4.2 hangs on 14.3
Message-ID: <CAM5tNy5SUtYSTne1EoQh12eiDyiJ7cdECD4+XbhvB5MnqjY+jA@mail.gmail.com>
In-Reply-To: <CAM5tNy5rqJcfHuZiUh9Qy2k-O4n7wrm2dv4jRDSCfPGe3F0iQQ@mail.gmail.com>
References: <CABXB=RQL0tqnE34G6PGLn6AmcwSpapm0-forQZ5vLBQBwcA12Q@mail.gmail.com>
 <CAM5tNy7eHH7qmTXLRQ9enDAwUzjUXtjugi093eUoRkDbGDCYVQ@mail.gmail.com>
 <CABXB=RQ6qSNp==Qa_m-=S8cKzxJU2pbuEDjeGfdr7L8Z0=dmGA@mail.gmail.com>
 <CABXB=RRHz20XwLDCz7qss1=0hXZK-SXz8X7pm4w8o8r2byxH2A@mail.gmail.com>
 <CAM5tNy6kQMtxe1Sdt_3yQv00ud-xMUsW1m52V2Gn6zy4tnka6Q@mail.gmail.com>
 <CABXB=RRDABxmgZMadGManyEO3ecy2x-myBZ8bbyjx7UePn+cLw@mail.gmail.com>
 <CAM5tNy65A7QzAS7Ww-dk9Eqx0_xvJAQDPnqEA4D8fWAyB+MU2Q@mail.gmail.com>
 <CABXB=RRH2QkkDiurNWZH8ZeJtCQHBz8XsKg9QjJ7Eg+oGSZguA@mail.gmail.com>
 <CAM5tNy5b7Eda2gwH-H9tzftqRcEsb07to1GD99ZPak4RQ9wYiA@mail.gmail.com>
 <CABXB=RSX0sxD=vAGis156PZzMEu-m4Kd5nQZv-FbogkctkHddQ@mail.gmail.com>
 <CAM5tNy5rqJcfHuZiUh9Qy2k-O4n7wrm2dv4jRDSCfPGe3F0iQQ@mail.gmail.com>
On Sun, Nov 30, 2025 at 4:41 PM Rick Macklem <rick.macklem@gmail.com> wrote:
>
> On Sun, Nov 30, 2025 at 3:20 PM J David <j.david.lists@gmail.com> wrote:
> >
> > On Sun, Nov 30, 2025 at 4:09 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > > Well "Initiate recovery.." means that the server replied with
> > > a NFSERR_BADSESSION. This does not normally happen
> > > unless the server reboots or the client does something that
> > > is normally done upon dismount.
> >
> > Without knowing which server it's coming from, I don't know how we
> > could check that further.
> >
> > > You don't use something like the automounter, do you?
> >
> > Technically, yes. There is one small server with a couple of
> > directories of rarely consulted shared config information that
> > the automounter is used for. However, there have not been any
> > problems with that that I'm aware of in many years. So if the
> > automounter is somehow causing these messages, it's a red herring
> > for the server hang issue.
> >
> > > As for "expired locks lost", that means the client has
> > > received a NFSERR_EXPIRED reply from a server.
> > > (This normally only happens when the client is network
> > > partitioned from the server for > 1 minute and, for the
> > > FreeBSD server, another client makes a conflicting
> > > lock request.)
> >
> > There's just no evidence that such a thing happened. If the client
> > were unable to reach the server for a full minute, there would be all
> > kinds of warnings and errors from the client code. (And I would
> > probably expect to see the good old "NFS server blahblah not
> > responding still trying" message somewhere.)
> >
> > > All I can say is you shouldn't be seeing what you are seeing
> > > from what I know, and I can only conjecture that some sort
> > > of network partitioning (or maybe repeated mounts/dismounts
> > > if you are using an automounter) is causing this?
> >
> > We do have other repeated mounts/dismounts that aren't caused by the
> > automounter.
> >
> > Some of our NFS servers have code and others have data. The ones that
> > hang are the "code" servers, which are continuously mounted.
> >
> > Mounts are done against the "data" servers as needed. I.e., a job
> > comes in, the relevant directory from the data server is mounted, the
> > job runs, the directory is unmounted. I won't say we never have any
> > problems with that, but it's much less frequent and only hangs the one
> > job, whereas these "code" server hangs pretty much take down the whole
> > client node.
> >
> > It might be important to restate that there is currently *no*
> > established correlation between the "Initiate recovery" messages and
> > our hanging mounts. They may very well be harmless.
>
> I now realize that the message "Initiate recovery. If server has not
> rebooted, check NFS clients for unique /etc/hostid's" is incomplete,
> because one client having two NFSv4 mounts of the same server file
> system (both using the same /etc/hostid) causes the same issue.
>
> I suppose the message should be something like..
> "Initiate recovery. If server has not rebooted, check that two NFS
> clients do not have the same /etc/hostid or that one NFS client
> does not have two mounts that access the file system"

Oh, and although it indicates a serious problem in general, for the
case where the mount is read-only and nolockd, it doesn't matter much,
since the mount point isn't using the state anyhow.

The underlying issue probably causes the "Wrong session.." error,
which is probably what causes the hangs. (Hopefully the patch does
resolve the "Wrong session.." hangs.)

rick

> rick
> >
> > It's only the "Wrong session" message that is demonstrably highly
> > correlated with incidents of hanging mounts.
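The duplicate-hostid check suggested above is easy to script. A minimal sketch, with the hostid strings inlined as made-up sample data; in practice each value would be gathered from a client's /etc/hostid (for example with ssh), and the UUID-style values shown are hypothetical:

```shell
# Detect duplicate hostids among a set of NFS clients.
# Each line below stands in for the contents of one client's
# /etc/hostid; "sort | uniq -d" prints any value seen more than once,
# which is exactly the condition that can provoke NFSERR_BADSESSION
# recovery storms against the server.
printf '%s\n' \
    '6b0b3a2e-0000-0000-0000-aaaaaaaaaaaa' \
    '6b0b3a2e-0000-0000-0000-bbbbbbbbbbbb' \
    '6b0b3a2e-0000-0000-0000-aaaaaaaaaaaa' |
    sort | uniq -d
```

An empty output means every sampled hostid is unique; any line printed is a hostid shared by at least two clients.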
> >
> > > # tcpdump -s 0 -w out.pcap host <nfs-server>
> >
> > This is probably not feasible because of the number of servers
> > involved and the relative rarity of hangs. For us to get a hang every
> > week or two means the individual nodes may go months between hangs.
> >
> > > Do you have nullfs mounts sitting on top of the NFS mounts?
> >
> > Yes, the "code" mounts use nullfs mounts, one per job.
> >
> > > There is a known issue that occurs when nullfs mounts are on
> > > top of NFSv4 mounts,
> >
> > Yes, that came up for us and killed my initial attempt to deploy NFSv4
> > in 2020. I thought it was fixed around FreeBSD 13? The OpenOwners
> > issue is definitely still there, which requires us to use oneopenown,
> > which prohibits us from using delegations, but that isn't specific to
> > nullfs. Other than that... and this... NFS 4.2 has been pretty good to
> > us.
> >
> > Thanks!
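One way to make the suggested capture feasible despite months between hangs is tcpdump's ring-buffer mode, which bounds disk use while running indefinitely. A sketch only; the interface name, file size, file count, and output path are assumptions, not details from the thread:

```shell
# Ring-buffer capture: rotate through at most 10 files of ~100 MB
# each (-C size in MB, -W file count), so the capture can run
# unattended for weeks while waiting for a rare hang. When a hang
# occurs, stop tcpdump and keep the current rotation files.
# "em0", the sizes, the path, and "nfs-server" are placeholders.
tcpdump -i em0 -s 0 -C 100 -W 10 \
    -w /var/tmp/nfs-hang.pcap host nfs-server
```

The trade-off is that only the most recent ~1 GB of traffic survives, so the capture must be stopped promptly once a hang is noticed.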
