Date: Sun, 30 Nov 2025 13:15:52 -0800
From: Rick Macklem <rick.macklem@gmail.com>
To: J David <j.david.lists@gmail.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: NFSv4.2 hangs on 14.3
Message-ID: <CAM5tNy7MzZYj8A8Em2P=7NPdMiMg7Rs0qBtsPnw0JwvwrW_=Tg@mail.gmail.com>
In-Reply-To: <CABXB=RRH2QkkDiurNWZH8ZeJtCQHBz8XsKg9QjJ7Eg%2BoGSZguA@mail.gmail.com>
References: <CABXB=RQL0tqnE34G6PGLn6AmcwSpapm0-forQZ5vLBQBwcA12Q@mail.gmail.com>
 <CAM5tNy7eHH7qmTXLRQ9enDAwUzjUXtjugi093eUoRkDbGDCYVQ@mail.gmail.com>
 <CABXB=RQ6qSNp==Qa_m-=S8cKzxJU2pbuEDjeGfdr7L8Z0=dmGA@mail.gmail.com>
 <CABXB=RRHz20XwLDCz7qss1=0hXZK-SXz8X7pm4w8o8r2byxH2A@mail.gmail.com>
 <CAM5tNy6kQMtxe1Sdt_3yQv00ud-xMUsW1m52V2Gn6zy4tnka6Q@mail.gmail.com>
 <CABXB=RRDABxmgZMadGManyEO3ecy2x-myBZ8bbyjx7UePn%2BcLw@mail.gmail.com>
 <CAM5tNy65A7QzAS7Ww-dk9Eqx0_xvJAQDPnqEA4D8fWAyB%2BMU2Q@mail.gmail.com>
 <CABXB=RRH2QkkDiurNWZH8ZeJtCQHBz8XsKg9QjJ7Eg%2BoGSZguA@mail.gmail.com>

On Sun, Nov 30, 2025 at 10:31 AM J David <j.david.lists@gmail.com> wrote:
>
> Fudge, I accidentally did reply earlier instead of reply all, so the
> bits about ZFS on the server didn't go to the list. (But I don't think
> they're super relevant.)
>
>
> On Sat, Nov 29, 2025 at 4:10 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > Finally, I'll note that "Initiating.." should only occur when a server
> > reboots. (It might happen after a long enough network partition or
> > a realllyyy slllooowwww server response, but that would have to
> > be over a minute to have any chance of causing this.)
>
> To follow up on this part, I did the promised central logging and got
> this on one client overnight:
>
> 2025-11-30 07:50:04.958 Initiate recovery. If server has not rebooted,
> check NFS clients for unique /etc/hostid's
> 2025-11-30 07:50:04.958 nfsv4 expired locks lost
> 2025-11-30 07:52:45.103 nfscl: never fnd open
> 2025-11-30 07:52:45.103 last message repeated 1 times
>
> None of those things (reboot, partition, or really slow responses)
> happened at that time, and this was logged on only one out of dozens
> of client nodes, none of the rest of which reported issues.
The only thing I can think of that would help diagnose this is a packet
capture when the above happens.
I realize that might mean capturing packets for an extended period
of time (restarting the capture repeatedly, so that the pcap doesn't
get absolutely huge) until you luck out and catch one of the above
occurrences.
# tcpdump -s 0 -w out.pcap host <nfs-server>
should do it
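
To keep a long-running capture from growing without bound, tcpdump's
own file rotation may be enough instead of restarting it by hand. A
minimal sketch, assuming the documented -C (start a new file after that
many million bytes) and -W (cap the number of files, then overwrite the
oldest) options; the server name, the file name, and the 100/20 limits
are placeholders, not values from this thread:

```sh
# Placeholder: substitute the real NFS server's name or address.
SERVER="nfs-server.example.org"

# -s 0    capture whole packets (as in the command above)
# -C 100  switch to a new file after about 100 million bytes
# -W 20   keep at most 20 files, then overwrite the oldest (ring buffer)
CAPTURE_CMD="tcpdump -s 0 -C 100 -W 20 -w nfs.pcap host $SERVER"

# Print the command for review; run it as root once it looks right.
echo "$CAPTURE_CMD"
```

With those limits the capture stays at roughly 2 GB on disk. It would
need to be stopped soon after the next "Initiate recovery" is logged,
before the interesting packets get overwritten; a busy client may need
larger limits.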
rick
ps: I've also seen a case where the network seemed to work just
fine, except for one specific NFS packet, which was consistently
dropped by a network switch. Replacing the switch fixed the
problem. (The switch was sent back to the vendor, who returned
it claiming it was functioning correctly. I then threw the switch in
the recycle pile.)
>
> There were no other messages in the dmesg from the preceding ~10 hours
> since I started logging to a different machine. The last thing
> syslogged before that was atrun at 07:50:00, but there are no at jobs
> on that system. Network monitoring reported no issues (or, indeed,
> kernel log messages) at or around that time.
>
> As far as I can tell, this had no ill effects. That node continues to
> operate normally. So the only problem it points to is not knowing why
> it happened if it wasn't one of the things on your list.
>
> These servers do mount other filesystems from other servers, so I
> can't even promise you that this message pertains to the servers we're
> seeing hangs against. Would it be possible/simple to modify the
> "Initiate recovery" message to identify the server or mount point that
> triggered it?
>
> Thanks!
