Date: Mon, 6 Jan 2025 17:37:49 +0100
From: "Peter 'PMc' Much" <pmc@citylink.dinoex.sub.org>
To: Rick Macklem <rick.macklem@gmail.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: system stalled, no I/O but 100% CPU from nfs
Message-ID: <Z3wG3fEYjeE9f4nF@disp.intra.daemon.contact>
In-Reply-To: <CAM5tNy5AzL9%2BWpjRV9N1Wzy94RpA2L93NqnYFjFvx38iAo1iyg@mail.gmail.com>
References: <Z3tdPjxTE6GZmzwW@disp.intra.daemon.contact> <CAM5tNy5AzL9%2BWpjRV9N1Wzy94RpA2L93NqnYFjFvx38iAo1iyg@mail.gmail.com>

On Mon, Jan 06, 2025 at 05:53:38AM -0800, Rick Macklem wrote:
! On Sun, Jan 5, 2025 at 8:45 PM Peter 'PMc' Much
! <pmc@citylink.dinoex.sub.org> wrote:
! > This doesn't look good. It goes on for hours. What can be done about it?
! > (13.4 client & server)
! >
! >
! > 44 processes:  4 running, 39 sleeping, 1 waiting
! > CPU:  0.4% user,  0.0% nice, 99.6% system,  0.0% interrupt,  0.0% idle
! > Mem: 21M Active, 198M Inact, 1190M Wired, 278M Buf, 3356M Free
! > ARC: 418M Total, 39M MFU, 327M MRU, 128K Anon, 7462K Header, 43M Other
! >      332M Compressed, 804M Uncompressed, 2.42:1 Ratio
! > Swap: 15G Total, 15G Free
! >
! >   PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
! >   417 root          4  52    0    12M  2148K RUN     20:55  99.12% nfscbd

! Do you have delegations enabled on your server
! (vfs.nfsd.issue_delegations not 0)?

Not knowingly:

# sysctl vfs.nfsd.issue_delegations
vfs.nfsd.issue_delegations: 0

! (If you do not, I have no idea why the server would be doing
! callbacks, which is what nfscbd
! handles.)

Me neither. ;)
The good news at this point is, it is a single event. At first I
thought the whole cluster got slow (it is always too slow ;) ), but
it was only this node - the others have no cpu consumption on nfscbd.

The bad thing is, I cannot remember why I did switch that thing on.

! Also, "nfsstat -m" on the client shows you/us what your mount
! options are.

It had to be destroyed, as effects got worse.

What I figured is: it didn't issue any syscalls, and it didn't act on
kill -9. Which means: most likely it found an infinite loop inside the
kernel, aka a never-returning syscall.

! The above suggests that there is still some activity on the client, but the
! info. is limited.

Yes, it got ever slower. The NFS mount is for /usr/ports, and I did
fix some ports there. At some point a "make clean" would start to
take minutes to complete, and there I noticed something is wrong.
Finally it didn't even echo on the console (I had only one cpu
available, and then when something is stuck within the kernel, all
depends on preemption).

! If the client is still in this state, you can collect more info via:
! # tcpdump -s 0 -w out.pcap host <nfs-server>
! run for a little while.

I had to destroy it. I tried to run dtrace to pinpoint exactly where
that thing does execute, but it didn't startup. At that point I didn't
consider it feasible to try further investigation.

These are temporary building guests, they get destroyed after
completion anyway.

So, as apparently it was a single event, I might suggest we just
remember that nfscbd /can do this/ (under yet unclear circumstances)
and otherwise hope for the best.

And probably I should get rid of that daemon altogether. I think I
read something about these delegations, and it looked suitable for the
usecase, but I didn't realize that it would need to be activated on
the server also.
(The usecase is, a snapshot + clone is created from the ports repo,
then switched to a desired tag/branch, and that filetree is then used
by a single guest, exclusively.)

Thanks for Your help!

cheerio,
PMc
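
For a client that still responds, the diagnostics mentioned in this thread could be gathered in one pass before the guest is destroyed. The following is only a minimal sketch under assumptions: the function names `run` and `nfs_diag`, the `DRY_RUN` switch, the 30-second default and the use of `procstat -kk` (to show where the nfscbd threads sit in the kernel) are illustrative additions, not something proposed in the thread; only `nfsstat -m` and the `tcpdump` invocation come from the messages above.

```sh
#!/bin/sh
# Hypothetical helper, not part of the original thread.
# Collects NFS callback diagnostics on a FreeBSD client.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# nfs_diag <nfs-server> [capture-seconds]
nfs_diag() {
    server=$1
    secs=${2:-30}   # assumed default capture time

    # Mount options as seen by the client (suggested in the thread).
    run nfsstat -m

    # Kernel stacks of any nfscbd process (assumption: useful for
    # locating a loop inside the kernel; skipped if none is running).
    for pid in $(pgrep -x nfscbd 2>/dev/null); do
        run procstat -kk "$pid"
    done

    # Packet capture between client and server (from the thread),
    # bounded by timeout(1) so it ends on its own.
    run timeout "$secs" tcpdump -s 0 -w out.pcap host "$server"
}
```

After sourcing the file, `DRY_RUN=1; nfs_diag <nfs-server> 60` prints what would be executed; without `DRY_RUN` the commands run as root on the client. On a guest as badly stuck as the one described here, these commands may well not start either.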