Date: Wed, 28 Feb 2024 00:29:43 -0500
From: Garrett Wollman <wollman@bimajority.org>
To: stable@freebsd.org
Cc: rmacklem@freebsd.org
Subject: 13-stable NFS server hang
Message-ID: <26078.50375.679881.64018@hergotha.csail.mit.edu>

Hi, all,

We've had some complaints of NFS hanging at unpredictable intervals. Our NFS servers are running a 13-stable from last December, and tonight I sat in front of the monitor watching `nfsstat -dW`. I was able to clearly see that there were periods when NFS activity would drop *instantly* from 30,000 ops/s to flat zero, which would last for about 25 seconds before resuming exactly as it was before.

I wrote a little awk script to watch for this happening and run `procstat -k` on the nfsd process, and I saw that all but two of the service threads were idle. The three nfsd threads that had non-idle kstacks were:

  PID    TID COMM TDNAME        KSTACK
  997 108481 nfsd nfsd: master  mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc amd64_syscall fast_syscall_common
  997 960918 nfsd nfsd: service mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
  997 962232 nfsd nfsd: service mi_switch _cv_wait txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline

I'm suspicious of two things: first, the copy_file_range RPC; second, the "master" nfsd thread is actually servicing an RPC which requires obtaining a lock. The "master" getting stuck while performing client RPCs is, I believe, the reason NFS service grinds to a halt when a client tries to write into a near-full filesystem, so this problem would be more evidence that the dispatching function should not be mixed with actual operations.

I don't know what the clients are doing, but is it possible that nfsrvd_copy_file_range is holding a lock that is needed by one or both of the other two threads?

Near-term I could change nfsrvd_copy_file_range to just unconditionally return NFSERR_NOTSUP and force the clients to fall back, but I figured I would ask if anyone else has seen this.

-GAWollman
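
For anyone who wants to reproduce the observation, here is a minimal sketch of the kind of watcher described above. It is not the script from this message. The function name `detect_stall`, the column argument, and the 1000 ops/s threshold are all hypothetical; the sketch only assumes that the input has one sample per line with the ops/s figure in a known whitespace-separated column.

```sh
#!/bin/sh
# detect_stall COL MIN  (hypothetical helper, not the original script)
# Reads periodic samples on stdin and prints "STALL <input line number>"
# whenever column COL falls from at least MIN straight to zero.
detect_stall() {
    awk -v col="${1:-1}" -v min="${2:-1000}" '
        # Ignore header lines and anything else that is not a plain number.
        $col !~ /^[0-9]+(\.[0-9]+)?$/ { next }
        {
            cur = $col + 0
            if (prev >= min && cur == 0) {
                print "STALL " NR
                fflush()    # push the event to the consumer immediately
            }
            prev = cur
        }'
}
```

One possible way to wire it up, assuming `nfsstat -dW -w 1` emits one sample per second and that you have checked its header to find which column carries the ops/s figure (both are assumptions here, as are the `$COL` and `$NFSD_PID` variables): `nfsstat -dW -w 1 | detect_stall "$COL" 1000 | while read -r _; do procstat -k "$NFSD_PID"; done`. Only the first zero sample of each stall triggers a dump, so a 25-second outage produces one `procstat -k` run rather than 25.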