Date:      Wed, 28 Feb 2024 00:29:43 -0500
From:      Garrett Wollman <wollman@bimajority.org>
To:        stable@freebsd.org
Cc:        rmacklem@freebsd.org
Subject:   13-stable NFS server hang
Message-ID:  <26078.50375.679881.64018@hergotha.csail.mit.edu>

Hi, all,

We've had some complaints of NFS hanging at unpredictable intervals.
Our NFS servers are running a 13-stable build from last December, and
tonight I sat in front of the monitor watching `nfsstat -dW`.  I was
able to clearly see that there were periods when NFS activity would
drop *instantly* from 30,000 ops/s to flat zero, which would last
for about 25 seconds before resuming exactly as it was before.

I wrote a little awk script to watch for this happening and run
`procstat -k` on the nfsd process, and I saw that all but two of the
service threads were idle.  The three nfsd threads that had non-idle
kstacks (the "master" plus those two service threads) were:

  PID    TID COMM                TDNAME              KSTACK                       
  997 108481 nfsd                nfsd: master        mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc amd64_syscall fast_syscall_common 
  997 960918 nfsd                nfsd: service       mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline 
  997 962232 nfsd                nfsd: service       mi_switch _cv_wait txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline 
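
For anyone who wants to reproduce this, here is a rough sketch of the
kind of watcher described above.  It is not the script that was actually
used.  The function name `watch_for_stalls` is made up for illustration,
and the sketch assumes that each sample line consists of
whitespace-separated numeric counters, which will need adjusting to the
real `nfsstat` column layout.  The `-w 1` interval in the usage comment
and the use of PID 997 are likewise assumptions.

```shell
#!/bin/sh
# watch_for_stalls CMD  (hypothetical helper, not from the original post)
# Reads one sample per line on stdin, sums the numeric fields, and runs
# CMD once each time the total drops from nonzero to zero.
watch_for_stalls() {
    awk -v cmd="$1" '
        # Skip header lines: keep only lines whose first field is a number.
        $1 !~ /^[0-9]+(\.[0-9]+)?$/ { next }
        {
            total = 0
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+(\.[0-9]+)?$/)
                    total += $i
            # Fire only on the transition from activity to no activity,
            # not on every idle sample.
            if (prev > 0 && total == 0)
                system(cmd)
            prev = total
        }
    '
}

# Assumed usage (interval flag and PID are illustrative):
#   nfsstat -dW -w 1 | watch_for_stalls 'date; procstat -k 997'
```

Firing only on the nonzero-to-zero transition keeps a long stall from
producing one `procstat` dump per sample.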

I'm suspicious of two things: first, the copy_file_range RPC; second,
the fact that the "master" nfsd thread is actually servicing an RPC
that requires obtaining a lock.  The "master" getting stuck while
performing client RPCs is, I believe, the reason NFS service grinds to
a halt when a client tries to write into a near-full filesystem, so
this problem would be further evidence that the dispatching function
should not be mixed with actual operations.  I don't know what the
clients are doing, but is it possible that nfsrvd_copy_file_range is
holding a lock that is needed by one or both of the other two threads?

Near-term I could change nfsrvd_copy_file_range to just
unconditionally return NFSERR_NOTSUP and force the clients to fall
back, but I figured I would ask if anyone else has seen this.

-GAWollman



