Date: Fri, 27 May 2022 20:59:57 +0000 From: Rick Macklem <rmacklem@uoguelph.ca> To: Andreas Kempe <kempe@lysator.liu.se>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org> Subject: Re: FreeBSD 12.3/13.1 NFS client hang Message-ID: <YQBPR0101MB9742280313FC17543132A61CDDD89@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM> In-Reply-To: <YpEwxdGCouUUFHiE@shipon.lysator.liu.se> References: <YpEwxdGCouUUFHiE@shipon.lysator.liu.se>
next in thread | previous in thread | raw e-mail | index | archive | help
Andreas Kempe <kempe@lysator.liu.se> wrote:=0A= > Hello everyone!=0A= >=0A= > I'm having issues with the NFS clients on FreeBSD 12.3 and 13.1=0A= > systems hanging when using a CentOS 7 server.=0A= First, make sure you are using hard mounts. "soft" or "intr" mounts won't= =0A= work and will mess up the session sooner or later. (A messed up session cou= ld=0A= result in no free slots on the session and that will wedge threads in=0A= nfsv4_sequencelookup() as you describe.=0A= (This is briefly described in the BUGS section of "man mount_nfs".)=0A= =0A= Do a:=0A= # nfsstat -m=0A= on the clients and look for "hard".=0A= =0A= Next, is there anything logged on the console for the 13.1 client(s)?=0A= (13.1 has some diagnostics for things like a server replying with the=0A= wrong session slot#.)=0A= =0A= Also, maybe I'm old fashioned, but I find "ps axHl" useful, since it shows= =0A= where all the processes are sleeping.=0A= And "procstat -kk" covers all of the locks.=0A= =0A= > Below are procstat kstack $PID invocations showing where the processes=0A= > have hung. In the nfsv4_sequencelookup it seems hung waiting for=0A= > nfsess_slots to have an available slot. In the second nfs_lock case,=0A= > it seems the processes are stuck waiting on vnode locks.=0A= > =0A= > These issues seem to appear seemingly at random, but also if=0A= > operations that open a lot of files or create a lot of file locks are=0A= > used. An example that can often provoke a hang is performing a=0A= > recursive grep through a large file hierarchy like the FreeBSD=0A= > codebase.=0A= >=0A= > The NFS code is large and complicated so any advice is appriciated!=0A= Yea. I'm the author and I don't know exactly what it all does;-)\=0A= =0A= > Cordially,=0A= > Andreas Kempe=0A= >=0A= > Hang provoked when calling "grep -R SOME_STRING ." in the FreeBSD code ba= se.=0A= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=0A= > =0A= > PID TID COMM TDNAME KSTACK=0A= > 35585 101045 python3.8 - mi_switch sleepq_tim= edwait _sleep nfsv4_sequencelookup nfsv4_setsequence nfscl_reqstart nfsrpc_= lock nfsrpc_advlock nfs_advlock VOP_ADVLOCK_APV kern_fcntl kern_fcntl_freeb= sd amd64_syscall fast_syscall_common=0A= As you noted, this is waiting for a session slot to become available. Norma= l, so long as other RPCs=0A= are "in progress" that will release slots when replies are received. (If yo= ur mount was not hard, sooner or=0A= later, the client will give up waiting for the reply and the session slot w= ill not be released. Once all slots=0A= are "not released" you are hung.=0A= =0A= > 35585 101045 python3.8 - mi_switch sleepq_tim= edwait _sleep nfsv4_sequencelookup nfsv4_setsequence nfscl_reqstart nfsrpc_= lock nfsrpc_advlock nfs_advlock VOP_ADVLOCK_APV kern_fcntl kern_fcntl_freeb= sd amd64_syscall fast_syscall_common=0A= Same as above.=0A= =0A= > 44046 101189 vim - mi_switch sleepq_tim= edwait _sleep nfsv4_sequencelookup nfsv4_setsequence nfscl_reqstart nfsrpc_= accessrpc nfs34_access_otw nfs_access VOP_ACCESS_APV vn_dir_check_exec nfs_= lookup VOP_LOOKUP_APV lookup namei kern_statat sys_fstatat amd64_syscall=0A= Again.=0A= > 44046 101764 vim - mi_switch sleepq_cat= ch_signals sleepq_timedwait_sig _cv_timedwait_sig_sbt seltdwait kern_select= sys_select amd64_syscall fast_syscall_common=0A= > 44046 101853 vim - mi_switch sleepq_cat= ch_signals sleepq_wait_sig _sleep umtxq_sleep __umtx_op_sem2_wait sys__umtx= _op amd64_syscall fast_syscall_common=0A= > 44046 102164 vim - mi_switch sleepq_cat= ch_signals sleepq_wait_sig _sleep umtxq_sleep __umtx_op_sem2_wait sys__umtx= _op amd64_syscall fast_syscall_common=0A= > 44046 102165 vim - mi_switch sleepq_cat= ch_signals sleepq_wait_sig _sleep umtxq_sleep __umtx_op_sem2_wait sys__umtx= _op amd64_syscall fast_syscall_common=0A= > 44046 102457 vim - mi_switch sleepq_cat= ch_signals sleepq_wait_sig _sleep umtxq_sleep __umtx_op_sem2_wait sys__umtx= _op amd64_syscall fast_syscall_common=0A= > 44046 102472 vim - mi_switch sleepq_cat= ch_signals sleepq_wait_sig _sleep umtxq_sleep __umtx_op_sem2_wait sys__umtx= _op amd64_syscall fast_syscall_common=0A= I know nothing about umtx, so can't help here.=0A= =0A= > 44172 101824 tmux - mi_switch sleepq_tim= edwait _sleep nfsv4_sequencelookup nfsv4_setsequence nfscl_reqstart nfsrpc_= accessrpc nfs34_access_otw nfs_access VOP_ACCESS_APV vn_dir_check_exec nfs_= lookup VOP_LOOKUP_APV lookup namei kern_chdir amd64_syscall fast_syscall_co= mmon=0A= Another one waiting for a session slot.=0A= =0A= > Hang provoked randomly when trying to save an image in kolourpaint.=0A= > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=0A= >=0A= > PID TID COMM TDNAME KSTACK=0A= > 58062 159450 kolourpaint - mi_switch sleeplk lo= ckmgr_slock_hard nfs_lock vop_sigdefer _vn_lock vfs_cache_root vfs_root_sig= defer lookup namei kern_statat sys_fstatat amd64_syscall fast_syscall_commo= n=0A= Yep, waiting for a vnode lock, I think.=0A= =0A= > 58062 176390 kolourpaint - mi_switch sleepq_cat= ch_signals sleepq_wait_sig _cv_wait_sig seltdwait kern_poll sys_poll amd64_= syscall fast_syscall_common=0A= > 58062 176678 kolourpaint - mi_switch sleepq_cat= ch_signals sleepq_wait_sig _sleep kqueue_kevent kern_kevent_fp kern_kevent_= generic sys_kevent amd64_syscall fast_syscall_common=0A= >=0A= > PID TID COMM TDNAME KSTACK=0A= > 34291 101005 fish - mi_switch sleepq_wai= t sleeplk lockmgr_slock_hard VOP_LOCK1_APV nfs_lock VOP_LOCK1_APV _vn_lock = vget vfs_hash_get ncl_nget nfs_root lookup namei vn_open_cred kern_openat a= md64_syscall fast_syscall_common=0A= Also waiting for a vnode.=0A= =0A= > 34291 102492 fish - mi_switch sleepq_wai= t sleeplk lockmgr_slock_hard VOP_LOCK1_APV nfs_lock VOP_LOCK1_APV _vn_lock = vget vfs_hash_get ncl_nget nfs_root lookup namei kern_accessat amd64_syscal= l fast_syscall_common=0A= Again.=0A= > 34291 102493 fish - mi_switch sleepq_wai= t sleeplk lockmgr_slock_hard VOP_LOCK1_APV nfs_lock VOP_LOCK1_APV _vn_lock = vget vfs_hash_get ncl_nget nfs_root lookup namei kern_accessat amd64_syscal= l fast_syscall_common=0A= And again.=0A= =0A= > PID TID COMM TDNAME KSTACK=0A= > 204 100923 autounmountd - mi_switch sleepq_wait= sleeplk lockmgr_xlock_hard VOP_LOCK1_APV nfs_lock VOP_LOCK1_APV _vn_lock v= get vfs_hash_get ncl_nget nfs_statfs __vfs_statfs kern_getfsstat sys_getfss= tat amd64_syscall fast_syscall_common=0A= And again.=0A= =0A= Not very useful unless you have all the processes and their locks to try an= d figure out what is holding=0A= the vnode locks.=0A= =0A= rick=0A= =0A= =0A=
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YQBPR0101MB9742280313FC17543132A61CDDD89>