Date: Sat, 28 May 2022 00:12:59 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Andreas Kempe <kempe@lysator.liu.se>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: FreeBSD 12.3/13.1 NFS client hang
Message-ID: <YQBPR0101MB97422DE573F6221689DC119DDDDB9@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <YpFM2bSMscG4ekc9@shipon.lysator.liu.se>
References: <YpEwxdGCouUUFHiE@shipon.lysator.liu.se> <YQBPR0101MB9742280313FC17543132A61CDDD89@YQBPR0101MB9742.CANPRD01.PROD.OUTLOOK.COM> <YpFM2bSMscG4ekc9@shipon.lysator.liu.se>

Andreas Kempe <kempe@lysator.liu.se> wrote:
[stuff snipped]
>
> The one thing we have seen logged are messages along the lines of:
> kernel: newnfs: server 'mail' error: fileid changed. fsid 4240eca6003a052a:0:
> expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
I think this can also happen if a Getattr operation fails with an error at
the server. It then has "default attributes" and a default of 0x2 (root inode#)
can be expected. So, I suspect this is what is happening. Generally, failed
Getattrs will be problematic, but I'm not sure if they can cause hangs?

If you can capture packets when these get logged, we can confirm if a
Getattr operation has failed with an error.

rick

> Also, maybe I'm old fashioned, but I find "ps axHl" useful, since it shows
> where all the processes are sleeping.
> And "procstat -kk" covers all of the locks.
>

I don't know if it is a matter of being old fashioned as much as one
of taste. :) In future dumps, I can provide both ps axHl and procstat -kk.

> > Below are procstat kstack $PID invocations showing where the processes
> > have hung. In the nfsv4_sequencelookup it seems hung waiting for
> > nfsess_slots to have an available slot. In the second nfs_lock case,
> > it seems the processes are stuck waiting on vnode locks.
> >
> > These issues seem to appear seemingly at random, but also if
> > operations that open a lot of files or create a lot of file locks are
> > used. An example that can often provoke a hang is performing a
> > recursive grep through a large file hierarchy like the FreeBSD
> > codebase.
> >
> > The NFS code is large and complicated so any advice is appreciated!
> Yea. I'm the author and I don't know exactly what it all does;-)
>
> > Cordially,
> > Andreas Kempe
> >
>
> [...]
>
> Not very useful unless you have all the processes and their locks to try and figure out what is holding
> the vnode locks.
>

Yes, I sent this mostly in the hope that it might be something that
someone has seen before. I understand that more verbose information is
needed to track down the lock contention.

I'll switch our machines back to using hard mounts and try to get as
much diagnostic information as possible when the next lockup happens.

Do you have any good suggestions for tracking down the issue? I've
been contemplating enabling WITNESS or building with debug information
to be able to hook in the kernel debugger.

Thank you very much for your reply!
Cordially,
Andreas Kempe

> rick
>
>
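
A minimal sketch, not taken from the thread, of how the "fileid changed" messages quoted above could be picked out of a log so that packet captures can be matched against them. The regular expression is modeled only on the single message quoted in this thread, so the exact format on other kernel versions is an assumption. The function name parse_fileid_change and the constant ROOT_FILEID are hypothetical. The check against 0x2 follows the suggestion above that a failed Getattr may yield the default root inode number; it flags candidates only and does not confirm that a Getattr failed.

```python
import re

# Pattern modeled on the one log line quoted in the thread (assumption:
# other FreeBSD versions may word or wrap the message differently).
FILEID_RE = re.compile(
    r"newnfs: server '(?P<server>[^']*)' error: fileid changed\. "
    r"fsid (?P<fsid>\S+): expected fileid (?P<expected>0x[0-9a-fA-F]+), "
    r"got (?P<got>0x[0-9a-fA-F]+)\."
)

# Hypothetical name for the "default" fileid mentioned in the thread.
ROOT_FILEID = 0x2


def parse_fileid_change(line):
    """Return a dict describing a 'fileid changed' message, or None."""
    m = FILEID_RE.search(line)
    if m is None:
        return None
    got = int(m.group("got"), 16)
    return {
        "server": m.group("server"),
        "fsid": m.group("fsid"),
        "expected": int(m.group("expected"), 16),
        "got": got,
        # True when the reply carried the default root fileid, which is
        # only a hint that the Getattr may have failed at the server.
        "got_is_default": got == ROOT_FILEID,
    }
```

Feeding each log line through parse_fileid_change and noting the timestamps of the lines where got_is_default is true gives a short list of moments to look for in a packet capture.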