Date: Sun, 13 Dec 2020 17:08:48 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: J David <j.david.lists@gmail.com>, Konstantin Belousov <kostikbel@gmail.com>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: Major issues with nfsv4
Message-ID: <YQXPR0101MB0968B120A417AF69CEBB6A12DDC80@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CABXB=RRNnW9nNhFCJS1evNUTEX9LNnzyf2gOmZHHGkzAoQxbPw@mail.gmail.com>
References: <CABXB=RRB2nUk0pPDisBQPdicUA3ooHpg8QvBwjG_nFU4cHvCYw@mail.gmail.com>
 <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <CABXB=RSyN+o2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com>
 <YQXPR0101MB09680D155B6D685442B5E25EDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <CABXB=RSSE=yOwgOXsnbEYPqiWk5K5NfzLY=D+N9mXdVn+--qLQ@mail.gmail.com>
 <YQXPR0101MB0968B17010B3B36C8C41FDE1DDC90@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <X9Q9GAhNHbXGbKy7@kib.kiev.ua>
 <YQXPR0101MB0968C7629D57CA21319E50C2DDC90@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
 <X9UDArKjUqJVS035@kib.kiev.ua>
 <CABXB=RRNnW9nNhFCJS1evNUTEX9LNnzyf2gOmZHHGkzAoQxbPw@mail.gmail.com>
J David wrote:
[stuff snipped]
>The most common trace (91 times, or over 1/4 of all observed) is:
>
>__mtx_lock_sleep+0xf8 nfscl_nodeleg+0x207 nfs_lookup+0x314
>VOP_LOOKUP_APV+0x75 null_lookup+0x98 VOP_LOOKUP_APV+0x75 lookup+0x451
>namei+0x414 kern_statat+0x72 freebsd11_stat+0x30 amd64_syscall+0x387
>fast_syscall_common+0xf8
This is just waiting on the mutex that protects the open/lock/delegation
state.

Now, is this just caused by heavy load and 130000 opens, or is there
something in nullfs that results in more state handling than needed?
[more stuff snipped]
>It happens these were all Python-based jobs.  Python seems predisposed
>to trigger this, but non-Python jobs trigger it as well.  Heavy use of
>stat() does seem to be a common element regardless of job type.
>
>Here's the output of "nfsstat -E -c" 60 seconds after running it with -z:
>
>Client Info:
>RPC Counts:
>      Getattr     Setattr      Lookup    Readlink        Read       Write
>         1667         111        6376           0          42         153
>       Create      Remove      Rename        Link     Symlink       Mkdir
>          111           7          14           0           0           0
>        Rmdir     Readdir    RdirPlus      Access       Mknod      Fsstat
>            0           0           0        2620           0         160
>       FSinfo    pathConf      Commit     SetClId   SetClIdCf        Lock
>            0           0         113           0           0          48
>        LockT       LockU        Open     OpenCfr
>            0          48         320           0
>   OpenDownGr       Close
>            0         402
--> So, during this time interval, there were 320 opens done and 402 closes.
>    RelLckOwn FreeStateID   PutRootFH    DelegRet      GetAcl      SetAcl
>            0           3           0           0           0           0
>   ExchangeId  CreateSess DestroySess DestroyClId   LayoutGet  GetDevInfo
>            0           0           0           0           0           0
> LayoutCommit LayoutReturn ReclaimCompl   ReadDataS  WriteDataS CommitDataS
>            0           0           0           0           0           0
>   OpenLayout CreateLayout
>            0           0
>    OpenOwner       Opens   LockOwner       Locks      Delegs    LocalOwn
>        21175      130439          30           6           0           0
So it has accumulated 130439 opens over 21175 different processes.
(An openowner represents a process on the client.)
Are there typically 20000+ processes/jobs running concurrently on this setup?

That implies that any time a file is opened (and any time a vnode v_usecount
drops to 0) a linear traversal of a 130000 element linked list is done while
holding the mutex that the above common procstat entry is waiting on.
(The nfscl_nodeleg() call to check for a delegation happens whenever lookup
has a name cache hit.)
--> Also, every VOP_GETATTR() called from stat() will need to acquire the
state mutex for a short period of time.

--> So I am not surprised that the mount gets "constipated".
To be honest, the NFSv4 state handling in the client was never designed
to handle opens at this scale.
--> Unlike the NFS server code, which uses hash tables of linked lists, the
client only uses a single linked list for all open structures.
I have thought of changing the NFS client code to use hash tables of lists
and may work on doing so during the coming months, but who knows when such a
patch might be ready.  (A very rough sketch of the idea is included below,
just before my signature.)
So long as the lists are short, holding the mutex during list traversal
seems to work ok, even on a heavily loaded NFS server, so doing so should
allow the client code to scale to larger numbers of open structures.

What you are seeing may just be contention for the client state mutex
(called NFSLOCKCLSTATE() in the code), caused by repeated traversals of the
130000 element linked list.

The other question is...
Is having 130000 opens normal, or is nullfs somehow delaying the
VOP_INACTIVE() call on the underlying NFS vnode enough to cause these opens
to accumulate?
(Due to how these Windows oplock-like opens are defined, they cannot be
closed/discarded by the NFS client until VOP_INACTIVE() is called on the
open file's vnode.)

Counting how many processes are in __mtx_lock_sleep in the procstat output
should give us a better idea if contention on the NFSv4 client state mutex
is causing the problem.
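
Something along these lines would do that count.  This is just a rough,
untested sketch (mtxcount.c is a made-up name, not an existing tool), and it
assumes the stacks are gathered with something like "procstat -kk -a", where
the pid is the first column of each line:

/*
 * Rough, untested sketch: count how many threads/processes have
 * __mtx_lock_sleep in their kernel stack, reading the output of
 * something like "procstat -kk -a" from stdin.
 *
 * Usage:  procstat -kk -a | ./mtxcount
 */
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
	char line[4096];
	int *pids = NULL;
	size_t npids = 0, cap = 0, i;
	unsigned long nthreads = 0;

	while (fgets(line, sizeof(line), stdin) != NULL) {
		int pid;

		if (strstr(line, "__mtx_lock_sleep") == NULL)
			continue;
		/* The pid is the first field on each procstat line. */
		if (sscanf(line, "%d", &pid) != 1)
			continue;
		nthreads++;
		/* Linear scan of seen pids; fine for a one-off diagnostic. */
		for (i = 0; i < npids; i++)
			if (pids[i] == pid)
				break;
		if (i == npids) {
			if (npids == cap) {
				cap = (cap == 0) ? 64 : cap * 2;
				pids = realloc(pids, cap * sizeof(*pids));
				if (pids == NULL)
					err(1, "realloc");
			}
			pids[npids++] = pid;
		}
	}
	printf("%lu threads in __mtx_lock_sleep across %zu processes\n",
	    nthreads, npids);
	free(pids);
	return (0);
}

Grep/awk would do the same job, of course; the point is just to get the
number of distinct processes stuck waiting on that mutex.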
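
And, to make the "hash tables of lists" idea above a little more concrete,
here is a very rough userland sketch of what I mean.  All of the names
(nfssketch_open and friends) and the FNV-1a hash are invented for
illustration only; they are not the real sys/fs/nfsclient structures, and a
real patch would also have to deal with reference counts, lock owners, etc.:

/*
 * Very rough sketch of a hash table of open-structure lists.
 * Names here are invented and do not match the actual NFSv4 client code.
 */
#include <sys/queue.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define	OPEN_HASH_SIZE	1024		/* number of buckets (power of 2) */
#define	MAX_FH_LEN	128

struct nfssketch_open {
	LIST_ENTRY(nfssketch_open) o_hash;	/* bucket linkage */
	uint8_t		o_fh[MAX_FH_LEN];	/* file handle for this open */
	int		o_fhlen;
	/* ... openowner, stateid, reference counts, etc. ... */
};

LIST_HEAD(openhashhead, nfssketch_open);

/* Static zero-initialization leaves every bucket as an empty list. */
static struct openhashhead open_hash[OPEN_HASH_SIZE];

/* Simple FNV-1a hash over the file handle, masked to a bucket index. */
static uint32_t
fh_hash(const uint8_t *fh, int fhlen)
{
	uint32_t h = 2166136261u;
	int i;

	for (i = 0; i < fhlen; i++) {
		h ^= fh[i];
		h *= 16777619u;
	}
	return (h & (OPEN_HASH_SIZE - 1));
}

/*
 * Lookup only walks one bucket, so the time spent holding the state
 * mutex no longer grows with the total number of opens on the mount.
 */
static struct nfssketch_open *
open_find(const uint8_t *fh, int fhlen)
{
	struct nfssketch_open *op;

	/* NFSLOCKCLSTATE() would be held across this traversal. */
	LIST_FOREACH(op, &open_hash[fh_hash(fh, fhlen)], o_hash)
		if (op->o_fhlen == fhlen &&
		    memcmp(op->o_fh, fh, fhlen) == 0)
			return (op);
	return (NULL);
}

static void
open_insert(struct nfssketch_open *op)
{

	LIST_INSERT_HEAD(&open_hash[fh_hash(op->o_fh, op->o_fhlen)],
	    op, o_hash);
}

/* Tiny self-test so the sketch actually compiles and runs. */
int
main(void)
{
	static struct nfssketch_open a, b;

	memcpy(a.o_fh, "\x01\x02\x03\x04", 4);
	a.o_fhlen = 4;
	memcpy(b.o_fh, "\xaa\xbb\xcc\xdd", 4);
	b.o_fhlen = 4;
	open_insert(&a);
	open_insert(&b);
	printf("lookup found the right open: %d\n",
	    open_find(a.o_fh, 4) == &a);
	return (0);
}

The point is just that an open or delegation check would only walk one short
bucket list while holding NFSLOCKCLSTATE(), instead of the single
130000 entry list.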

rick

> LocalOpen LocalLown LocalLock
>         0         0         0
Rpc Info:
   TimedOut    Invalid  X Replies    Retries   Requests
          0          0          0          0      12247
Cache Info:
  Attr Hits  Attr Misses  Lkup Hits  Lkup Misses
    1110054          858    1002829         6361
  BioR Hits  BioR Misses  BioW Hits  BioW Misses
       2000           54        292          153
 BioRL Hits BioRL Misses  BioD Hits  BioD Misses
       6911            0        208            0
  DirE Hits  DirE Misses
        104            0

This does reflect the whole machine, not just those four processes.

Finally, when I attempted to kill those four processes with ktrace
running on all of them, the system panicked:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x10
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8272d28f
stack pointer           = 0x28:0xfffffe008a5f24c0
frame pointer           = 0x28:0xfffffe008a5f25f0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 5627 (node)
trap number             = 12
panic: page fault
cpuid = 2
time = 1607845622
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008a5f2180
vpanic() at vpanic+0x17b/frame 0xfffffe008a5f21d0
panic() at panic+0x43/frame 0xfffffe008a5f2230
trap_fatal() at trap_fatal+0x391/frame 0xfffffe008a5f2290
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe008a5f22e0
trap() at trap+0x286/frame 0xfffffe008a5f23f0
calltrap() at calltrap+0x8/frame 0xfffffe008a5f23f0
--- trap 0xc, rip = 0xffffffff8272d28f, rsp = 0xfffffe008a5f24c0, rbp = 0xfffffe008a5f25f0 ---
null_bypass() at null_bypass+0xaf/frame 0xfffffe008a5f25f0
VOP_ADVLOCK_APV() at VOP_ADVLOCK_APV+0x80/frame 0xfffffe008a5f2620
closef() at closef+0x8f/frame 0xfffffe008a5f26b0
fdescfree_fds() at fdescfree_fds+0x3c/frame 0xfffffe008a5f2700
fdescfree() at fdescfree+0x466/frame 0xfffffe008a5f27c0
exit1() at exit1+0x488/frame 0xfffffe008a5f2820
sigexit() at sigexit+0x159/frame 0xfffffe008a5f2b00
postsig() at postsig+0x2fd/frame 0xfffffe008a5f2bc0
ast() at ast+0x317/frame 0xfffffe008a5f2bf0
doreti_ast() at doreti_ast+0x1f/frame 0x7fffffff67a0

The fsck ate the ktrace.out file on reboot.  This system does not have
kernel debug symbols.  If there's a way to try to figure out what
null_bypass+0xaf corresponds to using the machine that built this
kernel, which has the kernel, objects, debug symbols, and source,
please let me know.  It's built from releng/12.2 r368515.

Thanks!