Date:      Sun, 13 Dec 2020 17:08:48 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        J David <j.david.lists@gmail.com>, Konstantin Belousov <kostikbel@gmail.com>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: Major issues with nfsv4
Message-ID:  <YQXPR0101MB0968B120A417AF69CEBB6A12DDC80@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <CABXB=RRNnW9nNhFCJS1evNUTEX9LNnzyf2gOmZHHGkzAoQxbPw@mail.gmail.com>
References:  <CABXB=RRB2nUk0pPDisBQPdicUA3ooHpg8QvBwjG_nFU4cHvCYw@mail.gmail.com> <YQXPR0101MB096849ADF24051F7479E565CDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM> <CABXB=RSyN+o2yXcpmYw8sCSUUDhN-w28Vu9v_cCWa-2=pLZmHg@mail.gmail.com> <YQXPR0101MB09680D155B6D685442B5E25EDDCA0@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM> <CABXB=RSSE=yOwgOXsnbEYPqiWk5K5NfzLY=D+N9mXdVn+--qLQ@mail.gmail.com> <YQXPR0101MB0968B17010B3B36C8C41FDE1DDC90@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM> <X9Q9GAhNHbXGbKy7@kib.kiev.ua> <YQXPR0101MB0968C7629D57CA21319E50C2DDC90@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM> <X9UDArKjUqJVS035@kib.kiev.ua>, <CABXB=RRNnW9nNhFCJS1evNUTEX9LNnzyf2gOmZHHGkzAoQxbPw@mail.gmail.com>

J David wrote:
[stuff snipped]
>The most common trace (91 times, or over 1/4 of all observed) is:
>
>__mtx_lock_sleep+0xf8 nfscl_nodeleg+0x207 nfs_lookup+0x314
>VOP_LOOKUP_APV+0x75 null_lookup+0x98 VOP_LOOKUP_APV+0x75 lookup+0x451
>namei+0x414 kern_statat+0x72 freebsd11_stat+0x30 amd64_syscall+0x387
>fast_syscall_common+0xf8
This is just waiting on the mutex that protects the open/lock/delegation
state.

Now, is this just caused by heavy load and 130000 opens, or is there
something in nullfs that results in more state handling than needed?
[more stuff snipped]
>It happens these were all Python-based jobs.  Python seems predisposed
>to trigger this, but non-Python jobs trigger it as well.  Heavy use of
>stat() does seem to be a common element regardless of job type.
>
>Here's the output of "nfsstat -E -c" 60 seconds after running it with -z:
>
>Client Info:
>RPC Counts:
>      Getattr      Setattr       Lookup     Readlink         Read        Write
>         1667          111         6376            0           42          153
>       Create       Remove       Rename         Link      Symlink        Mkdir
>          111            7           14            0            0            0
>        Rmdir      Readdir     RdirPlus       Access        Mknod       Fsstat
>            0            0            0         2620            0          160
>       FSinfo     pathConf       Commit      SetClId    SetClIdCf         Lock
>            0            0          113            0            0           48
>        LockT        LockU         Open      OpenCfr
>            0           48          320            0
>   OpenDownGr        Close
>            0          402
--> So, during this time interval, there were 320 opens done and 402 closes.
>    RelLckOwn  FreeStateID    PutRootFH     DelegRet       GetAcl       SetAcl
>            0            3            0            0            0            0
>   ExchangeId   CreateSess  DestroySess  DestroyClId    LayoutGet   GetDevInfo
>            0            0            0            0            0            0
> LayoutCommit LayoutReturn ReclaimCompl    ReadDataS   WriteDataS  CommitDataS
>            0            0            0            0            0            0
>   OpenLayout CreateLayout
>            0            0
>    OpenOwner        Opens    LockOwner        Locks       Delegs     LocalOwn
>        21175       130439           30            6            0            0
So it has accumulated 130439 opens over 21175 different processes.
(An openowner represents a process on the client.)
Are there typically 20000+ processes/jobs running concurrently on this setup?

That implies that any time a file is opened (and any time a vnode's
v_usecount drops to 0), a linear traversal of a 130000-element linked
list is done while holding the mutex that the above common procstat
entry is waiting on.
(The nfscl_nodeleg() call to check for a delegation happens whenever
 lookup has a name cache hit.)
-->Also, every VOP_GETATTR() called from stat() will need to acquire the
     state mutex for a short period of time.
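
Roughly, in the form of a sketch (the structure and names below are
illustrative, not the actual sys/fs/nfsclient code), the work done under
that one mutex looks like this:

/*
 * Minimal sketch, not the real client code: clstate_mtx, struct openstate,
 * open_list and has_state_for_fh() are made-up names.  The point is that
 * one global mutex serializes a walk of a single list holding every open.
 */
#include <sys/param.h>
#include <sys/systm.h>          /* memcmp() */
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

struct openstate {
        LIST_ENTRY(openstate)   os_link;        /* one list of ALL opens */
        uint8_t                 os_fh[128];     /* NFS file handle */
        int                     os_fhlen;
};

static LIST_HEAD(, openstate) open_list = LIST_HEAD_INITIALIZER(open_list);
static struct mtx clstate_mtx;  /* stands in for NFSLOCKCLSTATE();
                                   mtx_init()ed at mount time */

/*
 * Walked on every name cache hit in lookup (cf. nfscl_nodeleg()), every
 * open and every VOP_INACTIVE(): O(n), with n around 130000 here, all
 * while holding the one state mutex.
 */
static bool
has_state_for_fh(const uint8_t *fh, int fhlen)
{
        struct openstate *os;
        bool found = false;

        mtx_lock(&clstate_mtx);
        LIST_FOREACH(os, &open_list, os_link) {
                if (os->os_fhlen == fhlen &&
                    memcmp(os->os_fh, fh, fhlen) == 0) {
                        found = true;
                        break;
                }
        }
        mtx_unlock(&clstate_mtx);
        return (found);
}

With 130000+ entries on that single list, even a modest rate of lookups
and stat()s keeps the mutex busy, which would match the __mtx_lock_sleep
traces above.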

-->So I am not surprised that the mount gets "constipated".
     To be honest, the NFSv4 state handling in the client was never designed
     to handle opens at this scale.
     --> Unlike the NFS server code, which uses hash tables of linked lists,
            the client only uses a single linked list for all open structures.
            I have thought of changing the NFS client code to use hash tables
            of lists and may work on doing so during the coming months, but
            who knows when such a patch might be ready.
            So long as the lists are short, holding the mutex during list
            traversal seems to work ok, even on a heavily loaded NFS server,
            so hashing the opens into many short lists should let the client
            code scale to much larger numbers of open structures.
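
To make the hash table idea concrete, a sketch of that layout (again with
made-up names, not a patch) would be:

/*
 * Same illustrative names as the sketch above, but the opens are hashed
 * on the file handle into many short bucket lists instead of one long list.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define OPEN_HASH_SIZE  1024            /* buckets; power of two */

struct openstate {
        LIST_ENTRY(openstate)   os_link;
        uint8_t                 os_fh[128];
        int                     os_fhlen;
};

static LIST_HEAD(openbucket, openstate) open_hash[OPEN_HASH_SIZE];
static struct mtx clstate_mtx;          /* still one state mutex */

/* Pick a bucket from the file handle bytes (FNV-1a; any byte hash works). */
static uint32_t
open_hashval(const uint8_t *fh, int fhlen)
{
        uint32_t h = 2166136261u;
        int i;

        for (i = 0; i < fhlen; i++)
                h = (h ^ fh[i]) * 16777619u;
        return (h & (OPEN_HASH_SIZE - 1));
}

/*
 * The walk under the mutex now covers one bucket (~130 entries for the
 * 130439 opens seen here) instead of the whole 130000-entry list.
 */
static bool
has_state_for_fh(const uint8_t *fh, int fhlen)
{
        struct openstate *os;
        bool found = false;

        mtx_lock(&clstate_mtx);
        LIST_FOREACH(os, &open_hash[open_hashval(fh, fhlen)], os_link) {
                if (os->os_fhlen == fhlen &&
                    memcmp(os->os_fh, fh, fhlen) == 0) {
                        found = true;
                        break;
                }
        }
        mtx_unlock(&clstate_mtx);
        return (found);
}

The real client code would also need to keep the per-openowner structures
consistent; the point is only that whatever gets walked while holding
NFSLOCKCLSTATE() stays short.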

What you are seeing may just be contention for the client state mutex
(called NFSLOCKCLSTATE() in the code), caused by repeated traversals
of the 130000-element linked list.

The other question is...
Is having 130000 opens normal, or is nullfs somehow delaying the
VOP_INACTIVE() call on the underlying NFS vnode enough to cause
these opens to accumulate?
(Due to how these Windows oplock-like opens are defined, they cannot
 be closed/discarded by the NFS client until VOP_INACTIVE() is called on
 the open file's vnode.)
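
The mechanism I am wondering about, sketched with illustrative names
(this is not the actual nullfs or NFS client code):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/vnode.h>

/*
 * Illustrative only.  The NFSv4 client may only discard these opens from
 * its VOP_INACTIVE() routine, and the VFS only calls VOP_INACTIVE() once
 * the vnode's use count drops to zero.
 */
static int
example_nfs_inactive(struct vop_inactive_args *ap)
{

        /* This is where the NFSv4 Close(s) for ap->a_vp can finally happen. */
        return (0);
}

/*
 * A nullfs node holds a reference on the lower (NFS) vnode for as long as
 * the null vnode itself stays cached.  That reference keeps the lower
 * vnode's use count above zero, so its VOP_INACTIVE() -- and therefore
 * the Close above -- is deferred and the opens accumulate.
 */
static void
example_null_reclaim(struct vnode *lowervp)
{

        vrele(lowervp); /* dropping the last reference lets VOP_INACTIVE() run */
}

Whether nullfs is actually holding enough vnodes cached over the NFS
mount to explain 130000 opens is exactly the question.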

Counting how many processes are in __mtx_lock_sleep in the procstat
output should give us a better idea of whether contention on the NFSv4
client state mutex is causing the problem.

rick

>    LocalOpen    LocalLown    LocalLock
>            0            0            0
Rpc Info:
     TimedOut      Invalid    X Replies      Retries     Requests
            0            0            0            0        12247
Cache Info:
    Attr Hits  Attr Misses    Lkup Hits  Lkup Misses
      1110054          858      1002829         6361
    BioR Hits  BioR Misses    BioW Hits  BioW Misses
         2000           54          292          153
   BioRL Hits BioRL Misses    BioD Hits  BioD Misses
         6911            0          208            0
    DirE Hits  DirE Misses
          104            0

This does reflect the whole machine, not just those four processes.

Finally, when I attempted to kill those four processes with ktrace
running on all of them, the system panicked:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8272d28f
stack pointer         = 0x28:0xfffffe008a5f24c0
frame pointer         = 0x28:0xfffffe008a5f25f0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 5627 (node)
trap number = 12
panic: page fault
cpuid = 2
time = 1607845622
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008a5f2180
vpanic() at vpanic+0x17b/frame 0xfffffe008a5f21d0
panic() at panic+0x43/frame 0xfffffe008a5f2230
trap_fatal() at trap_fatal+0x391/frame 0xfffffe008a5f2290
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe008a5f22e0
trap() at trap+0x286/frame 0xfffffe008a5f23f0
calltrap() at calltrap+0x8/frame 0xfffffe008a5f23f0
--- trap 0xc, rip = 0xffffffff8272d28f, rsp = 0xfffffe008a5f24c0, rbp = 0xfffffe008a5f25f0 ---
null_bypass() at null_bypass+0xaf/frame 0xfffffe008a5f25f0
VOP_ADVLOCK_APV() at VOP_ADVLOCK_APV+0x80/frame 0xfffffe008a5f2620
closef() at closef+0x8f/frame 0xfffffe008a5f26b0
fdescfree_fds() at fdescfree_fds+0x3c/frame 0xfffffe008a5f2700
fdescfree() at fdescfree+0x466/frame 0xfffffe008a5f27c0
exit1() at exit1+0x488/frame 0xfffffe008a5f2820
sigexit() at sigexit+0x159/frame 0xfffffe008a5f2b00
postsig() at postsig+0x2fd/frame 0xfffffe008a5f2bc0
ast() at ast+0x317/frame 0xfffffe008a5f2bf0
doreti_ast() at doreti_ast+0x1f/frame 0x7fffffff67a0

The fsck ate the ktrace.out file on reboot.  This system does not have
kernel debug symbols.  If there's a way to figure out what
null_bypass+0xaf corresponds to using the machine that built this
kernel, which has the kernel, objects, debug symbols, and source,
please let me know.  It's built from releng/12.2 r368515.

Thanks!


