Date:      Tue, 5 Mar 2024 06:43:05 -0800
From:      Rick Macklem <rick.macklem@gmail.com>
To:        Ronald Klop <ronald-lists@klop.ws>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@bimajority.org>, stable@freebsd.org
Subject:   Re: 13-stable NFS server hang
Message-ID:  <CAM5tNy45ovMMrGHx0_tKUPC3in5WfjrtbBpv23k99CuxFxY21w@mail.gmail.com>
In-Reply-To: <CAM5tNy7W6tZxiTRWoyx=CafAA9SE_fgrW3mjGRY1+J=89QYZ+g@mail.gmail.com>
References:  <CAM5tNy6v3N-uiULGA0vb_2s0GK1atRh6TYNDGfYMK0PkP46BbQ@mail.gmail.com> <1020651467.1592.1709280020993@localhost> <CAM5tNy4ras1NN+LC7=gpyFqEefHpWCrSV-_aSyn-D6Kt8Fvw6Q@mail.gmail.com> <1608503215.4731.1709633602802@localhost> <CAM5tNy7W6tZxiTRWoyx=CafAA9SE_fgrW3mjGRY1+J=89QYZ+g@mail.gmail.com>

On Tue, Mar 5, 2024 at 6:34=E2=80=AFAM Rick Macklem <rick.macklem@gmail.com=
> wrote:
>
> On Tue, Mar 5, 2024 at 2:13 AM Ronald Klop <ronald-lists@klop.ws> wrote:
> >
> >
> > Van: Rick Macklem <rick.macklem@gmail.com>
> > Datum: vrijdag, 1 maart 2024 15:23
> > Aan: Ronald Klop <ronald-lists@klop.ws>
> > CC: Garrett Wollman <wollman@bimajority.org>, stable@freebsd.org, rmacklem@freebsd.org
> > Onderwerp: Re: 13-stable NFS server hang
> >
> > On Fri, Mar 1, 2024 at 12:00 AM Ronald Klop <ronald-lists@klop.ws> wrote:
> > >
> > > Interesting read.
> > >
> > > Would it be possible to separate locking for admin actions, like a client
> > > mounting an fs, from the traffic flowing for file operations?
> > Well, the NFS server does not really have any concept of a mount.
> > What I am referring to is the ClientID maintained for NFSv4 mounts,
> > which all the open/lock/session/layout state hangs off of.
> >
> > For most cases, this state information can safely be accessed/modified
> > via a mutex, but there are three exceptions:
> > - creating a new ClientID (which is done by the ExchangeID operation)
> >   and typically happens when an NFS client does a mount.
> > - delegation Recall (which only happens when delegations are enabled).
> >   This is one of the reasons delegations are not enabled by default on
> >   the FreeBSD server.
> > - the DestroyClientID, which is typically done by an NFS client during
> >   dismount.
> > For these cases, it is just too difficult to do them without sleeping.
> > As such, there is a sleep lock which the nfsd threads normally acquire
> > shared when doing NFSv4 operations, but for the above cases the lock is
> > acquired exclusive.
> > - I had to give the exclusive lock priority over shared lock acquisition
> >   (it is a custom locking mechanism with assorted weirdnesses) because
> >   without that, someone reported that new mounts took up to 1/2hr to occur.
> >   (The exclusive locker waited for 30min before there was a moment when
> >   none of the other nfsd threads were busy.)
> >   Because of this priority, once an nfsd thread requests the exclusive
> >   lock, all other nfsd threads executing NFSv4 RPCs block after releasing
> >   their shared lock, until the exclusive locker releases the exclusive lock.
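
To make that behaviour concrete, here is a rough sketch of a
shared/exclusive sleep lock that prioritizes exclusive requests, built on
FreeBSD's mutex and msleep()/wakeup() primitives. All names are
hypothetical; this is not the actual nfsv4_lock() code, which has more
states (and the weirdnesses mentioned above):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct nfsd_sxlock {
	struct mtx mtx;		/* protects the fields below; mtx_init() at setup */
	int	shared_cnt;	/* number of shared holders */
	int	excl_wanted;	/* an exclusive request is pending */
	int	excl_held;	/* the exclusive lock is held */
};

/* Shared acquire: blocks as soon as an exclusive request is pending. */
static void
sx_acquire_shared(struct nfsd_sxlock *lp)
{

	mtx_lock(&lp->mtx);
	while (lp->excl_wanted || lp->excl_held)
		msleep(lp, &lp->mtx, PZERO, "nfssxs", 0);
	lp->shared_cnt++;
	mtx_unlock(&lp->mtx);
}

static void
sx_release_shared(struct nfsd_sxlock *lp)
{

	mtx_lock(&lp->mtx);
	if (--lp->shared_cnt == 0)
		wakeup(lp);	/* last holder; a waiting exclusive locker can go */
	mtx_unlock(&lp->mtx);
}

/*
 * Exclusive acquire: announce the request first (this is the priority:
 * new shared requests now block), then wait for shared holders to drain.
 */
static void
sx_acquire_excl(struct nfsd_sxlock *lp)
{

	mtx_lock(&lp->mtx);
	while (lp->excl_wanted || lp->excl_held)
		msleep(lp, &lp->mtx, PZERO, "nfssxw", 0);
	lp->excl_wanted = 1;
	while (lp->shared_cnt > 0)
		msleep(lp, &lp->mtx, PZERO, "nfssxe", 0);
	lp->excl_wanted = 0;
	lp->excl_held = 1;
	mtx_unlock(&lp->mtx);
}

static void
sx_release_excl(struct nfsd_sxlock *lp)
{

	mtx_lock(&lp->mtx);
	lp->excl_held = 0;
	wakeup(lp);		/* wake both shared and exclusive waiters */
	mtx_unlock(&lp->mtx);
}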
> >
> > In summary, NFSv4 has certain advantages over NFSv3, but it comes
> > with a lot of state complexity. It just is not feasible to manipulate
> > all that state with only mutex locking.
> >
> > rick
> >
> > >
> > > For example, ongoing file operations could have a read-only view/copy
> > > of the mount table, and only new operations would have to wait.
> > > But the mount never needs to wait for ongoing operations before locking
> > > the structure.
> > >
> > > Just a thought in the morning
> > >
> > > Regards,
> > > Ronald.
> > >
> > > Van: Rick Macklem <rick.macklem@gmail.com>
> > > Datum: 1 maart 2024 00:31
> > > Aan: Garrett Wollman <wollman@bimajority.org>
> > > CC: stable@freebsd.org, rmacklem@freebsd.org
> > > Onderwerp: Re: 13-stable NFS server hang
> > >
> > > On Wed, Feb 28, 2024 at 4:04PM Rick Macklem wrote:
> > > >
> > > > On Tue, Feb 27, 2024 at 9:30PM Garrett Wollman wrote:
> > > > >
> > > > > Hi, all,
> > > > >
> > > > > We've had some complaints of NFS hanging at unpredictable intervals.
> > > > > Our NFS servers are running a 13-stable from last December, and
> > > > > tonight I sat in front of the monitor watching `nfsstat -dW`.  I was
> > > > > able to clearly see that there were periods when NFS activity would
> > > > > drop *instantly* from 30,000 ops/s to flat zero, which would last
> > > > > for about 25 seconds before resuming exactly as it was before.
> > > > >
> > > > > I wrote a little awk script to watch for this happening and run
> > > > > `procstat -k` on the nfsd process, and I saw that all but two of the
> > > > > service threads were idle.  The three nfsd threads that had non-idle
> > > > > kstacks were:
> > > > >
> > > > >   PID    TID COMM                TDNAME              KSTACK
> > > > >   997 108481 nfsd                nfsd: master        mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc amd64_syscall fast_syscall_common
> > > > >   997 960918 nfsd                nfsd: service       mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
> > > > >   997 962232 nfsd                nfsd: service       mi_switch _cv_wait txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
> > > > >
> > > > > I'm suspicious of two things: first, the copy_file_range RPC;
> > > > > second, the "master" nfsd thread is actually servicing an RPC which
> > > > > requires obtaining a lock.  The "master" getting stuck while
> > > > > performing client RPCs is, I believe, the reason NFS service grinds
> > > > > to a halt when a client tries to write into a near-full filesystem,
> > > > > so this problem would be more evidence that the dispatching function
> > > > > should not be mixed with actual operations.  I don't know what the
> > > > > clients are doing, but is it possible that nfsrvd_copy_file_range is
> > > > > holding a lock that is needed by one or both of the other two
> > > > > threads?
> > > > >
> > > > > Near-term I could change nfsrvd_copy_file_range to just
> > > > > unconditionally return NFSERR_NOTSUP and force the clients to fall
> > > > > back, but I figured I would ask if anyone else has seen this.
> > > > I have attached a little patch that should limit the server's Copy
> > > > size to vfs.nfsd.maxcopyrange (default of 10Mbytes).
> > > > Hopefully this makes sure that the Copy does not take too long.
> > > >
> > > > You could try this instead of disabling Copy. It would be nice to know
> > > > if this is sufficient. (If not, I'll probably add a sysctl to disable
> > > > Copy.)
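
The patch itself is attached rather than inline; the idea is just to clamp
the requested length before doing the I/O. A minimal sketch, assuming a
hypothetical nfsrv_maxcopyrange variable backing the vfs.nfsd.maxcopyrange
sysctl (the attached patch may differ in detail); a short reply is legal
for the NFSv4.2 Copy operation, so the client simply issues another Copy
for the remainder:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

SYSCTL_DECL(_vfs_nfsd);

/* Hypothetical knob; not necessarily the committed variable name. */
static int nfsrv_maxcopyrange = 10 * 1024 * 1024;	/* 10Mbytes */
SYSCTL_INT(_vfs_nfsd, OID_AUTO, maxcopyrange, CTLFLAG_RW,
    &nfsrv_maxcopyrange, 0, "Maximum size of a server-side Copy");

/*
 * Called by the Copy handler before starting the I/O.  Returning a
 * shorter length produces a short Copy reply, and the client then
 * sends another Copy for the rest of the range.
 */
static off_t
nfsrv_clampcopylen(off_t len)
{

	if (len > nfsrv_maxcopyrange)
		len = nfsrv_maxcopyrange;
	return (len);
}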
> > > I did a quick test without/with this patch, where I copied a 1Gbyte file.
> > >
> > > Without this patch, the Copy RPCs mostly replied in just under 1sec
> > > (which is what the flag requests), but took over 4sec for one of the
> > > Copy operations. This implies that one Read/Write of 1Mbyte on the
> > > server took over 3 seconds.
> > > I noticed the first Copy did over 600Mbytes, but the rest did about
> > > 100Mbytes each, and it was one of these 100Mbyte Copy operations that
> > > took over 4sec.
> > >
> > > With the patch, there were a lot more Copy RPCs (as expected) of
> > > 10Mbytes each, and they took a consistent 0.25-0.3sec to reply. (This
> > > is a test of a local mount on an old laptop, so nowhere near a server
> > > hardware config.)
> > >
> > > So, the patch might be sufficient?
> > >
> > > It would be nice to avoid disabling Copy, since it avoids reading the
> > > data into the client and then writing it back to the server.
> > >
> > > I will probably commit both patches (10Mbyte clip of Copy size and
> > > disabling Copy) to main soon, since I cannot say if clipping the size
> > > of the Copy will always be sufficient.
> > >
> > > Please let us know how trying these patches goes, rick
> > >
> > > >
> > > > rick
> > > >
> > > > >
> > > > > -GAWollman
> > > > >
> > > > >
> > >
> > > ________________________________
> > >
> > >
> > >
> > >
> > ________________________________
> >
> >
> >
> > Hi Rick,
> >
> > You are much more into the NFS code than I am, so excuse me if what I'm
> > saying does not make sense.
> >
> > I was reading nfsrvd_compound(), which calls nfsrvd_copy_file_range()
> > via the nfsrv4_ops2 structure.
> > nfsrvd_compound() holds a lock or refcount on nfsv4rootfs_lock during
> > the whole operation, which is why nfsrv_setclient() is waiting in this
> > specific case of "NFS server hang".
> >
> > But I don't see what is being modified in the nfsdstate after the IO
> > operation ends, or why the IO operation itself needs the lock on the
> > nfsdstate. IMHO the in-progress IOs will happen anyway, regardless of
> > the nfsdstate. Changes to the nfsdstate during an IO operation would
> > not affect the ongoing IO operation.
> > Wouldn't it be possible to lock nfsv4rootfs_lock, do checks on or
> > modify the nfsdstate as needed, unlock, and then do the IO operation,
> > as sketched below? That would remove a lot of the possible lock
> > contention during (u)mount.
> > Otherwise, if we do modify the nfsdstate after the IO operation, isn't
> > it possible to relock nfsv4rootfs_lock after the IO operation finishes?
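
In rough pseudocode (hypothetical helper names, not actual FreeBSD
functions), the suggested reordering amounts to:

/* Suggested reordering, as pseudocode; all names are hypothetical. */
nfsv4_lock_shared(&nfsv4rootfs_lock);
error = nfsd_check_update_state(nd);	/* all nfsdstate work up front */
nfsv4_unlock_shared(&nfsv4rootfs_lock);
if (error == 0)
	error = nfsd_do_io(nd);		/* e.g. the Copy's read/write loop */
/* relock here if the nfsdstate must be updated after the IO */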
> Well, there are a couple of reasons. Every implementation has design
> tradeoffs:
> 1 - An NFSv4 RPC is a compound, which can be a pretty arbitrary list of
>      operations.
>      As such, the NFSv4 server does not know if an open/byte-range lock
>      is coming after the operation it is currently performing, since the
>      implementation does not pre-parse the entire compound. (I had a
>      discussion w.r.t. pre-parsing with one of the main Linux knfsd
>      server maintainers and he noted that he was not aware of any extant
>      server that did pre-parse the compound. Although it would be useful
>      for the server to have the ordered list of operations before
>      commencing the RPC, we both agreed it was too hard to implement.)
>      --> It could possibly unlock/relock later, but see #2. Also, if
>           relocking took a long time, it would result in the compound
>           RPC taking too long (see below).
> 2 - For NFSv4.1/4.2 almost all RPCs are handled by a session. One
>      non-arbitrary part of almost all NFSv4.1/4.2 RPCs is that the
>      Sequence operation (the one that handles the session) must come
>      first.
>      Session(s) are associated with the ClientID, which means the
>      ClientID and the session must not go away while the compound RPC is
>      in progress.
>      - This is ensured by the shared lock on the ClientID (that
>        nfsv4rootfs_lock).
> Since 99.99% of operations can be done with the shared lock, I do not
> think there is a lot of contention.
>
> Although there is nothing wired down in the RFCs, there is an
> understanding in the NFSv4 community that a server should reply to an
> RPC in a reasonable time, typically assumed to be 1-2sec. If the server
> does this, then a delay for the rare case of a new ClientID shouldn't be
> a big problem?
> (There is also delegation recall, which is one reason why delegations
> are not enabled by default.)
>
> Btw, the RFC does define an asynchronous Copy, where the operation
> replies as soon as the copy is started and the server notifies the
> client of completion later. I have not implemented this, because it
> introduces complexities that I do not want to deal with.
> For example, what happens when the server crashes/reboots while the copy
> is in progress? The file is left in a non-deterministic state, depending
> on what the client does when it does not receive the completion notify.
>
Oh, I should also note that the "shared lock" is actually called a
reference count in the code, and is there to ensure that the
ClientID/Session does not go away during execution of the compound.

The problem in this case (which I should revisit) was that I could not
figure out how to safely add a new ClientID while other nfsd threads were
in progress performing other RPCs. Due to retries etc., there might be
another RPC in progress using the ClientID.

One thing to note here is that the identity of the ClientID
is not known until the Sequence operation has been performed. (And there is
cruft for NFSv4.0, since it does not have a Sequence operation.)
As such, the RPC must be in progress before it is known.
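
Schematically, the constraint looks like this (hypothetical names, not the
actual nfsrvd_compound() code):

/*
 * Hypothetical sketch of compound processing.  The reference must be
 * held before the first operation is executed, because only the
 * Sequence operation reveals which session/ClientID the RPC is using,
 * and the operations are not pre-parsed.
 */
nfsv4_getref(&nfsv4rootfs_lock);	/* ClientID/session cannot go away */
clp = NULL;
for (i = 0; i < numops; i++) {
	op = parse_next_op(nd);
	if (i == 0 && op == OP_SEQUENCE)
		clp = find_session_clientid(nd);	/* identity known only now */
	error = do_operation(nd, op, clp);
	if (error != 0)
		break;			/* a compound stops at the first failure */
}
nfsv4_relref(&nfsv4rootfs_lock);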

> rick
> >
> > I hope this makes some sense, and thanks for all your work on the NFS
> > code.
> >
> > Regards,
> > Ronald.
> >
> >
> >


