Date:      Wed, 6 Mar 2024 12:46:16 +0100 (CET)
From:      Ronald Klop <ronald-lists@klop.ws>
To:        Rick Macklem <rick.macklem@gmail.com>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@bimajority.org>, stable@freebsd.org
Subject:   Re: 13-stable NFS server hang
Message-ID:  <1277770972.4770.1709725576695@localhost>
In-Reply-To: <CAM5tNy45ovMMrGHx0_tKUPC3in5WfjrtbBpv23k99CuxFxY21w@mail.gmail.com>
References:  <CAM5tNy6v3N-uiULGA0vb_2s0GK1atRh6TYNDGfYMK0PkP46BbQ@mail.gmail.com> <1020651467.1592.1709280020993@localhost> <CAM5tNy4ras1NN+LC7=gpyFqEefHpWCrSV-_aSyn-D6Kt8Fvw6Q@mail.gmail.com> <1608503215.4731.1709633602802@localhost> <CAM5tNy7W6tZxiTRWoyx=CafAA9SE_fgrW3mjGRY1+J=89QYZ+g@mail.gmail.com> <CAM5tNy45ovMMrGHx0_tKUPC3in5WfjrtbBpv23k99CuxFxY21w@mail.gmail.com>


From: Rick Macklem <rick.macklem@gmail.com>
Date: Tuesday, 5 March 2024 15:43
To: Ronald Klop <ronald-lists@klop.ws>
CC: rmacklem@freebsd.org, Garrett Wollman <wollman@bimajority.org>, stable@freebsd.org
Subject: Re: 13-stable NFS server hang
> 
> On Tue, Mar 5, 2024 at 6:34AM Rick Macklem <rick.macklem@gmail.com> wrote:
> >
> > On Tue, Mar 5, 2024 at 2:13AM Ronald Klop <ronald-lists@klop.ws> wrote:
> > >
> > >
> > > From: Rick Macklem <rick.macklem@gmail.com>
> > > Date: Friday, 1 March 2024 15:23
> > > To: Ronald Klop <ronald-lists@klop.ws>
> > > CC: Garrett Wollman <wollman@bimajority.org>, stable@freebsd.org, rmacklem@freebsd.org
> > > Subject: Re: 13-stable NFS server hang
> > >
> > > On Fri, Mar 1, 2024 at 12:00AM Ronald Klop <ronald-lists@klop.ws> wrote:
> > > >
> > > > Interesting read.
> > > >
> > > >  Would it be possible to separate the locking for admin actions, like a client mounting an fs, from the locking for regular file-operation traffic?
> > > Well, the NFS server does not really have any concept of a mount.
> > > What I am referring to is the ClientID maintained for NFSv4 mounts,
> > > which all the open/lock/session/layout state hangs off of.
> > >
> > > For most cases, this state information can safely be accessed/modified
> > > via a mutex, but there are three exceptions:
> > > - creating a new ClientID (done by the ExchangeID operation), which
> > >   typically happens when an NFS client does a mount.
> > > - a delegation Recall (which only happens when delegations are enabled).
> > >   This is one of the reasons delegations are not enabled by default on
> > >   the FreeBSD server.
> > > - DestroyClientID, which is typically done by an NFS client during dismount.
> > > For these cases, it is just too difficult to do them without sleeping.
> > > As such, there is a sleep lock which the nfsd threads normally acquire shared
> > > when doing NFSv4 operations, but for the above cases the lock is acquired
> > > exclusive.
> > > - I had to give the exclusive lock priority over shared lock acquisition
> > >   (it is a custom locking mechanism with assorted weirdnesses) because
> > >   without that, someone reported that new mounts took up to half an hour
> > >   to occur. (The exclusive locker waited for 30 minutes before all of the
> > >   other nfsd threads became idle.)
> > >   Because of this priority, once an nfsd thread requests the exclusive lock,
> > >   all other nfsd threads executing NFSv4 RPCs block after releasing their
> > >   shared lock, until the exclusive locker releases the exclusive lock.
> > >
> > > In summary, NFSv4 has certain advantages over NFSv3, but it comes
> > > with a lot of state complexity. It just is not feasible to manipulate all that
> > > state with only mutex locking.
> > >
> > > rick
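
The priority scheme described in Rick's message above can be illustrated with a
small, self-contained sketch. This is not the actual FreeBSD nfsv4_lock code,
only a minimal userspace shared/exclusive sleep lock (the shex_* names are
hypothetical) in which a pending exclusive request blocks new shared acquirers:

    /*
     * Minimal sketch only -- not the FreeBSD nfsv4_lock code.
     * A shared/exclusive sleep lock where a pending exclusive request has
     * priority: once a thread asks for the exclusive lock, new shared
     * acquirers block until the exclusive holder releases it.
     */
    #include <pthread.h>

    struct shex_lock {
            pthread_mutex_t mtx;
            pthread_cond_t  cv;
            int shared_holders;     /* threads currently holding it shared */
            int excl_waiting;       /* pending exclusive requests */
            int excl_held;          /* 1 while held exclusively */
    };

    #define SHEX_INITIALIZER \
            { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 }

    void
    shex_acquire_shared(struct shex_lock *l)
    {
            pthread_mutex_lock(&l->mtx);
            /* Block while exclusive is held *or* merely requested (priority). */
            while (l->excl_held || l->excl_waiting > 0)
                    pthread_cond_wait(&l->cv, &l->mtx);
            l->shared_holders++;
            pthread_mutex_unlock(&l->mtx);
    }

    void
    shex_release_shared(struct shex_lock *l)
    {
            pthread_mutex_lock(&l->mtx);
            if (--l->shared_holders == 0)
                    pthread_cond_broadcast(&l->cv);
            pthread_mutex_unlock(&l->mtx);
    }

    void
    shex_acquire_excl(struct shex_lock *l)
    {
            pthread_mutex_lock(&l->mtx);
            l->excl_waiting++;
            /* Wait for any exclusive holder and for all shared holders to drain. */
            while (l->excl_held || l->shared_holders > 0)
                    pthread_cond_wait(&l->cv, &l->mtx);
            l->excl_waiting--;
            l->excl_held = 1;
            pthread_mutex_unlock(&l->mtx);
    }

    void
    shex_release_excl(struct shex_lock *l)
    {
            pthread_mutex_lock(&l->mtx);
            l->excl_held = 0;
            pthread_cond_broadcast(&l->cv);
            pthread_mutex_unlock(&l->mtx);
    }

In this toy version, an nfsd-like thread would call shex_acquire_shared() around
each NFSv4 RPC and shex_acquire_excl() for ExchangeID/DestroyClientID; once an
exclusive request is pending, new shared acquirers wait, which reproduces the
behaviour described above.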
> > >
> > > >
> > > > For example, ongoing file operations could work from a read-only view/copy of the mount table; only new operations would have to wait.
> > > > The mount would then never need to wait for ongoing operations before locking the structure.
> > > >
> > > > Just a thought in the morning
> > > >
> > > > Regards,
> > > > Ronald.
> > > >
> > > > From: Rick Macklem <rick.macklem@gmail.com>
> > > > Date: 1 March 2024 00:31
> > > > To: Garrett Wollman <wollman@bimajority.org>
> > > > CC: stable@freebsd.org, rmacklem@freebsd.org
> > > > Subject: Re: 13-stable NFS server hang
> > > >
> > > > On Wed, Feb 28, 2024 at 4:04PM Rick Macklem wrote:
> > > > >
> > > > > On Tue, Feb 27, 2024 at 9:30PM Garrett Wollman wrote:
> > > > > >
> > > > > > Hi, all,
> > > > > >
> > > > > > We've had some complaints of NFS hanging at unpredictable intervals.
> > > > > > Our NFS servers are running a 13-stable from last December, and
> > > > > > tonight I sat in front of the monitor watching `nfsstat -dW`.  I was
> > > > > > able to clearly see that there were periods when NFS activity would
> > > > > > drop *instantly* from 30,000 ops/s to flat zero, which would last
> > > > > > for about 25 seconds before resuming exactly as it was before.
> > > > > >
> > > > > > I wrote a little awk script to watch for this happening and run
> > > > > > `procstat -k` on the nfsd process, and I saw that all but two of the
> > > > > > service threads were idle.  The three nfsd threads that had non-idle
> > > > > > kstacks were:
> > > > > >
> > > > > >   PID    TID COMM                TDNAME              KSTACK
> > > > > >   997 108481 nfsd                nfsd: master        mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc amd64_syscall fast_syscall_common
> > > > > >   997 960918 nfsd                nfsd: service       mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
> > > > > >   997 962232 nfsd                nfsd: service       mi_switch _cv_wait txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
> > > > > >
> > > > > > I'm suspicious of two things: first, the copy_file_range RPC; second,
> > > > > > the "master" nfsd thread is actually servicing an RPC which requires
> > > > > > obtaining a lock.  The "master" getting stuck while performing client
> > > > > > RPCs is, I believe, the reason NFS service grinds to a halt when a
> > > > > > client tries to write into a near-full filesystem, so this problem
> > > > > > would be more evidence that the dispatching function should not be
> > > > > > mixed with actual operations.  I don't know what the clients are
> > > > > > doing, but is it possible that nfsrvd_copy_file_range is holding a
> > > > > > lock that is needed by one or both of the other two threads?
> > > > > >
> > > > > > Near-term I could change nfsrvd_copy_file_range to just
> > > > > > unconditionally return NFSERR_NOTSUP and force the clients to fall
> > > > > > back, but I figured I would ask if anyone else has seen this.
> > > > > I have attached a little patch that should limit the server's Copy size
> > > > > to vfs.nfsd.maxcopyrange (default of 10Mbytes).
> > > > > Hopefully this makes sure that the Copy does not take too long.
> > > > >
> > > > > You could try this instead of disabling Copy. It would be nice to know
> > > > > if this is sufficient. (If not, I'll probably add a sysctl to disable Copy.)
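
The attached patch itself is not part of the archived message, but the idea of
clamping a single Copy to a tunable maximum could look roughly like the sketch
below. Only the vfs.nfsd.maxcopyrange name comes from the message above;
nfsd_maxcopyrange and clamp_copy_len are illustrative stand-ins, not the actual
FreeBSD identifiers:

    /*
     * Illustrative sketch, not the actual FreeBSD patch: clamp the length
     * of a single server-side Copy so one RPC never moves more than a
     * tunable number of bytes.  The server reports the number of bytes it
     * actually copied in the COPY reply, and the client continues from
     * that offset with further Copy RPCs.
     */
    #include <stdint.h>

    /* Stand-in for the vfs.nfsd.maxcopyrange tunable (10 Mbytes default). */
    static uint64_t nfsd_maxcopyrange = 10 * 1024 * 1024;

    uint64_t
    clamp_copy_len(uint64_t requested_len)
    {
            if (requested_len > nfsd_maxcopyrange)
                    return (nfsd_maxcopyrange);
            return (requested_len);
    }

Since the COPY reply carries the number of bytes actually copied, the client
simply continues from that offset, which is why the test described below shows
many more, but shorter, Copy RPCs.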
> > > > I did a quick test without/with this patch, where I copied a 1Gbyte file.
> > > >
> > > > Without this patch, the Copy RPCs mostly replied in just under 1sec
> > > > (which is what the flag requests), but took over 4sec for one of the Copy
> > > > operations. This implies that one Read/Write of 1Mbyte on the server
> > > > took over 3 seconds.
> > > > I noticed the first Copy did over 600Mbytes, but the rest did about 100Mbytes
> > > > each and it was one of these 100Mbyte Copy operations that took over 4sec.
> > > >
> > > > With the patch, there were a lot more Copy RPCs (as expected) of 10Mbytes
> > > > each and they took a consistent 0.25-0.3sec to reply. (This is a test of a local
> > > > mount on an old laptop, so nowhere near a server hardware config.)
> > > >
> > > > So, the patch might be sufficient?
> > > >
> > > > It would be nice to avoid disabling Copy, since it avoids reading the data
> > > > into the client and then writing it back to the server.
> > > >
> > > > I will probably commit both patches (10Mbyte clip of Copy size and
> > > > disabling Copy) to main soon, since I cannot say if clipping the size
> > > > of the Copy will always be sufficient.
> > > >
> > > > Please let us know how trying these patches goes, rick
> > > >
> > > > >
> > > > > rick
> > > > >
> > > > > >
> > > > > > -GAWollman
> > > > > >
> > > > > >
> > > >
> > > > ________________________________
> > > >
> > > >
> > > >
> > > >
> > > ________________________________
> > >
> > >
> > >
> > > Hi Rick,
> > >
> > > You are much more into the NFS code than I am, so excuse me if what I'm saying does not make sense.
> > >
> > > I was reading nfsrvd_compound() which calls nfsrvd_copy_file_range() via the nfsrv4_ops2 structure.
> > > Nfsrvd_compound holds a lock or refcount on nfsv4rootfs_lock during the whole operation, which is why nfsrv_setclient() is waiting in this specific case of "NFS server hang".
> > >
> > > But I don't see what is being modified on the nfsdstate after the IO operation ends. Or why the IO operation itself needs the lock to the nfsdstate. IMHO the in-progress IOs will happen anyway regardless of the nfsdstate. Changes to the nfsdstate during an IO operation would not affect the ongoing IO operation.
> > > Wouldn't it be possible to lock the nfsv4rootfs_lock, do checks on or modify the nfsdstate as needed, unlock and then do the IO operation? That would remove a lot of the possible lock contention during (u)mount.
> > > Otherwise, if we do modify the nfsdstate after the IO operation, isn't it possible to relock nfsv4rootfs_lock after the IO operation finishes?
> > Well, there are a couple of reasons. Every implementation has design tradeoffs:
> > 1 - An NFSv4 RPC is a compound, which can be a pretty arbitrary list of
> >      operations. As such, the NFSv4 server does not know if an open/byte-range
> >      lock is coming after the operation it is currently performing, since the
> >      implementation does not pre-parse the entire compound. (I had a discussion
> >      w.r.t. pre-parsing with one of the main Linux knfsd server maintainers and
> >      he noted that he was not aware of any extant server that did pre-parse the
> >      compound. Although it would be useful for the server to have the ordered
> >      list of operations before commencing the RPC, we both agreed it was too
> >      hard to implement.)
> >      --> It could possibly unlock/relock later, but see #2. Also, if relocking
> >           took a long time, it would result in the compound RPC taking too long
> >           (see below).
> > 2 - For NFSv4.1/4.2, almost all RPCs are handled by a session. One
> >      non-arbitrary part of almost all NFSv4.1/4.2 RPCs is that the Sequence
> >      operation (the one that handles the session) must come first.
> >      Session(s) are associated with the ClientID, which means the ClientID and
> >      the session must not go away while the compound RPC is in progress.
> >      - This is ensured by the shared lock on the ClientID (that nfsv4rootfs_lock).
> > Since 99.99% of operations can be done with the shared lock, I do not think
> > there is a lot of contention.
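
A compact way to picture points 1 and 2 together is the overall shape of
compound processing: the shared ClientID/session reference is held for the
whole RPC while operations are handled one at a time. The sketch below is
purely illustrative and self-contained; the op names, handle_op() and
serve_compound() are hypothetical stand-ins, not the FreeBSD nfsd code:

    /*
     * Purely illustrative sketch, not the FreeBSD nfsrvd_compound() code.
     * An NFSv4.1/4.2 compound is an ordered list of operations; the server
     * decodes and services them one at a time, so it does not know what is
     * still coming, and it holds the shared ClientID/session reference for
     * the whole compound.
     */
    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical op identifiers; the real numbers are assigned by the RFCs. */
    enum nfs_op { OP_SEQUENCE, OP_PUTFH, OP_OPEN, OP_READ, OP_COPY };

    static void
    handle_op(enum nfs_op op)
    {
            /* A real server would dispatch to the per-operation routine here. */
            printf("servicing op %d\n", (int)op);
    }

    static void
    serve_compound(const enum nfs_op *ops, uint32_t numops)
    {
            uint32_t i;

            /* Shared reference on the ClientID state acquired here ... */
            for (i = 0; i < numops; i++)
                    handle_op(ops[i]);      /* next op only known at this point */
            /* ... and released here, once the whole compound is done. */
    }

    int
    main(void)
    {
            /* In NFSv4.1/4.2 the first operation must be SEQUENCE. */
            enum nfs_op ops[] = { OP_SEQUENCE, OP_PUTFH, OP_COPY };

            serve_compound(ops, 3);
            return (0);
    }

Because the entire list is serviced under one shared hold, an ExchangeID that
needs the exclusive lock must wait for every in-progress compound to drain,
which is exactly the contention discussed in this thread.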
> >
> > Although there is nothing wired down in the RFCs, there is an understanding
> > in the NFSv4 community that a server should reply to an RPC in a reasonable
> > time, typically assumed to be 1-2sec. If the server does this, then a delay for
> > the rare case of a new ClientID shouldn't be a big problem?
> > (There is also delegation recall, which is one reason why delegations are not
> > enabled by default.)
> >
> > Btw, the RFC does define an asynchronous Copy, where the operation replies
> > as soon as the copy is started and the server notifies the client of completion
> > later. I have not implemented this, because it introduces complexities that
> > I do not want to deal with.
> > For example, what happens when the server crashes/reboots while the copy
> > is in progress? The file is left in a non-deterministic state, depending on what
> > the client does when it does not receive the completion notify.
> >
> Oh, I should also note that the "shared lock" is actually called a reference
> count in the code and is there to ensure that the ClientID/Session does not
> go away during execution of the compound.
> 
> The problem in this case (which I should revisit) was that I could not figure
> out how to safely add a new ClientID while other nfsd threads were in progress
> performing other RPCs. Due to retries etc, there might be another RPC in
> progress using the ClientID.
> 
> One thing to note here is that the identity of the ClientID
> is not known until the Sequence operation has been performed. (And there is
> cruft for NFSv4.0, since it does not have a Sequence operation.)
> As such, the RPC must be in progress before it is known.
> 
> > rick
> > >
> > > I hope this makes sense, and thanks for all your work on the NFS code.
> > >
> > > Regards,
> > > Ronald.
> > >
> > >
> > >
> 
> 
> 


Hi Rick,

Thanks for the elaborate answer.

Would it make sense to have the current RPC/compound have a lock on its ClientID/session, but not on the whole nfsdstate (nfsv4rootfs_lock)?

That way, concurrent requests such as a new mount creating a new ClientID could proceed in parallel, while removing or modifying the locked ClientID would wait for its lock.

Or am I still thinking too simply?

Regards,
Ronald.
 


