Date:      Sun, 23 Sep 2012 19:45:28 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        FS List <freebsd-fs@freebsd.org>
Subject:   Re: testing/review of atomic export update patch
Message-ID:  <1745669351.1062583.1348443928746.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20120923160847.GN37286@deviant.kiev.zoral.com.ua>

Konstantin Belousov wrote:
> On Fri, Sep 21, 2012 at 06:38:41PM -0400, Rick Macklem wrote:
> > Konstantin Belousov wrote:
> > > On Thu, Sep 20, 2012 at 05:55:26PM -0400, Rick Macklem wrote:
> > > > Konstantin Belousov wrote:
> > > > > On Tue, Sep 18, 2012 at 09:34:54AM -0400, Rick Macklem wrote:
> > > > > > Konstantin Belousov wrote:
> > > > > > > On Mon, Sep 17, 2012 at 05:32:44PM -0400, Rick Macklem wrote:
> > > > > > > > Konstantin Belousov wrote:
> > > > > > > > > On Sun, Sep 16, 2012 at 05:41:25PM -0400, Rick Macklem wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > There is a simple patch at:
> > > > > > > > > >   http://people.freebsd.org/~rmacklem/atomic-export.patch
> > > > > > > > > > that can be applied to a kernel + mountd, so that the new
> > > > > > > > > > nfsd can be suspended by mountd while the exports are being
> > > > > > > > > > reloaded. It adds a new "-S" flag to mountd to enable this.
> > > > > > > > > > (This avoids the long-standing bug where clients receive
> > > > > > > > > > ESTALE replies to RPCs while mountd is reloading exports.)
> > > > > > > > >
> > > > > > > > > This looks simple, but also somewhat worrisome. What would
> > > > > > > > > happen if mountd crashes after nfsd suspension is requested,
> > > > > > > > > but before resume is performed?
> > > > > > > > >
> > > > > > > > > Perhaps mountd should check for a suspended nfsd on start
> > > > > > > > > and unsuspend it, if some flag is specified?
> > > > > > > > Well, I think that happens with the patch as it stands.
> > > > > > > >
> > > > > > > > suspend is done if the "-S" option is specified, but that is a
> > > > > > > > no-op if it is already suspended. The resume is done no matter
> > > > > > > > what flags are provided, so mountd will always try to do a
> > > > > > > > "resume".
> > > > > > > > --> get_exportlist() is always called when mountd is started up
> > > > > > > >     and it does the resume unconditionally when it completes.
> > > > > > > >     If mountd repeatedly crashes before completing
> > > > > > > >     get_exportlist() when it is started up, the exports will
> > > > > > > >     all be messed up, so having the nfsd threads suspended
> > > > > > > >     doesn't seem so bad for this case (which hopefully never
> > > > > > > >     happens;-).
> > > > > > > >
> > > > > > > > Both suspend and resume are just no-ops for unpatched kernels.
> > > > > > > >
> > > > > > > > Maybe the comment in front of "resume" should explicitly
> > > > > > > > explain this, instead of saying resume is harmless to do under
> > > > > > > > all conditions?
> > > > > > > >
> > > > > > > > Thanks for looking at it, rick
> > > > > > > I see.
> > > > > > >
> > > > > > > My other note is that there is no protection against parallel
> > > > > > > instances of suspend/resume happening. For instance, one thread
> > > > > > > could set suspend_nfsd = 1 and be descheduled, while another
> > > > > > > executes the resume code sequence in the meantime. The latter
> > > > > > > would see suspend_nfsd != 0 while nfsv4rootfs_lock is not held,
> > > > > > > and try to unlock it; it seems that nfsv4_unlock would silently
> > > > > > > exit. The suspending thread then resumes and obtains the lock.
> > > > > > > You end up with suspend_nfsd == 0 but the lock held.
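Spelled out as an interleaving, the race described above goes:

--- the race, step by step ---
	/* Thread A (suspend): suspend_nfsd = 1;        <-- descheduled here  */
	/* Thread B (resume):  sees suspend_nfsd != 0                         */
	/* Thread B:           nfsv4_unlock() -- silent exit, lock not held   */
	/* Thread B:           suspend_nfsd = 0;                              */
	/* Thread A (resumes): acquires nfsv4rootfs_lock                      */
	/* Result:             suspend_nfsd == 0, but the lock is held.       */
--- end ---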
> > > > > > Yes. I had assumed that mountd would be the only thing using
> > > > > > these syscalls and it is single-threaded. (The syscalls can only
> > > > > > be done by root, for the obvious reasons.;-)
> > > > > >
> > > > > > Maybe the following untested version of the syscalls would be
> > > > > > better, since it would allow multiple concurrent calls to either
> > > > > > suspend or resume. (There would still be an indeterminate case if
> > > > > > one thread called resume concurrently with another few calling
> > > > > > suspend, but that is unavoidable, I think?)
> > > > > >
> > > > > > Again, thanks for the comments, rick
> > > > > > --- untested version of syscalls ---
> > > > > > 	} else if ((uap->flag & NFSSVC_SUSPENDNFSD) != 0) {
> > > > > > 		NFSLOCKV4ROOTMUTEX();
> > > > > > 		if (suspend_nfsd == 0) {
> > > > > > 			/* Lock out all nfsd threads */
> > > > > > 			igotlock = 0;
> > > > > > 			/* Loop until we get the lock or another */
> > > > > > 			/* thread has already suspended the nfsd. */
> > > > > > 			while (igotlock == 0 && suspend_nfsd == 0) {
> > > > > > 				igotlock = nfsv4_lock(&nfsv4rootfs_lock, 1,
> > > > > > 				    NULL, NFSV4ROOTLOCKMUTEXPTR, NULL);
> > > > > > 			}
> > > > > > 			suspend_nfsd = 1;
> > > > > > 		}
> > > > > > 		NFSUNLOCKV4ROOTMUTEX();
> > > > > > 		error = 0;
> > > > > > 	} else if ((uap->flag & NFSSVC_RESUMENFSD) != 0) {
> > > > > > 		NFSLOCKV4ROOTMUTEX();
> > > > > > 		if (suspend_nfsd != 0) {
> > > > > > 			/* Release the exclusive lock, */
> > > > > > 			/* letting the nfsd threads run again. */
> > > > > > 			nfsv4_unlock(&nfsv4rootfs_lock, 0);
> > > > > > 			suspend_nfsd = 0;
> > > > > > 		}
> > > > > > 		NFSUNLOCKV4ROOTMUTEX();
> > > > > > 		error = 0;
> > > > > > 	}
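To make the intended usage concrete, a minimal userland sketch of how
mountd's reload path might drive these two syscalls. This is not part of
the patch; it assumes the NFSSVC_SUSPENDNFSD/NFSSVC_RESUMENFSD flags above
are passed through nfssvc(2), that "suspend_flag" records whether -S was
given, and that get_exportlist() is mountd's existing reload routine.

--- illustrative sketch, not from the patch ---
	/*
	 * Suspend the nfsd threads (only when -S was given), reload the
	 * exports, then always resume.  Resume is harmless if nothing is
	 * suspended, and both calls are no-ops on unpatched kernels.
	 */
	if (suspend_flag != 0 && nfssvc(NFSSVC_SUSPENDNFSD, NULL) < 0)
		syslog(LOG_ERR, "suspend of nfsd failed: %m");
	get_exportlist();
	if (nfssvc(NFSSVC_RESUMENFSD, NULL) < 0)
		syslog(LOG_ERR, "resume of nfsd failed: %m");
--- end sketch ---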
> > > > >
> > > > > From a cursory look, this variant is an improvement, mostly by
> > > > > taking the interlock before testing suspend_nfsd, and by using the
> > > > > while loop.
> > > > >
> > > > > Is it possible to also make the sleep for the lock interruptible,
> > > > > so that a blocked mountd could be killed by a signal?
> > > > Well, it would require some coding: an extra argument to nfsv4_lock()
> > > > to indicate to do so, and then either the caller would have to check
> > > > for a pending termination signal when it returns 0 (indicating it
> > > > didn't get the lock), or a new return value to indicate EINTR. The
> > > > latter would require all the calls to it to be changed to recognize
> > > > the new third return case. Because there are a lot of these calls,
> > > > I'd tend towards just having the caller check for a pending signal.
> > > >
> > > > Not sure if it would make much difference though. The only time it
> > > > would get stuck in nfsv4_lock() is if the nfsd threads are all wedged,
> > > > and in that case having mountd wedged too probably doesn't make much
> > > > difference, since the NFS service is toast anyhow.
> > > >
> > > > If you think it is worth doing, I can add that. I basically see this
> > > > as a stop-gap fix until such time as something like nfse is done, but
> > > > since I haven't the time to look at nfse right now, I have no idea
> > > > when/if that might happen.
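For concreteness, the "caller checks for a pending signal" approach being
discussed might look like the sketch below. PCATCH is what would let the
msleep(9) inside nfsv4_lock() be interrupted by a signal; the extra-argument
variant (called nfsv4_lock_sig() here) and the SIGPENDING() check in the
caller are assumptions about code that was never written.

--- illustrative sketch, not from the patch ---
	igotlock = 0;
	while (igotlock == 0 && suspend_nfsd == 0) {
		/* Hypothetical variant that sleeps with PCATCH set. */
		igotlock = nfsv4_lock_sig(&nfsv4rootfs_lock, 1,
		    NULL, NFSV4ROOTLOCKMUTEXPTR, NULL);
		if (igotlock == 0 && SIGPENDING(curthread)) {
			/* A signal is pending; give up instead of retrying. */
			error = EINTR;
			break;
		}
	}
--- end sketch ---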
> > >
> > > Ok, please go ahead with the patch. Having the patch even in its
> > > current form is obviously better than not having it. If the wedged
> > > mountd turns out to be annoying enough for me, I will make the change.
> > >
> > > Thanks.
> > Oops, I spoke too soon. When I took a look at the code, I realized that
> > having nfsv4_lock() return when a pending signal interrupts the msleep()
> > isn't easy. As such, I think I'll leave it out of the patch for now.
> >
> > For those who find these things interesting, the reason the above is
> > hard is the funny intentional semantics that nfsv4_lock() implements.
> > Most of the time, the nfsd threads can concurrently handle the NFSv4
> > state structures, using a mutex to serialize access to the lists and
> > never sleeping while doing so. However, there are a few cases (mainly
> > delegation recall) where sleeping, and knowing that no other thread is
> > modifying the lists, is necessary.
> >
> > As such, nfsv4_lock() will be called by potentially many nfsd threads
> > (up to 200+) wanting this exclusive sleep lock. However, it is coded so
> > that only the first one that wakes up after the shared locks (I call it
> > a ref count in the code) have been released gets the exclusive lock.
> > The rest of the threads wake up after the first thread releases the
> > exclusive lock, but simply return without getting the lock. This avoids
> > up to 200+ threads acquiring the exclusive lock in turn, going "oh, I
> > don't need it since that other thread already got the work done",
> > releasing the exclusive lock, and getting a shared one. (It also gives
> > exclusive lock requests priority over shared lock requests, so that
> > nfsv4_lock() won't wait indefinitely for the exclusive lock.)
> >
> > The above is done by having the first thread that wakes up once the
> > shared locks are released clear the "want an exclusive lock" flag as it
> > acquires the lock.
> >
> > If a call were to return due to a signal, it wouldn't know whether it
> > should clear the "want an exclusive lock" flag, since it wouldn't know
> > whether other threads currently want the lock.
> My take is that the thread which got EINTR from msleep() shall not clear
> the flag. It shall not get the lock anyway.
> 
The problem is that the call will have set this flag (NFSV4LOCK_LOCKWANTED)
before it did the msleep(). If no other thread currently msleep()ing on
this lock also set it, then exiting without clearing it leaves the flag
bogusly set. That means other callers (ones that call with iwantlock == 0)
will get stuck in the function. I know it sounds weird that threads call
nfsv4_lock() when they don't want the lock, but they do so to block while
any other thread that does want the lock gets/releases it, to avoid
starving the thread(s) that do want the lock.
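To make the problem easier to see, here is a much-simplified sketch of the
acquire loop being described (field and constant names are illustrative,
not the actual sys/fs/nfs code):

--- simplified sketch, not the actual code ---
	if (iwantlock != 0)
		lp->lk_flags |= LOCKWANTED;	/* announce the exclusive request */
	for (;;) {
		if ((lp->lk_flags & LOCKWANTED) == 0)
			return (0);	/* another thread took it and did the work */
		if (lp->lk_usecnt == 0 && (lp->lk_flags & LOCKED) == 0) {
			/* First thread to get here clears the flag for all. */
			lp->lk_flags &= ~LOCKWANTED;
			lp->lk_flags |= LOCKED;
			return (1);	/* got the exclusive lock */
		}
		msleep(lp, mutex, PZERO, "nfsv4lk", 0);
		/*
		 * If this msleep() were to return on a signal, LOCKWANTED
		 * may be set solely on our behalf, or also for another
		 * sleeper; we cannot tell, which is exactly the problem.
		 */
	}
--- end sketch ---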

> >
> > This could probably be fixed by adding a count of how many threads are
> > currently sleeping, waiting on the "want an exclusive lock" flag, but
> > that's too scary for me to do unless it really is needed. (As you might
> > have guessed, it's pretty easy to break this in subtle ways and I'm a
> > chicken;-)
> 
> BTW, nfsv4_lock/unlock desperately need an assertion that the mutex is
> locked on entry.
Yep, I can do that, rick
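For reference, that assertion is a one-liner with mtx_assert(9). A sketch,
assuming the five-argument shape of nfsv4_lock() visible in the quoted
code above (argument names are guesses):

--- illustrative sketch ---
	int
	nfsv4_lock(struct nfsv4lock *lp, int iwantlock, int *isleptp,
	    struct mtx *mutex, struct mount *mp)
	{

		/* Caller must hold the interlock; panics under INVARIANTS. */
		mtx_assert(mutex, MA_OWNED);
		/* ... rest of the function unchanged ... */
	}
--- end sketch ---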
