Date:      Fri, 20 May 2016 12:23:48 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        fs@freebsd.org
Subject:   Re: fix for per-mount i/o counting in ffs
Message-ID:  <20160520092348.GV89104@kib.kiev.ua>
In-Reply-To: <20160520074427.W1151@besplex.bde.org>
References:  <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> <20160519104128.GN89104@kib.kiev.ua> <20160520074427.W1151@besplex.bde.org>

On Fri, May 20, 2016 at 09:27:38AM +1000, Bruce Evans wrote:
> On Thu, 19 May 2016, Konstantin Belousov wrote:
> 
> > On Thu, May 19, 2016 at 12:20:19PM +1000, Bruce Evans wrote:
> >> On Thu, 19 May 2016, Bruce Evans wrote:
> >>
> >>> On Thu, 19 May 2016, Bruce Evans wrote:
> >>>
> >>>>  ...
> >>>> I think the following works to prevent multiple mounts via all of the
> >>>> known buggy paths: early in every fsmount():
> >>
> >> Here is a lightly tested version:
> >
> > There is no need to protect si_mountpt with any locking; the field
> > itself serves well enough as a lock, and also prevents parallel mounts
> > of the same device.  I changed the assignment to atomic_cmpset, which
> > is enough there.  It is somewhat a pity that this reliably disables
> > multiple ro mounts of the same volume.
> 
> I used a mutex since it is simpler.
> 
> I think your version needs atomic ops for resetting the pointer, and
> maybe acquire/release too.  It has locking that is very similar to a
> mutex.  Mutexes use _mtx_obtain_lock = atomic_cmpset_acq_ptr and
> _mtx_release_lock = atomic_store_rel_ptr.  This is already delicately
> weak -- full sequential consistency is not required.  Then on x86,
> we (you) only recently finished optimizing atomic_store_rel so that
> it is as weak as possible (just a compiler membar before an ordinary
> store).
> 
> Maybe even weaker locking is enough here, but this is too hard to
> understand.
Well, I do not think that barriers would add much there, since we really
do not care if two almost-parallel mounts both fail, and other locking
provides enough synchronization points.  On the other hand, having
explicit barriers makes si_mountpt act as a real semaphore.  Unlike a
mutex, it attributes ownership of the device to a mount point, and
not to the locking thread.

So I added acq/rel.
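
For the archives, the protocol this gives is essentially the following
(an illustrative sketch, not part of the patch; the pointer atomics
take uintptr_t, so real code needs the casts shown):

	/*
	 * Acquire: claim the device for mp iff nobody owns it.  The
	 * acq barrier orders the claim before the later loads and
	 * stores done while mounting.
	 */
	if (atomic_cmpset_acq_ptr((uintptr_t *)&dev->si_mountpt, 0,
	    (uintptr_t)mp) == 0)
		return (EBUSY);		/* device already owned */

	/* ... mount, use, unmount ... */

	/*
	 * Release: the rel barrier makes all earlier stores visible
	 * before ownership is given up, like a semaphore post.
	 */
	atomic_store_rel_ptr((uintptr_t *)&dev->si_mountpt, 0);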

> 
> > There is no need to move the assignment of NULL to dev->si_mountpt later
> > in ffs_unmount(); at the moment where the assignment is performed it is
> > safe for another thread to start another mount.
> 
> I already noticed that it was almost as late as possible (could be moved
> 1 statement later) and not worth moving.
> 
> But to even reason about ordering, you need atomic ops with acquire/
> release semantics.  There are dev_ref() and dev_rel() calls nearby.  The
> implementation of these probably has to, and in fact does, give some ordering.
> The details are too hard to understand.  In ffs_unmount() I think it is
> actually ordering given by vrele() that makes things work:
> 
> > 	PICKUP_GIANT();
> > -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> > -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> > +	ump->um_dev->si_mountpt = NULL;
> > 	vrele(ump->um_devvp);
> > 	dev_rel(ump->um_dev);
> 
> We want the store to si_mountpt to become visible before the vnode is
> unlocked.  Otherwise, a new mount can lock the vnode and fail with
> EBUSY because it sees si_mountpt != NULL.  We have to know implementation
> details of vrele() to know that this happens.
Yes, we do not care about this window.  For this to happen, the mount
request must be issued before the unmount request returns, and the mount
caller cannot prove that his mount attempt started before the unmount
had progressed far enough.
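
To spell the window out (an illustrative interleaving, not from the
patch):

	unmounting thread			new mount attempt
	-----------------			-----------------
	stores NULL to si_mountpt
	  (store not yet visible)
						locks devvp
						reads stale si_mountpt != NULL
						fails with EBUSY
	vrele()/dev_rel() ordering
	  eventually publishes the store

The new mount gets a spurious EBUSY, but as said above it cannot tell
this apart from having raced with the still-running unmount.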

> 
> > I still want to keep devvp locked for long enough to cover the bufobj
> > hacking, and I do not want to move the bufobj.bo_ops change to before
> > g_vfs_open() succeeds.
> 
> I didn't move it before g_vfs_open(), but before VOP_UNLOCK().  I think
> v_bufobj is never cleared, but garbage in it is harmless except in the
> multiple-mounts case which is now disallowed.
So did I.

> 
> > I also wanted to remove GIANT dances, but this requires geom patch,
> > which I will mail separately.
> 
> OK.  I saw the other mail.
> 
> > diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> > index 712fc21..21425f5 100644
> > --- a/sys/ufs/ffs/ffs_vfsops.c
> > +++ b/sys/ufs/ffs/ffs_vfsops.c
> > @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
> > 	cred = td ? td->td_ucred : NOCRED;
> > 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
> >
> > +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
> > 	dev = devvp->v_rdev;
> > 	dev_ref(dev);
> > +	if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) {
> > +		dev_rel(dev);
> > +		VOP_UNLOCK(devvp, 0);
> > +		return (EBUSY);
> > +	}
> 
> This is cleaner and safer than my version.
> 
> > 	DROP_GIANT();
> > 	g_topology_lock();
> > 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> 
> g_vfs_open() already sets devvp->v_bufobj.bo_ops to g_vfs_bufops unless
> it fails.  This clobbered our setting in the buggy multiple-mount case.
> But with multiple mounts not allowed, this cleans up any garbage in
> v_bufobj.
Yes, and this orders things.  g_vfs_open() should have devvp locked,
both for the bufobj manipulations and for the vnode_create_vobject() call.
We can only assign to bo_ops after g_vfs_open() has completed successfully.
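
So, under the vnode lock, the order in ffs_mountfs() becomes (a sketch
matching the patch below):

	/* devvp is locked exclusively by the caller */
	atomic_cmpset_acq_ptr(...);		/* claim dev->si_mountpt */
	g_vfs_open(devvp, &cp, "ffs", ...);	/* installs g_vfs_bufops */
	devvp->v_bufobj.bo_ops = &ffs_ops;	/* override, success only */
	VOP_UNLOCK(devvp, 0);			/* publish bo_ops change */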

> 
> g_vfs_open() has 2 failures for non-exclusive access.  It starts by
> checking v_bufobj.bo_private == devvp (this is after translating its
> pointers to the ones passed here).  This is avg's fix for the multiple-
> mounts problem (r206130).  It doesn't work in all cases.  I think this
> is unnecessary now.
At least it weeds out other devfs mounts.

> 
> Later, g_vfs_open() does a g_access() check for exclusive-enough access.
> This is supposed to allow multiple mounts at least when all are ro.  I
> thought that avg modified this, but he actually did something different.
> I think this check only failed in buggy cases where multiple mounts were
> allowed.  Our changes should make it never fail.  It still returns the
> wrong errno (some general one returned by g_access() instead of the one
> documented for mount() -- EBUSY).
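
(For reference, the access request in question looks roughly like the
line below; this is from memory of geom_vfs.c, so treat the exact
counts as an assumption:

	/* acr is always 1; acw and ace follow the wr argument */
	error = g_access(cp, 1, wr, wr);

so all-ro mounts are compatible with each other, while a rw mount takes
the exclusive count and conflicts with any other writer.)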
> 
> > 	g_topology_unlock();
> > 	PICKUP_GIANT();
> > -	VOP_UNLOCK(devvp, 0);
> 
> I don't like moving this below devvp accesses.  It locks devvp, not dev.
> 
> > -	if (error)
> > +	if (error != 0) {
> > +		VOP_UNLOCK(devvp, 0);
> > 		goto out;
> > -	if (devvp->v_rdev->si_iosize_max != 0)
> > -		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
> > +	}
> > +	if (dev->si_iosize_max != 0)
> > +		mp->mnt_iosize_max = dev->si_iosize_max;
> 
> dev->si_iosize_max is locked by its undocumented lifetime.  It is invariant
> since some previous time.
> 
> > 	if (mp->mnt_iosize_max > MAXPHYS)
> > 		mp->mnt_iosize_max = MAXPHYS;
> > -
> > 	devvp->v_bufobj.bo_ops = &ffs_ops;
> 
> This needs to be before the vnode unlock of course.
> 
> I don't like the complication to avoid setting this if g_vfs_open() fails,
> but this at least makes it obvious that we don't set it to garbage when
> g_vfs_open() fails.  In other error cases, and even after unmount, I think
> v_bufobj is left as garbage.
> 
> I now see another cleanup: don't goto out when g_vfs_open() fails.  This
> depends on it setting cp to NULL and leaving nothing to clean when it
> fails.  It has no man page and this detail is documented in its source
> code.
Then I would need to add another NULL assignment, a VOP_UNLOCK, etc.
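
I.e., declining the goto would mean duplicating the cleanup inline,
something like (a sketch of the alternative being declined):

	if (error != 0) {
		VOP_UNLOCK(devvp, 0);
		atomic_store_rel_ptr((uintptr_t *)&dev->si_mountpt, 0);
		dev_rel(dev);
		return (error);
	}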

> 
> > -	if (devvp->v_type == VCHR)
> > -		devvp->v_rdev->si_mountpt = mp;
> > +	VOP_UNLOCK(devvp, 0);
> >
> > 	fs = NULL;
> > 	sblockloc = 0;
> > @@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td)
> > out:
> > 	if (bp)
> > 		brelse(bp);
> > -	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
> > -		devvp->v_rdev->si_mountpt = NULL;
> > 	if (cp != NULL) {
> > 		DROP_GIANT();
> > 		g_topology_lock();
> > @@ -1102,6 +1105,7 @@ out:
> > 		free(ump, M_UFSMNT);
> > 		mp->mnt_data = NULL;
> > 	}
> > +	dev->si_mountpt = NULL;
> 
> This should remain before the vnode unlock.  Otherwise a new mount can
> fail unnecessarily.
See above.

> 
> > 	dev_rel(dev);
> > 	return (error);
> > }
> > @@ -1287,8 +1291,7 @@ ffs_unmount(mp, mntflags)
> > 	g_vfs_close(ump->um_cp);
> > 	g_topology_unlock();
> > 	PICKUP_GIANT();
> > -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> > -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> > +	ump->um_dev->si_mountpt = NULL;
> > 	vrele(ump->um_devvp);
> > 	dev_rel(ump->um_dev);
> 
> This order is better for avoiding unnecessary failures for new mounts, but
> now I'm not sure if it is right.  Anyway, it matters less to get an
> unnecessary failure for a new mount after a long-lived old mount than
> after a failed mount, so the cleanup shouldn't be stricter than here.
> 
> > 	mtx_destroy(UFS_MTX(ump));
> >

Updated patch to add acq/rel.

diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index 712fc21..670bb15 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
 	cred = td ? td->td_ucred : NOCRED;
 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
 
+	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
 	dev = devvp->v_rdev;
 	dev_ref(dev);
+	if (atomic_cmpset_acq_ptr((uintptr_t *)&dev->si_mountpt, 0, (uintptr_t)mp) == 0) {
+		dev_rel(dev);
+		VOP_UNLOCK(devvp, 0);
+		return (EBUSY);
+	}
 	DROP_GIANT();
 	g_topology_lock();
 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
 	g_topology_unlock();
 	PICKUP_GIANT();
-	VOP_UNLOCK(devvp, 0);
-	if (error)
+	if (error != 0) {
+		VOP_UNLOCK(devvp, 0);
 		goto out;
-	if (devvp->v_rdev->si_iosize_max != 0)
-		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
+	}
+	if (dev->si_iosize_max != 0)
+		mp->mnt_iosize_max = dev->si_iosize_max;
 	if (mp->mnt_iosize_max > MAXPHYS)
 		mp->mnt_iosize_max = MAXPHYS;
-
 	devvp->v_bufobj.bo_ops = &ffs_ops;
-	if (devvp->v_type == VCHR)
-		devvp->v_rdev->si_mountpt = mp;
+	VOP_UNLOCK(devvp, 0);
 
 	fs = NULL;
 	sblockloc = 0;
@@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td)
 out:
 	if (bp)
 		brelse(bp);
-	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
-		devvp->v_rdev->si_mountpt = NULL;
 	if (cp != NULL) {
 		DROP_GIANT();
 		g_topology_lock();
@@ -1102,6 +1105,7 @@ out:
 		free(ump, M_UFSMNT);
 		mp->mnt_data = NULL;
 	}
+	atomic_store_rel_ptr((uintptr_t *)&dev->si_mountpt, 0);
 	dev_rel(dev);
 	return (error);
 }
@@ -1287,8 +1291,7 @@ ffs_unmount(mp, mntflags)
 	g_vfs_close(ump->um_cp);
 	g_topology_unlock();
 	PICKUP_GIANT();
-	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
-		ump->um_devvp->v_rdev->si_mountpt = NULL;
+	atomic_store_rel_ptr((uintptr_t *)&ump->um_dev->si_mountpt, 0);
 	vrele(ump->um_devvp);
 	dev_rel(ump->um_dev);
 	mtx_destroy(UFS_MTX(ump));


