From owner-freebsd-fs@freebsd.org  Thu May 19 23:27:49 2016
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id D4423B42384
 for <freebsd-fs@mailman.ysv.freebsd.org>; Thu, 19 May 2016 23:27:49 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org
 [IPv6:2001:1900:2254:206a::50:5])
 by mx1.freebsd.org (Postfix) with ESMTP id BF5321D43
 for <freebsd-fs@freebsd.org>; Thu, 19 May 2016 23:27:49 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: by mailman.ysv.freebsd.org (Postfix)
 id BAEA7B42381; Thu, 19 May 2016 23:27:49 +0000 (UTC)
Delivered-To: fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id BA8A3B4237E
 for <fs@mailman.ysv.freebsd.org>; Thu, 19 May 2016 23:27:49 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au
 [211.29.132.249])
 by mx1.freebsd.org (Postfix) with ESMTP id 6BED21D42
 for <fs@freebsd.org>; Thu, 19 May 2016 23:27:48 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c110-21-42-169.carlnfd1.nsw.optusnet.com.au
 [110.21.42.169])
 by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 967091049C54;
 Fri, 20 May 2016 09:27:39 +1000 (AEST)
Date: Fri, 20 May 2016 09:27:38 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
cc: fs@freebsd.org
Subject: Re: fix for per-mount i/o counting in ffs
In-Reply-To: <20160519104128.GN89104@kib.kiev.ua>
Message-ID: <20160520074427.W1151@besplex.bde.org>
References: <20160517072104.I2137@besplex.bde.org>
 <20160517084241.GY89104@kib.kiev.ua>
 <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org>
 <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org>
 <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org>
 <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org>
 <20160519104128.GN89104@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0
 a=kDyANCGC9fy361NNEb9EQQ==:117 a=kDyANCGC9fy361NNEb9EQQ==:17
 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10
 a=138hlgp83k9Wl2frAKkA:9 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 19 May 2016 23:27:49 -0000

On Thu, 19 May 2016, Konstantin Belousov wrote:

> On Thu, May 19, 2016 at 12:20:19PM +1000, Bruce Evans wrote:
>> On Thu, 19 May 2016, Bruce Evans wrote:
>>
>>> On Thu, 19 May 2016, Bruce Evans wrote:
>>>
>>>>  ...
>>>> I think the following works to prevent multiple mounts via all of the
>>>> known buggy paths: early in every fsmount():
>>
>> Here is a lightly tested version:
>
> There is no need to protect the si_mountpt with any locking, the field
> itself serves as a lock good enough, also preventing the parallel mounts
> of the same devices.  I changed the assignement to atomic_cmpset, which
> is enough there.  It is somewhat pity that this would reliably disable
> multiple ro mounts of the same volume.

I used a mutex since it is simpler.

I think your version needs atomic ops for resetting the pointer, and
maybe acquire/release too.  It has locking that is very similar to a
mutex.  Mutexes use _mtx_obtain_lock = atomic_cmpset_acq_ptr and
_mtx_release_lock = atomic_store_rel_ptr.  This is already delicately
weak -- full sequential consistency is not required.  Then on x86,
we (you) only recently finished optimizing atomic_store_rel so that
it is as weak as possible (just a compiler membar before an ordinary
store).

Maybe even weaker locking is enough here, but this is too hard to
understand.

> There is no need to move assignment of NULL to dev->si_mountpt later
> in ffs_unmount(), the moment where the assignment is performed is safe
> for other thread to start another mount.

I already noticed that it was almost as late as possible (could be moved
1 statement later) and not worth moving.

But to even reason about orders, you need atomic releases with acquire/
release semantics.  There are dev_ref() and dev_rel() calls nearby.  The
implementation of these probably has to and in fact does give some ordering.
The details are too hard to understand.  In ffs_unmount() I think it is
actually ordering given by vrele() that makes things work:

> 	PICKUP_GIANT();
> -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> +	ump->um_dev->si_mountpt = NULL;
> 	vrele(ump->um_devvp);
> 	dev_rel(ump->um_dev);

We want the store to si_mountpt to become visible before the vnode is
unlocked.  Otherwise, a new mount can lock the vnode and fail with
EBUSY because it sees si_mountpt != NULL.  We have to know implementation
details of vrele() to know that this happens.

> I still want to keep devvp locked for long enough to cover the bufobj
> hacking, and I do not want to move bufobj.bo_ops change before
> g_vfs_open() succeed.

I didn't move it before g_vfs_open(), but before VOP_UNLOCK().  I think
v_bufobj is never cleared, but garbage in it is harmless except in the
multiple-mounts case which is now disallowed.

> I also wanted to remove GIANT dances, but this requires geom patch,
> which I will mail separately.

OK.  I saw the other mail.

> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index 712fc21..21425f5 100644
> --- a/sys/ufs/ffs/ffs_vfsops.c
> +++ b/sys/ufs/ffs/ffs_vfsops.c
> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
> 	cred = td ? td->td_ucred : NOCRED;
> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>
> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
> 	dev = devvp->v_rdev;
> 	dev_ref(dev);
> +	if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) {
> +		dev_rel(dev);
> +		VOP_UNLOCK(devvp, 0);
> +		return (EBUSY);
> +	}

This is cleaner and safer than my version.

> 	DROP_GIANT();
> 	g_topology_lock();
> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);

g_vfs_open() already sets devvp->v_bufobj.bo_ops to g_vfs_bufops unless
it fails.  This clobbered our setting in the buggy multiple-mount case.
But with multiple mounts not allowed, this cleans up any garbage in
v_bufobj.

g_vfs_open() has 2 failures for non-exclusive access.  It starts by
checking v_bufobj.bo_private == devvp (this is after translating its
pointers to the ones passed here).  This is avg's fix for the multiple-
mounts problem (r206130).  It doesn't work in all cases.  I think this
is unecessary now.

Later, g_vfs_open() does a g_access() check for exclusive-enough access.
This is supposed to allow multiple mounts at least when all are ro.  I
thought that avg modified this, but he actually did something different.
I think this check only failed in buggy cases where multiple mounts were
allowed.  Our changes should make it never fail.  It still returns the
wrong errno (some general one return by g_access() instead of the one
documented for mount() --  this is EBUSY).

> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	VOP_UNLOCK(devvp, 0);

I don't like moving this below devvp accesses.  It locks devvp, not dev.

> -	if (error)
> +	if (error != 0) {
> +		VOP_UNLOCK(devvp, 0);
> 		goto out;
> -	if (devvp->v_rdev->si_iosize_max != 0)
> -		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
> +	}
> +	if (dev->si_iosize_max != 0)
> +		mp->mnt_iosize_max = dev->si_iosize_max;

dev->si_iosize_max is locked by its undocumented lifetime.  It is invariant
since some previous time.

> 	if (mp->mnt_iosize_max > MAXPHYS)
> 		mp->mnt_iosize_max = MAXPHYS;
> -
> 	devvp->v_bufobj.bo_ops = &ffs_ops;

This needs to be before the vnode unlock of course.

I don't like the complication to avoid setting this if we g_vfs_open_fails,
but this at least makes it obvious that we don't set it to garbage when
g_vfs_open_fails.  In other error cases, and even after unmount, I think
v_bufobj is left as garbage.

I now see another cleanup: don't goto out when g_vfs_open() fails.  This
depends on it setting cp to NULL and leaving nothing to clean when it
fails.  It has no man page and this detail is documented in its source
code.

> -	if (devvp->v_type == VCHR)
> -		devvp->v_rdev->si_mountpt = mp;
> +	VOP_UNLOCK(devvp, 0);
>
> 	fs = NULL;
> 	sblockloc = 0;
> @@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td)
> out:
> 	if (bp)
> 		brelse(bp);
> -	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
> -		devvp->v_rdev->si_mountpt = NULL;
> 	if (cp != NULL) {
> 		DROP_GIANT();
> 		g_topology_lock();
> @@ -1102,6 +1105,7 @@ out:
> 		free(ump, M_UFSMNT);
> 		mp->mnt_data = NULL;
> 	}
> +	dev->si_mountpt = NULL;

This should remain before the vnode unlock.  Otherwise a new mount can
fail unnecessarily.

> 	dev_rel(dev);
> 	return (error);
> }
> @@ -1287,8 +1291,7 @@ ffs_unmount(mp, mntflags)
> 	g_vfs_close(ump->um_cp);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> +	ump->um_dev->si_mountpt = NULL;
> 	vrele(ump->um_devvp);
> 	dev_rel(ump->um_dev);

This order is better for avoiding unnecessary failure for new mounts, but
now I'm not sure if it is right.  Anyway, it matters less to get an
unnecessary failures for a new mount after a long-lived old mount than
after a failed mount, so the cleanup shouldn't be stricter than here.

> 	mtx_destroy(UFS_MTX(ump));
>

Bruce