From owner-freebsd-fs@freebsd.org  Wed May 18 00:00:42 2016
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id C5721B3FD81
 for <freebsd-fs@mailman.ysv.freebsd.org>; Wed, 18 May 2016 00:00:42 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3])
 by mx1.freebsd.org (Postfix) with ESMTP id AF019149C
 for <freebsd-fs@freebsd.org>; Wed, 18 May 2016 00:00:42 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: by mailman.ysv.freebsd.org (Postfix)
 id A98A8B3FD7B; Wed, 18 May 2016 00:00:42 +0000 (UTC)
Delivered-To: fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id A91AEB3FD78
 for <fs@mailman.ysv.freebsd.org>; Wed, 18 May 2016 00:00:42 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au
 [211.29.132.246])
 by mx1.freebsd.org (Postfix) with ESMTP id 991681475
 for <fs@freebsd.org>; Wed, 18 May 2016 00:00:40 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au
 (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109])
 by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id C347F428F06;
 Wed, 18 May 2016 10:00:31 +1000 (AEST)
Date: Wed, 18 May 2016 10:00:09 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
cc: fs@freebsd.org
Subject: Re: fix for per-mount i/o counting in ffs
In-Reply-To: <20160517220055.GF89104@kib.kiev.ua>
Message-ID: <20160518084931.T6534@besplex.bde.org>
References: <20160517072104.I2137@besplex.bde.org>
 <20160517084241.GY89104@kib.kiev.ua>
 <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org>
 <20160517220055.GF89104@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.1 cv=c+ZWOkJl c=1 sm=1 tr=0
 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10
 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=_FYEsZlHC8cxAzq_7eoA:9
 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 18 May 2016 00:00:42 -0000

On Wed, 18 May 2016, Konstantin Belousov wrote:

> On Wed, May 18, 2016 at 07:30:25AM +1000, Bruce Evans wrote:
>> Further cleanups:
>> - the null pointer check is bogus since we already dereferenced
>>    devvp->v_rdev.  We also assigned devvp->v_rdev to the variable
>>    dev but spelled out devvp->v_rdev in a couple of other places.
>> - the VCHR check is bogus since we only work for VCHR and have
>>    already checked for VCHR in vn_isdisk().
> No, these are not bogus.  The checks are incorrect because they are
> racy, but they are needed with the proper locking.  I intended to look
> at this tomorrow, since the fixes are not related to the current changes,
> but you forced me.

You are too efficient :-).

> VCHR check ensures that the devvp vnode is not reclaimed. I do not want
> to remove the check and rely on the caller of ffs_mountfs() to always do
> the right thing for it without unlocking devvp, this is too subtle.

Surely the caller must lock devvp?  Otherwise none of the uses of devvp
can be trusted, and there are several others.

> We are safe from devvp being reclaimed when io is in progress, since
> our reference prevents cdev memory from free, which ensures that v_rdev
> is valid if non-NULL. Unmount is not supposed to finish until all io is
> finished (but we had bugs there).
>>
>> Similarly in ffs_umount() except there is no dev variable there.
> There is ump->um_dev.

There is also ump->um_devvvp, but this seems to be unusable since it
might go away.

So using the devvp->v_rdev instead of the dev variable is not just a
style bug.

> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index 712fc21..da61c15 100644
> --- a/sys/ufs/ffs/ffs_vfsops.c
> +++ b/sys/ufs/ffs/ffs_vfsops.c
> @@ -771,17 +771,18 @@ ffs_mountfs(devvp, mp, td)
> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	VOP_UNLOCK(devvp, 0);
> -	if (error)
> +	if (error) {
> +		VOP_UNLOCK(devvp, 0);
> 		goto out;
> -	if (devvp->v_rdev->si_iosize_max != 0)
> +	}
> +	if (dev->si_iosize_max != 0)
> 		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
> 	if (mp->mnt_iosize_max > MAXPHYS)
> 		mp->mnt_iosize_max = MAXPHYS;
> -
> 	devvp->v_bufobj.bo_ops = &ffs_ops;
> 	if (devvp->v_type == VCHR)

devvp must be still VCHR since this is now under the vnode lock, and we
depend on dev remaining a character device for the disk described by
devvp at the time of the vn_isdisk() check.

> -		devvp->v_rdev->si_mountpt = mp;
> +		dev->si_mountpt = mp;
> +	VOP_UNLOCK(devvp, 0);

The unlocking could be a little earlier since dev is still for a disk even
if devvp went away and you changed this to not used devvp->v_rdev.

>
> 	fs = NULL;
> 	sblockloc = 0;

Unlocking and then using devvp sure looks like a race.

You only needed to move the unlocking to fix.  devvp->v_bufobj.  How does
that work?  The write is now locked, but if devvp goes away, then don't
we lose its bufobj?

> @@ -1083,8 +1084,10 @@ ffs_mountfs(devvp, mp, td)
> out:
> 	if (bp)
> 		brelse(bp);
> +	VOP_LOCK(devvp, LK_EXCLUSIVE | LK_RETRY);
> 	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
> 		devvp->v_rdev->si_mountpt = NULL;
> +	VOP_UNLOCK(devvp, 0);
> 	if (cp != NULL) {
> 		DROP_GIANT();
> 		g_topology_lock();

Why not just dev->si_mountpt = NULL unconditionally?  We must do this
even if devvp went away, and we can easily do it using dev alone, as
above.

> @@ -1287,9 +1290,11 @@ ffs_unmount(mp, mntflags)
> 	g_vfs_close(ump->um_cp);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> -	vrele(ump->um_devvp);
> +	VOP_LOCK(ump->um_devvp, LK_EXCLUSIVE | LK_RETRY);
> +	if (ump->um_devvp->v_type == VCHR &&
> +	    ump->um_devvp->v_rdev == ump->um_dev)
> +		ump->um_dev->si_mountpt = NULL;
> +	vput(ump->um_devvp);

As above.  We don't care if um_devvp went away, at least for clearing
si_mountpt, and must use ump->um_dev to clear si_mountpt.

> 	dev_rel(ump->um_dev);

Presumably ump->um_dev was reference throughout until here, and this is
the only thing keeping the device from going away too.

> 	mtx_destroy(UFS_MTX(ump));
> 	if (mp->mnt_gjprovider != NULL) {
>

How does any use of ump->um_devvp work?

I tried revoke(2) on the devvp of a mounted file system.  This worked
to give v_type = VBAD and v_rdev = NULL, but didn't crash.  ffs_unmount()
checked for the bad vnode, unlike most places, and failed to clear
si_mountpt.

Normal use doesn't have revokes, but if the vnode is reclaimed instead
of just becoming bad, then worse things probably happen.  I think vnode
cache resizing gives very unstable storage so the pointer becomes very
invalid.  But even revoke followed by setting kern.numvnodes to 1 didn't
crash (15 vnodes remained).  So devvp must be referenced throughout.
It seems to have reference count 2, since umounting reduced kern.numvnodes
from 15 to 13.  (It is surprising how much works with kern.maxvnodes=1.
I was able to run revoke, sysctl and umount.)  It is still a mystery
that the VBAD vnode doesn't crash soon.

Bruce