From owner-freebsd-fs@freebsd.org Thu May 19 23:27:49 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D4423B42384 for ; Thu, 19 May 2016 23:27:49 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id BF5321D43 for ; Thu, 19 May 2016 23:27:49 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id BAEA7B42381; Thu, 19 May 2016 23:27:49 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id BA8A3B4237E for ; Thu, 19 May 2016 23:27:49 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 6BED21D42 for ; Thu, 19 May 2016 23:27:48 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c110-21-42-169.carlnfd1.nsw.optusnet.com.au [110.21.42.169]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 967091049C54; Fri, 20 May 2016 09:27:39 +1000 (AEST) Date: Fri, 20 May 2016 09:27:38 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160519104128.GN89104@kib.kiev.ua> Message-ID: <20160520074427.W1151@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> <20160519104128.GN89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=kDyANCGC9fy361NNEb9EQQ==:117 a=kDyANCGC9fy361NNEb9EQQ==:17 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=138hlgp83k9Wl2frAKkA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 May 2016 23:27:49 -0000 On Thu, 19 May 2016, Konstantin Belousov wrote: > On Thu, May 19, 2016 at 12:20:19PM +1000, Bruce Evans wrote: >> On Thu, 19 May 2016, Bruce Evans wrote: >> >>> On Thu, 19 May 2016, Bruce Evans wrote: >>> >>>> ... >>>> I think the following works to prevent multiple mounts via all of the >>>> known buggy paths: early in every fsmount(): >> >> Here is a lightly tested version: > > There is no need to protect the si_mountpt with any locking, the field > itself serves as a lock good enough, also preventing the parallel mounts > of the same devices. I changed the assignement to atomic_cmpset, which > is enough there. It is somewhat pity that this would reliably disable > multiple ro mounts of the same volume. I used a mutex since it is simpler. I think your version needs atomic ops for resetting the pointer, and maybe acquire/release too. It has locking that is very similar to a mutex. Mutexes use _mtx_obtain_lock = atomic_cmpset_acq_ptr and _mtx_release_lock = atomic_store_rel_ptr. This is already delicately weak -- full sequential consistency is not required. Then on x86, we (you) only recently finished optimizing atomic_store_rel so that it is as weak as possible (just a compiler membar before an ordinary store). Maybe even weaker locking is enough here, but this is too hard to understand. > There is no need to move assignment of NULL to dev->si_mountpt later > in ffs_unmount(), the moment where the assignment is performed is safe > for other thread to start another mount. I already noticed that it was almost as late as possible (could be moved 1 statement later) and not worth moving. But to even reason about orders, you need atomic releases with acquire/ release semantics. There are dev_ref() and dev_rel() calls nearby. The implementation of these probably has to and in fact does give some ordering. The details are too hard to understand. In ffs_unmount() I think it is actually ordering given by vrele() that makes things work: > PICKUP_GIANT(); > - if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL) > - ump->um_devvp->v_rdev->si_mountpt = NULL; > + ump->um_dev->si_mountpt = NULL; > vrele(ump->um_devvp); > dev_rel(ump->um_dev); We want the store to si_mountpt to become visible before the vnode is unlocked. Otherwise, a new mount can lock the vnode and fail with EBUSY because it sees si_mountpt != NULL. We have to know implementation details of vrele() to know that this happens. > I still want to keep devvp locked for long enough to cover the bufobj > hacking, and I do not want to move bufobj.bo_ops change before > g_vfs_open() succeed. I didn't move it before g_vfs_open(), but before VOP_UNLOCK(). I think v_bufobj is never cleared, but garbage in it is harmless except in the multiple-mounts case which is now disallowed. > I also wanted to remove GIANT dances, but this requires geom patch, > which I will mail separately. OK. I saw the other mail. > diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c > index 712fc21..21425f5 100644 > --- a/sys/ufs/ffs/ffs_vfsops.c > +++ b/sys/ufs/ffs/ffs_vfsops.c > @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td) > cred = td ? td->td_ucred : NOCRED; > ronly = (mp->mnt_flag & MNT_RDONLY) != 0; > > + KASSERT(devvp->v_type == VCHR, ("reclaimed devvp")); > dev = devvp->v_rdev; > dev_ref(dev); > + if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) { > + dev_rel(dev); > + VOP_UNLOCK(devvp, 0); > + return (EBUSY); > + } This is cleaner and safer than my version. > DROP_GIANT(); > g_topology_lock(); > error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); g_vfs_open() already sets devvp->v_bufobj.bo_ops to g_vfs_bufops unless it fails. This clobbered our setting in the buggy multiple-mount case. But with multiple mounts not allowed, this cleans up any garbage in v_bufobj. g_vfs_open() has 2 failures for non-exclusive access. It starts by checking v_bufobj.bo_private == devvp (this is after translating its pointers to the ones passed here). This is avg's fix for the multiple- mounts problem (r206130). It doesn't work in all cases. I think this is unecessary now. Later, g_vfs_open() does a g_access() check for exclusive-enough access. This is supposed to allow multiple mounts at least when all are ro. I thought that avg modified this, but he actually did something different. I think this check only failed in buggy cases where multiple mounts were allowed. Our changes should make it never fail. It still returns the wrong errno (some general one return by g_access() instead of the one documented for mount() -- this is EBUSY). > g_topology_unlock(); > PICKUP_GIANT(); > - VOP_UNLOCK(devvp, 0); I don't like moving this below devvp accesses. It locks devvp, not dev. > - if (error) > + if (error != 0) { > + VOP_UNLOCK(devvp, 0); > goto out; > - if (devvp->v_rdev->si_iosize_max != 0) > - mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; > + } > + if (dev->si_iosize_max != 0) > + mp->mnt_iosize_max = dev->si_iosize_max; dev->si_iosize_max is locked by its undocumented lifetime. It is invariant since some previous time. > if (mp->mnt_iosize_max > MAXPHYS) > mp->mnt_iosize_max = MAXPHYS; > - > devvp->v_bufobj.bo_ops = &ffs_ops; This needs to be before the vnode unlock of course. I don't like the complication to avoid setting this if we g_vfs_open_fails, but this at least makes it obvious that we don't set it to garbage when g_vfs_open_fails. In other error cases, and even after unmount, I think v_bufobj is left as garbage. I now see another cleanup: don't goto out when g_vfs_open() fails. This depends on it setting cp to NULL and leaving nothing to clean when it fails. It has no man page and this detail is documented in its source code. > - if (devvp->v_type == VCHR) > - devvp->v_rdev->si_mountpt = mp; > + VOP_UNLOCK(devvp, 0); > > fs = NULL; > sblockloc = 0; > @@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td) > out: > if (bp) > brelse(bp); > - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) > - devvp->v_rdev->si_mountpt = NULL; > if (cp != NULL) { > DROP_GIANT(); > g_topology_lock(); > @@ -1102,6 +1105,7 @@ out: > free(ump, M_UFSMNT); > mp->mnt_data = NULL; > } > + dev->si_mountpt = NULL; This should remain before the vnode unlock. Otherwise a new mount can fail unnecessarily. > dev_rel(dev); > return (error); > } > @@ -1287,8 +1291,7 @@ ffs_unmount(mp, mntflags) > g_vfs_close(ump->um_cp); > g_topology_unlock(); > PICKUP_GIANT(); > - if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL) > - ump->um_devvp->v_rdev->si_mountpt = NULL; > + ump->um_dev->si_mountpt = NULL; > vrele(ump->um_devvp); > dev_rel(ump->um_dev); This order is better for avoiding unnecessary failure for new mounts, but now I'm not sure if it is right. Anyway, it matters less to get an unnecessary failures for a new mount after a long-lived old mount than after a failed mount, so the cleanup shouldn't be stricter than here. > mtx_destroy(UFS_MTX(ump)); > Bruce