From owner-freebsd-fs@freebsd.org Wed May 18 23:03:27 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id AA0A0B41494 for ; Wed, 18 May 2016 23:03:27 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 95EC516D8 for ; Wed, 18 May 2016 23:03:27 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 953BCB41493; Wed, 18 May 2016 23:03:27 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 94DC5B41492 for ; Wed, 18 May 2016 23:03:27 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id 43CEE16D5 for ; Wed, 18 May 2016 23:03:26 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id A41E6D6568E; Thu, 19 May 2016 09:03:21 +1000 (AEST) Date: Thu, 19 May 2016 09:03:19 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160518110834.GJ89104@kib.kiev.ua> Message-ID: <20160519065714.H1393@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=FF35ox4yeJKPj-cm5okA:9 a=ZiNstujT2j9NvtWX:21 a=EXmI02U5j4sSZ_lM:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 23:03:27 -0000 On Wed, 18 May 2016, Konstantin Belousov wrote: > On Wed, May 18, 2016 at 10:00:09AM +1000, Bruce Evans wrote: >> On Wed, 18 May 2016, Konstantin Belousov wrote: >>> VCHR check ensures that the devvp vnode is not reclaimed. I do not want >>> to remove the check and rely on the caller of ffs_mountfs() to always do >>> the right thing for it without unlocking devvp, this is too subtle. >> >> Surely the caller must lock devvp? Otherwise none of the uses of devvp >> can be trusted, and there are several others. > It must lock, but the interface of ffs_mountfs() would then require > that there is no relock between vn_isdisk() check and call. > > I think I know how to make a good compromise there. I converted the > check for VCHR into the assert. But it is very clear there is no re-lock, and that there must be no re-lock to work ("very clear" relative other complications). ffs_mountfs() is only called once and only exists to make the function more readable and debuggable (and auto-inlining it breaks debugging). Its nearby logic is: namei(); // lock vnode vn_isdisk(); // return if not if (MNT_UPDATE) fail_sometimes(); // locking problems -- see below else ffs_mountfs(); // clearly guaranteed still VCHR I found another locking problem for revoke. After mounting /i and revoking its device, mount -u fails. This is clearly because its rdev has gone away. This makes devvp->v_rdev != ump->um_devvp->v_rdev. The new devvp has the old rdev and the old devvp has a null rdev. This is not really a locking problem, but the correct behaviour. Most places just don't check. >> There is also ump->um_devvvp, but this seems to be unusable since it >> might go away. > Go away as in being reclaimed, yes. The vnode itself is there, since > we keep a reference. I think "reclaimed" is the wrong terminology. The reference prevents it being reclaimed by vnlrureclaim(), but doesn't prevent it being revoked (or vgone()d by a forced unmount of the devfs instance that it is on). The reference prevents it being reclaimed even if it is revoked. When it is revoked, some but apparently not all of the pointers in it are cleared or become garbage. None of them should be used, but some are. v_rdev is cleared and we are fairly careful not to follow it, but we depend on it being cleared and not garbage. Pointers that are not cleared include v_bufobj (apparently) and GEOM's hooks related to v_bufobj, and si_mountpt. si_mountpt is in the cdev and not in the vnode. >> So using the devvp->v_rdev instead of the dev variable is not just a >> style bug. > Might be. In some places. ump->um_devvp->v_rdev gives the old rdev, and devvp->v_rdev gives the current rdev provided devvp is locked. These can be compared to see if the old rdev was revoked. Otherwise, devvp->v_rdev is garbage and both ump->um_dev and ump->um_devvp are close to garbage -- they are both old and the only correct use of this is to check that they are still current, but then you have the current devvp (locked) and can use it instead. >> ... >> You only needed to move the unlocking to fix. devvp->v_bufobj. How does >> that work? The write is now locked, but if devvp goes away, then don't >> we lose its bufobj? > The buffer queues are flushed, and BO_DEAD flag is set. But the flag > does very little. > >> How does any use of ump->um_devvp work? The problems are similar to the ones with ttys that we are still working on. When the device is revoked, there may be many i/o's in progess on it. We don't want to block waiting for these, but they should be aborted before doing any more. But there are enough stale pointers to even allow new i/o's. Enough for tar cf of a complete small file system. >> I tried revoke(2) on the devvp of a mounted file system. This worked >> to give v_type = VBAD and v_rdev = NULL, but didn't crash. ffs_unmount() >> checked for the bad vnode, unlike most places, and failed to clear >> si_mountpt. >> >> Normal use doesn't have revokes, but if the vnode is reclaimed instead >> of just becoming bad, then worse things probably happen. I think vnode >> cache resizing gives very unstable storage so the pointer becomes very >> invalid. But even revoke followed by setting kern.numvnodes to 1 didn't >> crash (15 vnodes remained). So devvp must be referenced throughout. >> It seems to have reference count 2, since umounting reduced kern.numvnodes >> from 15 to 13. (It is surprising how much works with kern.maxvnodes=1. >> I was able to run revoke, sysctl and umount.) It is still a mystery >> that the VBAD vnode doesn't crash soon. >> > I believe that bo_ops assignment is the reason why UFS mounts survive the > reclamation of the devvp vnode. Take a look at the ffs_geom_strategy(), > which is the place where UFS io is tunneled directly into geom. It does > not pass io requests through devfs. As result, revocation does not > change much except doing unneccessary buf queue flush. > > It might be telling to try the same experiment, as conducted in your > next message, on msdosfs instead of UFS. Everything seems to work exactly the same for msdosfs. I retried: - mount; tar; revoke; mount; tar # (2nd mount succeeds due to revoke) - mount; tar; mount-another-devfs; mount-using-other-devfs; tar # (2nd mount succeeds due to separate devvp). No crashes. I didn't risk any rw mounts. > Below is the simplified patch. > > diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c > index 712fc21..412b000 100644 > --- a/sys/ufs/ffs/ffs_vfsops.c > +++ b/sys/ufs/ffs/ffs_vfsops.c > @@ -764,6 +764,7 @@ ffs_mountfs(devvp, mp, td) > cred = td ? td->td_ucred : NOCRED; > ronly = (mp->mnt_flag & MNT_RDONLY) != 0; > > + KASSERT(devvp->v_type == VCHR, ("reclaimed devvp")); > dev = devvp->v_rdev; > dev_ref(dev); > DROP_GIANT(); Not needed. > @@ -771,17 +772,17 @@ ffs_mountfs(devvp, mp, td) > error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); > g_topology_unlock(); > PICKUP_GIANT(); > - VOP_UNLOCK(devvp, 0); > - if (error) > + if (error != 0) { > + VOP_UNLOCK(devvp, 0); > goto out; Needed. > - if (devvp->v_rdev->si_iosize_max != 0) > - mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; > + } > + if (dev->si_iosize_max != 0) > + mp->mnt_iosize_max = dev->si_iosize_max; > if (mp->mnt_iosize_max > MAXPHYS) > mp->mnt_iosize_max = MAXPHYS; > - > devvp->v_bufobj.bo_ops = &ffs_ops; > - if (devvp->v_type == VCHR) > - devvp->v_rdev->si_mountpt = mp; > + dev->si_mountpt = mp; > + VOP_UNLOCK(devvp, 0); > > fs = NULL; > sblockloc = 0; I would keep the unlock as early as possible. Just move the initialization of v_bufobj before it. BTW, I don't like the fixup for > MAXPHYS. This is removed from all file systems in my version. dev->si_iosize_max should be clamped to MAXPHYS unless larger sizes work, and if larger sizes work then individual file systems don't know enough to kill using them. The check for si_iosize_max != 0 is bogus too, but not removed in my version. mp->mnt_iosize_max defaults to DFLTPHYS and the check avoids changing that, but if si_iosize_max remains at 0 then i/o won't actually work, and if some bug results in si_iosize_max being initialized later but early enough for some i/o to work, then the default of DFLTPHYS still won't work if it is larger than the driver size. g_dev_taste() actually defaults si_iosize_max to MAXPHYS and I think GEOM hides the driver iosize_max from file systems so I think si_iosize_max is actually always MAXPHYS here. > @@ -1083,8 +1084,7 @@ ffs_mountfs(devvp, mp, td) > out: > if (bp) > brelse(bp); > - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) > - devvp->v_rdev->si_mountpt = NULL; > + dev->si_mountpt = NULL; > if (cp != NULL) { > DROP_GIANT(); > g_topology_lock(); I think this is still racy, but the race is more harmless than most of the problems from revokes. I think the following can happen: - after we unlock, another mount starts and and clobbers our si_mountpt with a nonzero value. Then this clobbers the other mount's si_mountpt with a zero value. The zero value is relatively harmless. It takes either a revoke, a separate devfs instance, or the old multiple-mount- allowing code for another mount to start. The old code has a smaller race window: - since the vnode is unlocked, it gives a null pointer panic if the v_rdev becomes null after it is tested to be non-null, or if it is still non-null then using it may clobber another mount's si_mountpt if the other mount set si_mountpt races with us. It takes a revoke to get the null pointer. Clobbering only takes a separate devfs instance or the old multiple-mount code. > @@ -1287,8 +1287,7 @@ ffs_unmount(mp, mntflags) > g_vfs_close(ump->um_cp); > g_topology_unlock(); > PICKUP_GIANT(); > - if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL) > - ump->um_devvp->v_rdev->si_mountpt = NULL; > + ump->um_dev->si_mountpt = NULL; > vrele(ump->um_devvp); > dev_rel(ump->um_dev); > mtx_destroy(UFS_MTX(ump)); This has the same problems as cleaning up after an error in mount. I think the following works to prevent multiple mounts via all of the known buggy paths: early in every fsmount(): dev = devvp->v_rdev; if (dev->si_mountpt != NULL) { cleanup(); return (EBUSY); } dev->si_mountpt = mp; This also prevents other mounts racing with us before we complete. Too bad if we fail but the other mount would have succeeded. In fsunmount(), move clearing si_mountpt to near the end. I hope si_mountpt is locked by the device reference and that this makes si_mountpt robust enough to use as an exclusive access flag. GEOM's exclusive access counters somehow don't prevent the multiple mounts. I think they are too closely associated with the vnode via v_bufobj. Bruce