Date: Fri, 8 Jun 2018 20:37:55 +0300 From: Konstantin Belousov <kostikbel@gmail.com> To: Ryan Libby <rlibby@freebsd.org> Cc: Mateusz Guzik <mjguzik@gmail.com>, Mark Johnston <markj@freebsd.org>, Justin Hibbits <jhibbits@freebsd.org>, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r334708 - head/sys/kern Message-ID: <20180608173755.GJ2450@kib.kiev.ua> In-Reply-To: <CAHgpiFzMQjHfnQLQrWc86FSxB2veHZeAc44qmROkaJugpGoU=g@mail.gmail.com> References: <201806061257.w56CvCwq089369@repo.freebsd.org> <20180606140311.GU2450@kib.kiev.ua> <20180608033242.GA54099@pesky> <CAHgpiFyOQf6B3oGFGMz3avXqrrP0i6Puy9HqLER1XG5xE67BeQ@mail.gmail.com> <CAGudoHENGpCxn_omxfaRLOAH5fP6qdFcmmqZ7He%2BpcC=-1HFKQ@mail.gmail.com> <CAHgpiFzMQjHfnQLQrWc86FSxB2veHZeAc44qmROkaJugpGoU=g@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, Jun 07, 2018 at 11:02:29PM -0700, Ryan Libby wrote: > On Thu, Jun 7, 2018 at 10:03 PM, Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Fri, Jun 8, 2018 at 6:29 AM, Ryan Libby <rlibby@freebsd.org> wrote: > >> > >> On Thu, Jun 7, 2018 at 8:32 PM, Mark Johnston <markj@freebsd.org> wrote: > >> > On Wed, Jun 06, 2018 at 05:03:11PM +0300, Konstantin Belousov wrote: > >> >> On Wed, Jun 06, 2018 at 12:57:12PM +0000, Justin Hibbits wrote: > >> >> > Author: jhibbits > >> >> > Date: Wed Jun 6 12:57:11 2018 > >> >> > New Revision: 334708 > >> >> > URL: https://svnweb.freebsd.org/changeset/base/334708 > >> >> > > >> >> > Log: > >> >> > Add a memory barrier after taking a reference on the vnode holdcnt > >> >> > in _vhold > >> >> > > >> >> > This is needed to avoid a race between the VNASSERT() below, and > >> >> > another > >> >> > thread updating the VI_FREE flag, on weakly-ordered architectures. > >> >> > > >> >> > On a 72-thread POWER9, without this barrier a 'make -j72 > >> >> > buildworld' would > >> >> > panic on the assert regularly. > >> >> > > >> >> > It may be possible to use a weaker barrier, and I'll investigate > >> >> > that once > >> >> > all stability issues are worked out on POWER9. > >> >> > > >> >> > Modified: > >> >> > head/sys/kern/vfs_subr.c > >> >> > > >> >> > Modified: head/sys/kern/vfs_subr.c > >> >> > > >> >> > ============================================================================== > >> >> > --- head/sys/kern/vfs_subr.c Wed Jun 6 10:46:24 2018 > >> >> > (r334707) > >> >> > +++ head/sys/kern/vfs_subr.c Wed Jun 6 12:57:11 2018 > >> >> > (r334708) > >> >> > @@ -2807,6 +2807,9 @@ _vhold(struct vnode *vp, bool locked) > >> >> > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > >> >> > if (!locked) { > >> >> > if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) { > >> >> > +#if !defined(__amd64__) && !defined(__i386__) > >> >> > + mb(); > >> >> > +#endif > >> >> > VNASSERT((vp->v_iflag & VI_FREE) == 0, vp, > >> >> > ("_vhold: vnode with holdcnt is free")); > >> >> > return; > >> >> First, mb() must not be used in the FreeBSD code at all. > >> >> Look at atomic_thread_fenceXXX(9) KPI. > >> >> > >> >> Second, you need the reciprocal fence between clearing of VI_FREE and > >> >> refcount_acquire(), otherwise the added barrier is nop. Most likely, > >> >> you got away with it because there is a mutex unlock between clearing > >> >> of VI_FREE and acquire, which release semantic you abused. > >> > > >> > I note that vnlru_free_locked() clears VI_FREE and increments v_holdcnt > >> > without an intervening release fence. At this point the caller has not > >> > purged the vnode from the name cache, so it seems possible that the > >> > panicking thread observed the two stores out of order. In particular, it > >> > seems to me that the patch below is necessary, but possibly (probably?) > >> > not sufficient: > >> > >> Mark, Justin, and I looked at this. > >> > >> I think that the VNASSERT itself is not quite valid, since it is > >> checking the value of v_iflag without the vnode interlock held (and > >> without the vop lock also). If the rule is that we normally need the > >> vnode interlock to check VI_FREE, then I don't think that possible > >> reordering between the refcount_acquire and VI_FREE clearing in > >> vnlru_free_locked is necessarily a problem in production. > > > > > > Checking it without any locks is perfectly valid in this case. It is done > > after v_holdcnt gets bumped from a non-zero value. So at that time it > > is at least two. Of course that result is stale as an arbitrary number of > > other threads could have bumped and dropped the ref past that point. > > The minimum value is 1 since we hold the ref. But this means the > > vnode must not be on the free list and that's what the assertion is > > verifying. > > > > The problem is indeed lack of ordering against the code clearing the > > flag for the case where 2 threads to vhold and one does the 0->1 > > transition. > > > > That said, the fence is required for the assertion to work. > > > > Yeah, I agree with this logic. What I mean is that reordering between > v_holdcnt 0->1 and v_iflag is normally settled by the release and > acquisition of the vnode interlock, which we are supposed to hold for > v_*i*flag. A quick scan seems to show all of the checks of VI_FREE that > are not asserts do hold the vnode interlock. So, I'm just saying that I > don't think the possible reordering affects them. But do we know that only VI_FREE checks are affected ? My concern is that users of _vhold() rely on seeing up to date state of the vnode, and VI_FREE is only an example of the problem. Most likely, the code which fetched the vnode pointer before _vhold() call, should guarantee visibility. > > >> > >> > >> It might just be that unlocked assertions about v_iflag actually need > >> additional synchronization (although it would be a little unfortunate to > >> add synchronization only to INVARIANTS builds). > >> > >> > > >> >> Does the fence needed for the non-invariants case ? > >> >> > >> >> Fourth, doesn't v_usecount has the same issues WRT inactivation ? > >> > > >> > diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c > >> > index 286a871c3631..c97a8ba63612 100644 > >> > --- a/sys/kern/vfs_subr.c > >> > +++ b/sys/kern/vfs_subr.c > >> > @@ -1018,6 +1018,7 @@ vnlru_free_locked(int count, struct vfsops > >> > *mnt_op) > >> > */ > >> > freevnodes--; > >> > vp->v_iflag &= ~VI_FREE; > >> > + atomic_thread_fence_rel(); > > > > > > This probably has no added value for non-debug kernels. > > > >> > >> > refcount_acquire(&vp->v_holdcnt); > >> > > >> > mtx_unlock(&vnode_free_list_mtx); > >> > @@ -2807,9 +2808,7 @@ _vhold(struct vnode *vp, bool locked) > >> > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > >> > if (!locked) { > >> > if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) { > >> > -#if !defined(__amd64__) && !defined(__i386__) > >> > - mb(); > >> > -#endif > >> > + atomic_thread_fence_acq(); > > > > > > Same as above. > > > >> > >> > VNASSERT((vp->v_iflag & VI_FREE) == 0, vp, > >> > ("_vhold: vnode with holdcnt is free")); > >> > return; > > > > > > The main codepath which runs into this (... -> cache_lookup -> vhold) most > > definitely does not need the fence for production operation. > > > > What is unclear without audit is whether there are vhold users which need > > one. I think the patch is fine to go in if the other VI_FREE place gets a > > comment about a fence stemming from mtx_unlock and there is another one > > prior to the assertion explaining that this orders against v_iflag tests and > > may > > or may not be needed for other consumers. > > > > In general the code is *full* of data races and accidental reliance of > > ordering > > provided by amd64. Weeding this all out will be a painful exercise. > > > > Part of the problem is lack of primitives like READ_ONCE/WRITE_ONCE as > > seen in the linux kernel, someone should hack up equivalents. > > > > -- > > Mateusz Guzik <mjguzik gmail.com> > > Yeah, I think we should understand whether the problem is just the > unsynchronized assertions or whether there are some true bugs. It > doesn't seem great to me to add synchronization which as you said > > > This probably has no added value for non-debug kernels.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180608173755.GJ2450>