Date: Fri, 8 Jun 2018 07:03:04 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: Ryan Libby <rlibby@freebsd.org>
Cc: Mark Johnston <markj@freebsd.org>, Konstantin Belousov <kostikbel@gmail.com>,
    Justin Hibbits <jhibbits@freebsd.org>, src-committers <src-committers@freebsd.org>,
    svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r334708 - head/sys/kern
Message-ID: <CAGudoHENGpCxn_omxfaRLOAH5fP6qdFcmmqZ7He%2BpcC=-1HFKQ@mail.gmail.com>
In-Reply-To: <CAHgpiFyOQf6B3oGFGMz3avXqrrP0i6Puy9HqLER1XG5xE67BeQ@mail.gmail.com>
References: <201806061257.w56CvCwq089369@repo.freebsd.org> <20180606140311.GU2450@kib.kiev.ua> <20180608033242.GA54099@pesky> <CAHgpiFyOQf6B3oGFGMz3avXqrrP0i6Puy9HqLER1XG5xE67BeQ@mail.gmail.com>
On Fri, Jun 8, 2018 at 6:29 AM, Ryan Libby <rlibby@freebsd.org> wrote:
> On Thu, Jun 7, 2018 at 8:32 PM, Mark Johnston <markj@freebsd.org> wrote:
> > On Wed, Jun 06, 2018 at 05:03:11PM +0300, Konstantin Belousov wrote:
> >> On Wed, Jun 06, 2018 at 12:57:12PM +0000, Justin Hibbits wrote:
> >> > Author: jhibbits
> >> > Date: Wed Jun  6 12:57:11 2018
> >> > New Revision: 334708
> >> > URL: https://svnweb.freebsd.org/changeset/base/334708
> >> >
> >> > Log:
> >> >   Add a memory barrier after taking a reference on the vnode holdcnt in _vhold
> >> >
> >> >   This is needed to avoid a race between the VNASSERT() below, and another
> >> >   thread updating the VI_FREE flag, on weakly-ordered architectures.
> >> >
> >> >   On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
> >> >   panic on the assert regularly.
> >> >
> >> >   It may be possible to use a weaker barrier, and I'll investigate that once
> >> >   all stability issues are worked out on POWER9.
> >> >
> >> > Modified:
> >> >   head/sys/kern/vfs_subr.c
> >> >
> >> > Modified: head/sys/kern/vfs_subr.c
> >> > ==============================================================================
> >> > --- head/sys/kern/vfs_subr.c   Wed Jun  6 10:46:24 2018   (r334707)
> >> > +++ head/sys/kern/vfs_subr.c   Wed Jun  6 12:57:11 2018   (r334708)
> >> > @@ -2807,6 +2807,9 @@ _vhold(struct vnode *vp, bool locked)
> >> >      CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
> >> >      if (!locked) {
> >> >              if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) {
> >> > +#if !defined(__amd64__) && !defined(__i386__)
> >> > +                    mb();
> >> > +#endif
> >> >                      VNASSERT((vp->v_iflag & VI_FREE) == 0, vp,
> >> >                          ("_vhold: vnode with holdcnt is free"));
> >> >                      return;
> >>
> >> First, mb() must not be used in the FreeBSD code at all.
> >> Look at the atomic_thread_fenceXXX(9) KPI.
> >>
> >> Second, you need the reciprocal fence between the clearing of VI_FREE
> >> and refcount_acquire(), otherwise the added barrier is a nop. Most
> >> likely, you got away with it because there is a mutex unlock between
> >> the clearing of VI_FREE and the acquire, whose release semantic you
> >> abused.
> >
> > I note that vnlru_free_locked() clears VI_FREE and increments v_holdcnt
> > without an intervening release fence. At this point the caller has not
> > purged the vnode from the name cache, so it seems possible that the
> > panicking thread observed the two stores out of order. In particular, it
> > seems to me that the patch below is necessary, but possibly (probably?)
> > not sufficient:
>
> Mark, Justin, and I looked at this.
>
> I think that the VNASSERT itself is not quite valid, since it is
> checking the value of v_iflag without the vnode interlock held (and
> without the vop lock also). If the rule is that we normally need the
> vnode interlock to check VI_FREE, then I don't think that possible
> reordering between the refcount_acquire and the VI_FREE clearing in
> vnlru_free_locked is necessarily a problem in production.
>

Checking it without any locks is perfectly valid in this case. It is
done after v_holdcnt gets bumped from a non-zero value, so at that
moment the count is at least two. Of course that result is stale, as an
arbitrary number of other threads could have bumped and dropped the ref
past that point, but the minimum value is 1 since we hold a ref
ourselves. That in turn means the vnode must not be on the free list,
and that is exactly what the assertion is verifying.

The problem is indeed the lack of ordering against the code clearing
the flag, for the case where two threads do vhold and one of them does
the 0->1 transition.
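Purely as an illustration of the required pairing, here is a minimal
userspace sketch in C11 atomics (emphatically not the kernel code;
flags/holdcnt/FLAG_FREE are stand-ins for v_iflag/v_holdcnt/VI_FREE,
and the CAS loop stands in for refcount_acquire_if_not_zero). Without
both fences, a weakly-ordered machine may observe the new count but a
stale flag, which is the POWER9 panic; on amd64 both fences degenerate
to compiler barriers, which is why it never fired there:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define FLAG_FREE	0x1

_Atomic unsigned flags = FLAG_FREE;	/* vnode starts on the free list */
atomic_int holdcnt = 0;

/* Free-list side, as in vnlru_free_locked(): clear flag, then 0->1 ref. */
void *
take_off_free_list(void *arg)
{
	(void)arg;
	atomic_fetch_and_explicit(&flags, ~FLAG_FREE, memory_order_relaxed);
	/* Release fence: the flag clear must be visible before the ref is. */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_add_explicit(&holdcnt, 1, memory_order_relaxed);
	return (NULL);
}

/* vhold side: take a ref iff the count is non-zero, then check the flag. */
void *
hold_if_active(void *arg)
{
	int old;

	(void)arg;
	old = atomic_load_explicit(&holdcnt, memory_order_relaxed);
	while (old != 0) {
		if (atomic_compare_exchange_weak_explicit(&holdcnt, &old,
		    old + 1, memory_order_relaxed, memory_order_relaxed)) {
			/* Acquire fence: pairs with the release fence above. */
			atomic_thread_fence(memory_order_acquire);
			assert((atomic_load_explicit(&flags,
			    memory_order_relaxed) & FLAG_FREE) == 0);
			break;
		}
	}
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, take_off_free_list, NULL);
	pthread_create(&b, NULL, hold_if_active, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return (0);
}

(Builds with e.g. cc -std=c11 -pthread; a single run of course proves
nothing, the point is only to show where the two fences sit.)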
That said, the fence is required for the assertion to work.

> It might just be that unlocked assertions about v_iflag actually need
> additional synchronization (although it would be a little unfortunate
> to add synchronization only to INVARIANTS builds).
>
> >
> >> Third, is the fence needed for the non-invariants case?
> >>
> >> Fourth, doesn't v_usecount have the same issues WRT inactivation?
> >
> > diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
> > index 286a871c3631..c97a8ba63612 100644
> > --- a/sys/kern/vfs_subr.c
> > +++ b/sys/kern/vfs_subr.c
> > @@ -1018,6 +1018,7 @@ vnlru_free_locked(int count, struct vfsops *mnt_op)
> >                  */
> >                 freevnodes--;
> >                 vp->v_iflag &= ~VI_FREE;
> > +               atomic_thread_fence_rel();
>
> This probably has no added value for non-debug kernels.
>
> >                 refcount_acquire(&vp->v_holdcnt);
> >
> >                 mtx_unlock(&vnode_free_list_mtx);
> > @@ -2807,9 +2808,7 @@ _vhold(struct vnode *vp, bool locked)
> >         CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
> >         if (!locked) {
> >                 if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) {
> > -#if !defined(__amd64__) && !defined(__i386__)
> > -                       mb();
> > -#endif
> > +                       atomic_thread_fence_acq();
>
> Same as above.
>
> >                         VNASSERT((vp->v_iflag & VI_FREE) == 0, vp,
> >                             ("_vhold: vnode with holdcnt is free"));
> >                         return;

The main codepath which runs into this (... -> cache_lookup -> vhold)
most definitely does not need the fence for production operation. What
is unclear, without an audit, is whether there are vhold users which do
need one.

I think the patch is fine to go in, provided the other VI_FREE place
gets a comment about the fence stemming from mtx_unlock, and there is
another comment prior to the assertion explaining that it orders
against the v_iflag test and may or may not be needed for other
consumers.

In general the code is *full* of data races and accidental reliance on
the ordering provided by amd64. Weeding all of this out will be a
painful exercise. Part of the problem is the lack of primitives like
READ_ONCE/WRITE_ONCE as seen in the Linux kernel; someone should hack
up equivalents.
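A rough sketch of what such equivalents could look like, following the
volatile-cast trick Linux uses (the names and definitions here are
hypothetical, not an existing FreeBSD KPI):

/*
 * Hypothetical sketch only, modeled on the Linux approach; these names
 * do not exist in the FreeBSD tree.
 */
#define	READ_ONCE(x)	(*(volatile __typeof__(x) *)&(x))
#define	WRITE_ONCE(x, v)						\
	do {								\
		*(volatile __typeof__(x) *)&(x) = (v);			\
	} while (0)

The volatile access forces the compiler to emit exactly one load or
store and not tear, fuse or re-materialize it (for naturally aligned
word-sized types). It provides no CPU-level ordering; that still has to
come from fences as above. The win is making every lockless access
greppable and deliberate instead of an accident that happens to work
on amd64.

-- 
Mateusz Guzik <mjguzik gmail.com>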