From owner-svn-src-all@freebsd.org Fri Jun 8 17:38:07 2018 Return-Path: Delivered-To: svn-src-all@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 14C1A10201F0; Fri, 8 Jun 2018 17:38:07 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 927CD6C45B; Fri, 8 Jun 2018 17:38:06 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTP id w58HbtVK013373; Fri, 8 Jun 2018 20:37:58 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua w58HbtVK013373 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id w58Hbtbb013372; Fri, 8 Jun 2018 20:37:55 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 8 Jun 2018 20:37:55 +0300 From: Konstantin Belousov To: Ryan Libby Cc: Mateusz Guzik , Mark Johnston , Justin Hibbits , src-committers , svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r334708 - head/sys/kern Message-ID: <20180608173755.GJ2450@kib.kiev.ua> References: <201806061257.w56CvCwq089369@repo.freebsd.org> <20180606140311.GU2450@kib.kiev.ua> <20180608033242.GA54099@pesky> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.0 (2018-05-17) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.26 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 08 Jun 2018 17:38:07 -0000 On Thu, Jun 07, 2018 at 11:02:29PM -0700, Ryan Libby wrote: > On Thu, Jun 7, 2018 at 10:03 PM, Mateusz Guzik wrote: > > On Fri, Jun 8, 2018 at 6:29 AM, Ryan Libby wrote: > >> > >> On Thu, Jun 7, 2018 at 8:32 PM, Mark Johnston wrote: > >> > On Wed, Jun 06, 2018 at 05:03:11PM +0300, Konstantin Belousov wrote: > >> >> On Wed, Jun 06, 2018 at 12:57:12PM +0000, Justin Hibbits wrote: > >> >> > Author: jhibbits > >> >> > Date: Wed Jun 6 12:57:11 2018 > >> >> > New Revision: 334708 > >> >> > URL: https://svnweb.freebsd.org/changeset/base/334708 > >> >> > > >> >> > Log: > >> >> > Add a memory barrier after taking a reference on the vnode holdcnt > >> >> > in _vhold > >> >> > > >> >> > This is needed to avoid a race between the VNASSERT() below, and > >> >> > another > >> >> > thread updating the VI_FREE flag, on weakly-ordered architectures. > >> >> > > >> >> > On a 72-thread POWER9, without this barrier a 'make -j72 > >> >> > buildworld' would > >> >> > panic on the assert regularly. > >> >> > > >> >> > It may be possible to use a weaker barrier, and I'll investigate > >> >> > that once > >> >> > all stability issues are worked out on POWER9. > >> >> > > >> >> > Modified: > >> >> > head/sys/kern/vfs_subr.c > >> >> > > >> >> > Modified: head/sys/kern/vfs_subr.c > >> >> > > >> >> > ============================================================================== > >> >> > --- head/sys/kern/vfs_subr.c Wed Jun 6 10:46:24 2018 > >> >> > (r334707) > >> >> > +++ head/sys/kern/vfs_subr.c Wed Jun 6 12:57:11 2018 > >> >> > (r334708) > >> >> > @@ -2807,6 +2807,9 @@ _vhold(struct vnode *vp, bool locked) > >> >> > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > >> >> > if (!locked) { > >> >> > if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) { > >> >> > +#if !defined(__amd64__) && !defined(__i386__) > >> >> > + mb(); > >> >> > +#endif > >> >> > VNASSERT((vp->v_iflag & VI_FREE) == 0, vp, > >> >> > ("_vhold: vnode with holdcnt is free")); > >> >> > return; > >> >> First, mb() must not be used in the FreeBSD code at all. > >> >> Look at atomic_thread_fenceXXX(9) KPI. > >> >> > >> >> Second, you need the reciprocal fence between clearing of VI_FREE and > >> >> refcount_acquire(), otherwise the added barrier is nop. Most likely, > >> >> you got away with it because there is a mutex unlock between clearing > >> >> of VI_FREE and acquire, which release semantic you abused. > >> > > >> > I note that vnlru_free_locked() clears VI_FREE and increments v_holdcnt > >> > without an intervening release fence. At this point the caller has not > >> > purged the vnode from the name cache, so it seems possible that the > >> > panicking thread observed the two stores out of order. In particular, it > >> > seems to me that the patch below is necessary, but possibly (probably?) > >> > not sufficient: > >> > >> Mark, Justin, and I looked at this. > >> > >> I think that the VNASSERT itself is not quite valid, since it is > >> checking the value of v_iflag without the vnode interlock held (and > >> without the vop lock also). If the rule is that we normally need the > >> vnode interlock to check VI_FREE, then I don't think that possible > >> reordering between the refcount_acquire and VI_FREE clearing in > >> vnlru_free_locked is necessarily a problem in production. > > > > > > Checking it without any locks is perfectly valid in this case. It is done > > after v_holdcnt gets bumped from a non-zero value. So at that time it > > is at least two. Of course that result is stale as an arbitrary number of > > other threads could have bumped and dropped the ref past that point. > > The minimum value is 1 since we hold the ref. But this means the > > vnode must not be on the free list and that's what the assertion is > > verifying. > > > > The problem is indeed lack of ordering against the code clearing the > > flag for the case where 2 threads to vhold and one does the 0->1 > > transition. > > > > That said, the fence is required for the assertion to work. > > > > Yeah, I agree with this logic. What I mean is that reordering between > v_holdcnt 0->1 and v_iflag is normally settled by the release and > acquisition of the vnode interlock, which we are supposed to hold for > v_*i*flag. A quick scan seems to show all of the checks of VI_FREE that > are not asserts do hold the vnode interlock. So, I'm just saying that I > don't think the possible reordering affects them. But do we know that only VI_FREE checks are affected ? My concern is that users of _vhold() rely on seeing up to date state of the vnode, and VI_FREE is only an example of the problem. Most likely, the code which fetched the vnode pointer before _vhold() call, should guarantee visibility. > > >> > >> > >> It might just be that unlocked assertions about v_iflag actually need > >> additional synchronization (although it would be a little unfortunate to > >> add synchronization only to INVARIANTS builds). > >> > >> > > >> >> Does the fence needed for the non-invariants case ? > >> >> > >> >> Fourth, doesn't v_usecount has the same issues WRT inactivation ? > >> > > >> > diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c > >> > index 286a871c3631..c97a8ba63612 100644 > >> > --- a/sys/kern/vfs_subr.c > >> > +++ b/sys/kern/vfs_subr.c > >> > @@ -1018,6 +1018,7 @@ vnlru_free_locked(int count, struct vfsops > >> > *mnt_op) > >> > */ > >> > freevnodes--; > >> > vp->v_iflag &= ~VI_FREE; > >> > + atomic_thread_fence_rel(); > > > > > > This probably has no added value for non-debug kernels. > > > >> > >> > refcount_acquire(&vp->v_holdcnt); > >> > > >> > mtx_unlock(&vnode_free_list_mtx); > >> > @@ -2807,9 +2808,7 @@ _vhold(struct vnode *vp, bool locked) > >> > CTR2(KTR_VFS, "%s: vp %p", __func__, vp); > >> > if (!locked) { > >> > if (refcount_acquire_if_not_zero(&vp->v_holdcnt)) { > >> > -#if !defined(__amd64__) && !defined(__i386__) > >> > - mb(); > >> > -#endif > >> > + atomic_thread_fence_acq(); > > > > > > Same as above. > > > >> > >> > VNASSERT((vp->v_iflag & VI_FREE) == 0, vp, > >> > ("_vhold: vnode with holdcnt is free")); > >> > return; > > > > > > The main codepath which runs into this (... -> cache_lookup -> vhold) most > > definitely does not need the fence for production operation. > > > > What is unclear without audit is whether there are vhold users which need > > one. I think the patch is fine to go in if the other VI_FREE place gets a > > comment about a fence stemming from mtx_unlock and there is another one > > prior to the assertion explaining that this orders against v_iflag tests and > > may > > or may not be needed for other consumers. > > > > In general the code is *full* of data races and accidental reliance of > > ordering > > provided by amd64. Weeding this all out will be a painful exercise. > > > > Part of the problem is lack of primitives like READ_ONCE/WRITE_ONCE as > > seen in the linux kernel, someone should hack up equivalents. > > > > -- > > Mateusz Guzik > > Yeah, I think we should understand whether the problem is just the > unsynchronized assertions or whether there are some true bugs. It > doesn't seem great to me to add synchronization which as you said > > > This probably has no added value for non-debug kernels.