From owner-freebsd-current@freebsd.org Sun Nov 13 21:07:39 2016 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id CC6ADC40F2F for ; Sun, 13 Nov 2016 21:07:39 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (unknown [IPv6:2602:304:b010:ef20::f2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "gw.catspoiler.org", Issuer "gw.catspoiler.org" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 958811F2C; Sun, 13 Nov 2016 21:07:39 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.15.2/8.15.2) with ESMTP id uADL7S7a083951; Sun, 13 Nov 2016 13:07:32 -0800 (PST) (envelope-from truckman@FreeBSD.org) Message-Id: <201611132107.uADL7S7a083951@gw.catspoiler.org> Date: Sun, 13 Nov 2016 13:07:28 -0800 (PST) From: Don Lewis Subject: Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447 To: mjguzik@gmail.com cc: freebsd-current@FreeBSD.org, mjg@FreeBSD.org In-Reply-To: <20161113084054.GB14804@dft-labs.eu> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 13 Nov 2016 21:07:39 -0000 The machine managed to survive 21 hours of building FreeBSD 12 ports. All of the panic happened while building FreeBSD 10 ports, which build faster. That may be enough to affect timing and affect the likelyhood of a race condition triggering the panic. On 13 Nov, Mateusz Guzik wrote: > On Sat, Nov 12, 2016 at 05:11:57PM -0800, Don Lewis wrote: >> This !neg_locked code in cache_negative_remove() looks suspicious to me: >> >> if (!neg_locked) { >> if (ncp->nc_flag & NCF_HOTNEGATIVE) { >> hot_locked = true; >> mtx_lock(&ncneg_hot.nl_lock); >> if (!(ncp->nc_flag & NCF_HOTNEGATIVE)) { >> list_locked = true; >> mtx_lock(&neglist->nl_lock); >> } >> } else { >> list_locked = true; >> mtx_lock(&neglist->nl_lock); >> } >> >> It looks like you handle NCF_HOTNEGATIVE going away while waiting for >> the ncneg_hot.nl_lock, but don't handle the possible appearance of >> NCF_HOTNEGATIVE while waiting for neglist->nl_lock. >> > > Promotions from regular to hot are only possible on a hit, which is > prevented by the caller holding both the vnode and bucket lock. The only > way to see the entry at this stage is from the shrinker, which can > demote it and then take all the locks to try to remove it. But it checks > after locking if the node is still there. > >> What protects nc_flag, the lock for the list that it resides on? >> > > It is supposed to be the hot list lock and I think this uncovers a bug > here. > > Consider a NCF_HOTNEGATIVE entry which is being evicted. It sets the > NCV_DVDROP flag without the lock held, but the entry is still not > removed from negative lists. So in principle we can either lose the > newly set flag or the information that hotnegative is unset. > > That said, I think the fix would be to remove from negative entries > prior to setting the NCV_DVDROP flag. > > Normally the flag is protected by the hotlist lock. > > Untested, but should do the trick: > > --- vfs_cache.c.old 2016-11-13 09:37:50.096000000 +0100 > +++ vfs_cache.c.new 2016-11-13 09:39:45.004000000 +0100 > @@ -868,6 +868,13 @@ > nc_get_name(ncp), ncp->nc_neghits); > } > LIST_REMOVE(ncp, nc_hash); > + if (!(ncp->nc_flag & NCF_NEGATIVE)) { > + TAILQ_REMOVE(&ncp->nc_vp->v_cache_dst, ncp, nc_dst); > + if (ncp == ncp->nc_vp->v_cache_dd) > + ncp->nc_vp->v_cache_dd = NULL; > + } else { > + cache_negative_remove(ncp, neg_locked); > + } > if (ncp->nc_flag & NCF_ISDOTDOT) { > if (ncp == ncp->nc_dvp->v_cache_dd) > ncp->nc_dvp->v_cache_dd = NULL; > @@ -878,13 +885,6 @@ > atomic_subtract_rel_long(&numcachehv, 1); > } > } > - if (!(ncp->nc_flag & NCF_NEGATIVE)) { > - TAILQ_REMOVE(&ncp->nc_vp->v_cache_dst, ncp, nc_dst); > - if (ncp == ncp->nc_vp->v_cache_dd) > - ncp->nc_vp->v_cache_dd = NULL; > - } else { > - cache_negative_remove(ncp, neg_locked); > - } > atomic_subtract_rel_long(&numcache, 1); > } I will give this a try.