Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 12 Nov 2016 18:53:47 +0100
From:      Mateusz Guzik <mjguzik@gmail.com>
To:        Don Lewis <truckman@FreeBSD.org>
Cc:        freebsd-current@FreeBSD.org, mjg@FreeBSD.org
Subject:   Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447
Message-ID:  <20161112175347.GA14804@dft-labs.eu>
In-Reply-To: <201611120729.uAC7Tm9B077551@gw.catspoiler.org>
References:  <201611112356.uABNu3xY069782@gw.catspoiler.org> <201611120729.uAC7Tm9B077551@gw.catspoiler.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Nov 11, 2016 at 11:29:48PM -0800, Don Lewis wrote:
> On 11 Nov, To: freebsd-current@FreeBSD.org wrote:
> > Yesterday I upgraded my ports builder box from r306349 to r308477.  It
> > crashed overnight during a poudriere run.
> > 
> > panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743
> 
> This problem is very reproduceable for me, so I added some assertions to
> the code to try to track it down:
> 
> panic: Bad link elm 0xfffff802a59cb690 prev->next != elm
> #13 0xffffffff80b0d55a in cache_negative_remove (ncp=0xfffff802a59cb690, 
>     neg_locked=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:747
> #14 0xffffffff80b0d84a in cache_zap_locked (ncp=0xfffff802a59cb690, 
>     neg_locked=false) at /usr/src/sys/kern/vfs_cache.c:890
> #15 0xffffffff80b0ab98 in cache_zap_rlocked_bucket (ncp=0xfffff802a59cb690, 
>     blp=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:1021
> #16 cache_lookup (dvp=<optimized out>, vpp=<optimized out>, 
>     cnp=<optimized out>, tsp=<optimized out>, ticksp=<optimized out>)
>     at /usr/src/sys/kern/vfs_cache.c:1308
> #17 0xffffffff80b0e97c in vfs_cache_lookup (ap=<optimized out>)
>     at /usr/src/sys/kern/vfs_cache.c:2024
> #18 0xffffffff8103fcea in VOP_LOOKUP_APV (vop=<optimized out>, 
>     a=<optimized out>) at vnode_if.c:127
> #19 0xffffffff80b17ed2 in VOP_LOOKUP (vpp=0xfffffe085bfa86e8, 
>     cnp=0xfffffe085bfa8710, dvp=<optimized out>) at ./vnode_if.h:54
> #20 lookup (ndp=<optimized out>) at /usr/src/sys/kern/vfs_lookup.c:830
> #21 0xffffffff80b1748e in namei (ndp=<optimized out>)
>     at /usr/src/sys/kern/vfs_lookup.c:397
> #22 0xffffffff80b30bc4 in kern_renameat (td=<optimized out>, oldfd=10, 
>     old=0x28e72a59 <error: Cannot access memory at address 0x28e72a59>, 
>     newfd=10, 
>     new=0x28e72601 <error: Cannot access memory at address 0x28e72601>, 
>     pathseg=<optimized out>) at /usr/src/sys/kern/vfs_syscalls.c:3515
> #23 0xffffffff80ff98a0 in syscallenter (td=0xfffff801ebfd8500, 
>     sa=<optimized out>)
>     at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135
> #24 ia32_syscall (frame=0xfffffe085bfa8a40)
>     at /usr/src/sys/amd64/ia32/ia32_syscall.c:187
> #25 <signal handler called>
> 
> 
> #13 0xffffffff80b0d55a in cache_negative_remove (ncp=0xfffff802a59cb690, 
>     neg_locked=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:747
> 747			TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst);
> (kgdb) list
> 742		} else {
> 743			mtx_assert(&neglist->nl_lock, MA_OWNED);
> 744			mtx_assert(&ncneg_hot.nl_lock, MA_OWNED);
> 745		}
> 746		if (ncp->nc_flag & NCF_HOTNEGATIVE) {
> 747			TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst);
> 748		} else {
> 749			TAILQ_REMOVE(&neglist->nl_list, ncp, nc_dst);
> 750		}
> 751		if (list_locked)
> 
> (kgdb) print ncp
> $1 = (struct namecache *) 0xfffff802a59cb690
> 
> (kgdb) print *ncp
> $2 = {nc_hash = {le_next = 0x0, le_prev = 0xfffffe0001814178}, nc_src = {
>     le_next = 0xfffff80415e22290, le_prev = 0xfffff8048e5597a8}, nc_dst = {
>     tqe_next = 0xfffff80421744a10, tqe_prev = 0xfffff80421744a30}, 
>   nc_dvp = 0xfffff8048e559760, n_un = {nu_vp = 0x0, nu_neghits = 0}, 
>   nc_flag = 32 ' ', nc_nlen = 25 '\031', 
>   nc_name = 0xfffff802a59cb6d2 "gtk3-icon-browser.desktop"}
> (kgdb) print ncp->nc_dst->tqe_prev   
> $8 = (struct namecache **) 0xfffff80421744a30
> (kgdb) print *(ncp->nc_dst->tqe_prev)
> $9 = (struct namecache *) 0x0
> 
> 
> The line number for the failed assertion doesn't make sense.  The value
> of nc_flag is 32 (0x20), but NCF_HOTNEGATIVE is defined as 0x40 and
> NCF_NEGATIVE is defined as 0x20.  In any case, the list seems to be
> messed up.
> 

Can you check if the problem persists after reverting r307650 and
r307685?

I got another report of trouble with poudriere, but so far can neither
reproduce nor diagnose what happens here.

For instance with this crash, in principle there could be a race where
the entry is demoted and loses the HOTNEGATIVE. But the demotion code
has the hot list locked and so does this code if it spots the flag.
What's more, the race is handled by re-checking the flag after the lock
was taken.

That said, there is something seriously fishing going on here.
-- 
Mateusz Guzik <mjguzik gmail.com>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20161112175347.GA14804>