Date: Sat, 12 Nov 2016 17:11:57 -0800 (PST) From: Don Lewis <truckman@FreeBSD.org> To: mjguzik@gmail.com Cc: freebsd-current@FreeBSD.org, mjg@FreeBSD.org Subject: Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447 Message-ID: <201611130112.uAD1Bv9V081063@gw.catspoiler.org> In-Reply-To: <20161112175347.GA14804@dft-labs.eu>
next in thread | previous in thread | raw e-mail | index | archive | help
On 12 Nov, Mateusz Guzik wrote: > On Fri, Nov 11, 2016 at 11:29:48PM -0800, Don Lewis wrote: >> On 11 Nov, To: freebsd-current@FreeBSD.org wrote: >> > Yesterday I upgraded my ports builder box from r306349 to r308477. It >> > crashed overnight during a poudriere run. >> > >> > panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 >> >> This problem is very reproduceable for me, so I added some assertions to >> the code to try to track it down: >> >> panic: Bad link elm 0xfffff802a59cb690 prev->next != elm >> #13 0xffffffff80b0d55a in cache_negative_remove (ncp=0xfffff802a59cb690, >> neg_locked=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:747 >> #14 0xffffffff80b0d84a in cache_zap_locked (ncp=0xfffff802a59cb690, >> neg_locked=false) at /usr/src/sys/kern/vfs_cache.c:890 >> #15 0xffffffff80b0ab98 in cache_zap_rlocked_bucket (ncp=0xfffff802a59cb690, >> blp=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:1021 >> #16 cache_lookup (dvp=<optimized out>, vpp=<optimized out>, >> cnp=<optimized out>, tsp=<optimized out>, ticksp=<optimized out>) >> at /usr/src/sys/kern/vfs_cache.c:1308 >> #17 0xffffffff80b0e97c in vfs_cache_lookup (ap=<optimized out>) >> at /usr/src/sys/kern/vfs_cache.c:2024 >> #18 0xffffffff8103fcea in VOP_LOOKUP_APV (vop=<optimized out>, >> a=<optimized out>) at vnode_if.c:127 >> #19 0xffffffff80b17ed2 in VOP_LOOKUP (vpp=0xfffffe085bfa86e8, >> cnp=0xfffffe085bfa8710, dvp=<optimized out>) at ./vnode_if.h:54 >> #20 lookup (ndp=<optimized out>) at /usr/src/sys/kern/vfs_lookup.c:830 >> #21 0xffffffff80b1748e in namei (ndp=<optimized out>) >> at /usr/src/sys/kern/vfs_lookup.c:397 >> #22 0xffffffff80b30bc4 in kern_renameat (td=<optimized out>, oldfd=10, >> old=0x28e72a59 <error: Cannot access memory at address 0x28e72a59>, >> newfd=10, >> new=0x28e72601 <error: Cannot access memory at address 0x28e72601>, >> pathseg=<optimized out>) at /usr/src/sys/kern/vfs_syscalls.c:3515 >> #23 0xffffffff80ff98a0 in syscallenter (td=0xfffff801ebfd8500, >> sa=<optimized out>) >> at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135 >> #24 ia32_syscall (frame=0xfffffe085bfa8a40) >> at /usr/src/sys/amd64/ia32/ia32_syscall.c:187 >> #25 <signal handler called> >> >> >> #13 0xffffffff80b0d55a in cache_negative_remove (ncp=0xfffff802a59cb690, >> neg_locked=<optimized out>) at /usr/src/sys/kern/vfs_cache.c:747 >> 747 TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst); >> (kgdb) list >> 742 } else { >> 743 mtx_assert(&neglist->nl_lock, MA_OWNED); >> 744 mtx_assert(&ncneg_hot.nl_lock, MA_OWNED); >> 745 } >> 746 if (ncp->nc_flag & NCF_HOTNEGATIVE) { >> 747 TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst); >> 748 } else { >> 749 TAILQ_REMOVE(&neglist->nl_list, ncp, nc_dst); >> 750 } >> 751 if (list_locked) >> >> (kgdb) print ncp >> $1 = (struct namecache *) 0xfffff802a59cb690 >> >> (kgdb) print *ncp >> $2 = {nc_hash = {le_next = 0x0, le_prev = 0xfffffe0001814178}, nc_src = { >> le_next = 0xfffff80415e22290, le_prev = 0xfffff8048e5597a8}, nc_dst = { >> tqe_next = 0xfffff80421744a10, tqe_prev = 0xfffff80421744a30}, >> nc_dvp = 0xfffff8048e559760, n_un = {nu_vp = 0x0, nu_neghits = 0}, >> nc_flag = 32 ' ', nc_nlen = 25 '\031', >> nc_name = 0xfffff802a59cb6d2 "gtk3-icon-browser.desktop"} >> (kgdb) print ncp->nc_dst->tqe_prev >> $8 = (struct namecache **) 0xfffff80421744a30 >> (kgdb) print *(ncp->nc_dst->tqe_prev) >> $9 = (struct namecache *) 0x0 >> >> >> The line number for the failed assertion doesn't make sense. The value >> of nc_flag is 32 (0x20), but NCF_HOTNEGATIVE is defined as 0x40 and >> NCF_NEGATIVE is defined as 0x20. In any case, the list seems to be >> messed up. >> > > Can you check if the problem persists after reverting r307650 and > r307685? > > I got another report of trouble with poudriere, but so far can neither > reproduce nor diagnose what happens here. > > For instance with this crash, in principle there could be a race where > the entry is demoted and loses the HOTNEGATIVE. But the demotion code > has the hot list locked and so does this code if it spots the flag. > What's more, the race is handled by re-checking the flag after the lock > was taken. > > That said, there is something seriously fishing going on here. This !neg_locked code in cache_negative_remove() looks suspicious to me: if (!neg_locked) { if (ncp->nc_flag & NCF_HOTNEGATIVE) { hot_locked = true; mtx_lock(&ncneg_hot.nl_lock); if (!(ncp->nc_flag & NCF_HOTNEGATIVE)) { list_locked = true; mtx_lock(&neglist->nl_lock); } } else { list_locked = true; mtx_lock(&neglist->nl_lock); } It looks like you handle NCF_HOTNEGATIVE going away while waiting for the ncneg_hot.nl_lock, but don't handle the possible appearance of NCF_HOTNEGATIVE while waiting for neglist->nl_lock. What protects nc_flag, the lock for the list that it resides on? I think the only safe thing to do here is to always grab both locks. My package builder is now up to 18 1/2 hours of runtime building packages with the same kernel that previously crashed in a few hours.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201611130112.uAD1Bv9V081063>