From owner-freebsd-current@freebsd.org Sat Nov 12 07:29:57 2016 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3A3E8C3D79F for ; Sat, 12 Nov 2016 07:29:57 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (unknown [IPv6:2602:304:b010:ef20::f2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "gw.catspoiler.org", Issuer "gw.catspoiler.org" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id CE1F918D6; Sat, 12 Nov 2016 07:29:56 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.15.2/8.15.2) with ESMTP id uAC7Tm9B077551; Fri, 11 Nov 2016 23:29:52 -0800 (PST) (envelope-from truckman@FreeBSD.org) Message-Id: <201611120729.uAC7Tm9B077551@gw.catspoiler.org> Date: Fri, 11 Nov 2016 23:29:48 -0800 (PST) From: Don Lewis Subject: Re: panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 in 12.0-CURRENT r308447 To: freebsd-current@FreeBSD.org cc: mjg@FreeBSD.org In-Reply-To: <201611112356.uABNu3xY069782@gw.catspoiler.org> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 12 Nov 2016 07:29:57 -0000 On 11 Nov, To: freebsd-current@FreeBSD.org wrote: > Yesterday I upgraded my ports builder box from r306349 to r308477. It > crashed overnight during a poudriere run. > > panic: mutex ncnegl not owned at /usr/src/sys/kern/vfs_cache.c:743 > cpuid = 2 > KDB: stack backtrace: > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe085c2a4100 > vpanic() at vpanic+0x182/frame 0xfffffe085c2a4180 > panic() at panic+0x43/frame 0xfffffe085c2a41e0 > __mtx_assert() at __mtx_assert+0xc1/frame 0xfffffe085c2a41f0 > cache_negative_remove() at cache_negative_remove+0x65/frame 0xfffffe085c2a4230 > cache_zap_locked() at cache_zap_locked+0x1ca/frame 0xfffffe085c2a4250 > cache_enter_time() at cache_enter_time+0x1c20/frame 0xfffffe085c2a4330 > tmpfs_lookup() at tmpfs_lookup+0x47a/frame 0xfffffe085c2a4390 > VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0xda/frame 0xfffffe085c2a43c0 > vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfffffe085c2a4420 > VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0xda/frame 0xfffffe085c2a4450 > lookup() at lookup+0x6a2/frame 0xfffffe085c2a44e0 > namei() at namei+0x57e/frame 0xfffffe085c2a45a0 > vn_open_cred() at vn_open_cred+0x25b/frame 0xfffffe085c2a4710 > kern_openat() at kern_openat+0x25c/frame 0xfffffe085c2a48a0 > ia32_syscall() at ia32_syscall+0x330/frame 0xfffffe085c2a4a30 > Xint0x80_syscall() at Xint0x80_syscall+0x95/frame 0xfffffe085c2a4a30 > --- syscall (499, FreeBSD ELF32, sys_openat), rip = 0x97b2f57, rsp = 0xffff7698, rbp = 0xffff76b0 --- > KDB: enter: panic > > > I was able to get a crash dump. This is the kgdb backtrace: > > #0 __curthread () at ./machine/pcpu.h:222 > #1 doadump (textdump=1546271680) at /usr/src/sys/kern/kern_shutdown.c:298 > #2 0xffffffff80396db6 in db_fncall_generic (nargs=0, addr=, > rv=, args=) > at /usr/src/sys/ddb/db_command.c:581 > #3 db_fncall (dummy1=, dummy2=, > dummy3=, dummy4=) > at /usr/src/sys/ddb/db_command.c:629 > #4 0xffffffff80396919 in db_command (last_cmdp=, > cmd_table=, dopager=) > at /usr/src/sys/ddb/db_command.c:453 > #5 0xffffffff80396674 in db_command_loop () > at /usr/src/sys/ddb/db_command.c:506 > #6 0xffffffff8039972f in db_trap (type=, code=) > at /usr/src/sys/ddb/db_main.c:248 > #7 0xffffffff80aa0a13 in kdb_trap (type=, > code=, tf=) > at /usr/src/sys/kern/subr_kdb.c:654 > #8 0xffffffff80ed8cc4 in trap (frame=0xfffffe085c2a4030) > at /usr/src/sys/amd64/amd64/trap.c:537 > #9 > #10 kdb_enter (why=0xffffffff8140e5d1 "panic", > msg=0x80 ) > at /usr/src/sys/kern/subr_kdb.c:444 > #11 0xffffffff80a5eb1f in vpanic (fmt=, ap=0xfffffe085c2a41c0) > at /usr/src/sys/kern/kern_shutdown.c:752 > #12 0xffffffff80a5eb83 in panic (fmt=0xffffffff81bf1de0 "\004") > at /usr/src/sys/kern/kern_shutdown.c:690 > #13 0xffffffff80a40191 in __mtx_assert (c=, > what=, file=0xfffffe085c2a3fe0 "", line=0) > at /usr/src/sys/kern/kern_mutex.c:937 > #14 0xffffffff80b0d3d5 in cache_negative_remove (ncp=0xfffff8062bb8dee0, > neg_locked=true) at /usr/src/sys/kern/vfs_cache.c:743 > #15 0xffffffff80b0d87a in cache_zap_locked (ncp=0xfffff8062bb8dee0, > neg_locked=true) at /usr/src/sys/kern/vfs_cache.c:886 > #16 0xffffffff80b0d210 in cache_negative_zap_one () > at /usr/src/sys/kern/vfs_cache.c:835 > #17 cache_enter_time (dvp=, vp=, > cnp=, tsp=, dtsp=) > at /usr/src/sys/kern/vfs_cache.c:1741 > #18 0xffffffff82893b3a in tmpfs_lookup (v=) > at /usr/src/sys/modules/tmpfs/../../fs/tmpfs/tmpfs_vnops.c:199 > #19 0xffffffff8103ee4a in VOP_CACHEDLOOKUP_APV (vop=, > a=) at vnode_if.c:195 > #20 0xffffffff80b0e9d6 in VOP_CACHEDLOOKUP (dvp=0xfffff8023f3693b0, > vpp=0xfffffe085c2a47d8, cnp=0xfffffe085c2a4800) at ./vnode_if.h:80 > #21 vfs_cache_lookup (ap=) at /usr/src/sys/kern/vfs_cache.c:2022 > #22 0xffffffff8103ecea in VOP_LOOKUP_APV (vop=, > a=) at vnode_if.c:127 > #23 0xffffffff80b17df2 in VOP_LOOKUP (vpp=0xfffffe085c2a47d8, > cnp=0xfffffe085c2a4800, dvp=) at ./vnode_if.h:54 > #24 lookup (ndp=) at /usr/src/sys/kern/vfs_lookup.c:830 > #25 0xffffffff80b173ae in namei (ndp=) > at /usr/src/sys/kern/vfs_lookup.c:397 > #26 0xffffffff80b333db in vn_open_cred (ndp=, > flagp=0xfffffe085c2a485c, cmode=0, vn_open_flags=, > cred=, fp=0x1d0) at /usr/src/sys/kern/vfs_vnops.c:277 > #27 0xffffffff80b2c72c in kern_openat (td=0xfffff8001d68d000, fd=-100, > path=0x2c9e71e8 , > pathseg=UIO_USERSPACE, > flags=, > mode=) at /usr/src/sys/kern/vfs_syscalls.c:1016 > #28 0xffffffff80ff88a0 in syscallenter (td=0xfffff8001d68d000, > sa=) > at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135 > #29 ia32_syscall (frame=0xfffffe085c2a4a40) > at /usr/src/sys/amd64/ia32/ia32_syscall.c:187 > #30 > #31 0x00000000097b2f57 in ?? () > Backtrace stopped: Cannot access memory at address 0xffff7698 > > > I don't see any obvious issues in the code. This problem is very reproduceable for me, so I added some assertions to the code to try to track it down: %svn diff sys/kern Index: sys/kern/vfs_cache.c =================================================================== --- sys/kern/vfs_cache.c (revision 308477) +++ sys/kern/vfs_cache.c (working copy) @@ -776,6 +776,7 @@ mtx_unlock(&neglist->nl_lock); } + MPASS(ncp == NULL || neglist == NCP2NEGLIST(ncp)); [2] *neglistpp = neglist; *ncpp = ncp; } @@ -815,6 +816,7 @@ } MPASS(ncp->nc_flag & NCF_NEGATIVE); + MPASS(neglist == NCP2NEGLIST(ncp)); [1] dvlp = VP2VNODELOCK(ncp->nc_dvp); blp = NCP2BUCKETLOCK(ncp); mtx_unlock(&neglist->nl_lock); @@ -832,6 +834,8 @@ SDT_PROBE3(vfs, namecache, shrink_negative, done, ncp->nc_dvp, nc_get_name(ncp), ncp->nc_neghits); + MPASS(neglist == NCP2NEGLIST(ncp)); + mtx_assert(&neglist->nl_lock, MA_OWNED); cache_zap_locked(ncp, true); out_unlock_all: mtx_unlock(&neglist->nl_lock); My first attempt was to add all of the above except [2]. When I re-ran poudriere, I got a panic at [1]. Since it looked like maybe ncp was on the wrong list, I added the assertion at [2] and got a different panic: panic: Bad link elm 0xfffff802a59cb690 prev->next != elm cpuid = 2 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe085bfa8230 vpanic() at vpanic+0x182/frame 0xfffffe085bfa82b0 panic() at panic+0x43/frame 0xfffffe085bfa8310 cache_negative_remove() at cache_negative_remove+0x21a/frame 0xfffffe085bfa8350 cache_zap_locked() at cache_zap_locked+0x1ca/frame 0xfffffe085bfa8370 cache_lookup() at cache_lookup+0xe58/frame 0xfffffe085bfa8440 vfs_cache_lookup() at vfs_cache_lookup+0xac/frame 0xfffffe085bfa84a0 VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0xda/frame 0xfffffe085bfa84d0 lookup() at lookup+0x6a2/frame 0xfffffe085bfa8560 namei() at namei+0x57e/frame 0xfffffe085bfa8620 kern_renameat() at kern_renameat+0x1b4/frame 0xfffffe085bfa88a0 ia32_syscall() at ia32_syscall+0x330/frame 0xfffffe085bfa8a30 Xint0x80_syscall() at Xint0x80_syscall+0x95/frame 0xfffffe085bfa8a30 #11 0xffffffff80a5eb1f in vpanic (fmt=, ap=0xfffffe085bfa82f0) at /usr/src/sys/kern/kern_shutdown.c:752 #12 0xffffffff80a5eb83 in panic (fmt=0xffffffff81bf2de0 "\004") at /usr/src/sys/kern/kern_shutdown.c:690 #13 0xffffffff80b0d55a in cache_negative_remove (ncp=0xfffff802a59cb690, neg_locked=) at /usr/src/sys/kern/vfs_cache.c:747 #14 0xffffffff80b0d84a in cache_zap_locked (ncp=0xfffff802a59cb690, neg_locked=false) at /usr/src/sys/kern/vfs_cache.c:890 #15 0xffffffff80b0ab98 in cache_zap_rlocked_bucket (ncp=0xfffff802a59cb690, blp=) at /usr/src/sys/kern/vfs_cache.c:1021 #16 cache_lookup (dvp=, vpp=, cnp=, tsp=, ticksp=) at /usr/src/sys/kern/vfs_cache.c:1308 #17 0xffffffff80b0e97c in vfs_cache_lookup (ap=) at /usr/src/sys/kern/vfs_cache.c:2024 #18 0xffffffff8103fcea in VOP_LOOKUP_APV (vop=, a=) at vnode_if.c:127 #19 0xffffffff80b17ed2 in VOP_LOOKUP (vpp=0xfffffe085bfa86e8, cnp=0xfffffe085bfa8710, dvp=) at ./vnode_if.h:54 #20 lookup (ndp=) at /usr/src/sys/kern/vfs_lookup.c:830 #21 0xffffffff80b1748e in namei (ndp=) at /usr/src/sys/kern/vfs_lookup.c:397 #22 0xffffffff80b30bc4 in kern_renameat (td=, oldfd=10, old=0x28e72a59 , newfd=10, new=0x28e72601 , pathseg=) at /usr/src/sys/kern/vfs_syscalls.c:3515 #23 0xffffffff80ff98a0 in syscallenter (td=0xfffff801ebfd8500, sa=) at /usr/src/sys/amd64/ia32/../../kern/subr_syscall.c:135 #24 ia32_syscall (frame=0xfffffe085bfa8a40) at /usr/src/sys/amd64/ia32/ia32_syscall.c:187 #25 #13 0xffffffff80b0d55a in cache_negative_remove (ncp=0xfffff802a59cb690, neg_locked=) at /usr/src/sys/kern/vfs_cache.c:747 747 TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst); (kgdb) list 742 } else { 743 mtx_assert(&neglist->nl_lock, MA_OWNED); 744 mtx_assert(&ncneg_hot.nl_lock, MA_OWNED); 745 } 746 if (ncp->nc_flag & NCF_HOTNEGATIVE) { 747 TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst); 748 } else { 749 TAILQ_REMOVE(&neglist->nl_list, ncp, nc_dst); 750 } 751 if (list_locked) (kgdb) print ncp $1 = (struct namecache *) 0xfffff802a59cb690 (kgdb) print *ncp $2 = {nc_hash = {le_next = 0x0, le_prev = 0xfffffe0001814178}, nc_src = { le_next = 0xfffff80415e22290, le_prev = 0xfffff8048e5597a8}, nc_dst = { tqe_next = 0xfffff80421744a10, tqe_prev = 0xfffff80421744a30}, nc_dvp = 0xfffff8048e559760, n_un = {nu_vp = 0x0, nu_neghits = 0}, nc_flag = 32 ' ', nc_nlen = 25 '\031', nc_name = 0xfffff802a59cb6d2 "gtk3-icon-browser.desktop"} (kgdb) print ncp->nc_dst->tqe_prev $8 = (struct namecache **) 0xfffff80421744a30 (kgdb) print *(ncp->nc_dst->tqe_prev) $9 = (struct namecache *) 0x0 The line number for the failed assertion doesn't make sense. The value of nc_flag is 32 (0x20), but NCF_HOTNEGATIVE is defined as 0x40 and NCF_NEGATIVE is defined as 0x20. In any case, the list seems to be messed up.