Date: Sun, 23 Aug 2015 15:22:07 -0600
From: "Justin T. Gibbs" <gibbs@scsiguy.com>
To: d@delphij.net
Cc: Don Lewis <truckman@FreeBSD.org>, freebsd-fs@FreeBSD.org, "Justin T. Gibbs" <gibbs@freebsd.org>, George Wilson <george.wilson@delphix.com>
Subject: Re: solaris assert: avl_is_empty(&dn->dn_dbufs) panic
Message-ID: <D9D6BA1D-30F3-4417-A3DA-E0253E6EA5A2@scsiguy.com>
In-Reply-To: <55D7700B.3080207@delphij.net>
References: <201508211748.t7LHmo96096088@gw.catspoiler.org> <55D7700B.3080207@delphij.net>
Hi,

I'll need a little time to fully reload the context for these changes.
However, reintroducing a blocking loop is not the right fix - it was a
hack in the original code. :-) My hunch is that removing the assert is
safe, but it would be nice to have a core dump to better understand why
the list isn't empty. (For reference, a rough sketch of the sort of
blocking loop in question is appended after the quoted thread below.)

--
Justin

> On Aug 21, 2015, at 12:38 PM, Xin Li <delphij@delphij.net> wrote:
> 
> Hi,
> 
> A quick glance at the changes suggests that Justin's changeset may be
> related. The reasoning is here:
> 
> https://reviews.csiden.org/r/131/
> 
> Related Illumos ticket:
> 
> https://www.illumos.org/issues/5056
> 
> In dnode_evict_dbufs(), remove multiple passes over dn->dn_dbufs.
> This is possible now that objset eviction is asynchronously
> completed in a different context once dbuf eviction completes.
> 
> In the case of objset eviction, any dbufs held by children will
> be evicted via dbuf_rele_and_unlock() once their refcounts go
> to zero. Even when objset eviction is not active, the ordering
> of the avl tree guarantees that children will be released before
> parents, allowing the parent's refcounts to naturally drop to
> zero before they are inspected in this single loop.
> 
> ====
> 
> So, upon return from dnode_evict_dbufs(), there could still be some
> DB_EVICTING buffers on the AVL tree pending release, which breaks the
> invariant.
> 
> Should we restore the loop where we yield briefly with the lock
> released, then reacquire and recheck?
> 
> Cheers,
> 
> On 08/21/15 10:48, Don Lewis wrote:
>> On 21 Aug, Don Lewis wrote:
>>> On 21 Aug, Don Lewis wrote:
>>>> I just started getting this panic:
>>>> 
>>>> solaris assert: avl_is_empty(&dn->dn_dbufs), file:
>>>> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c,
>>>> line 495
>>>> 
>>>> System info:
>>>> FreeBSD zipper.catspoiler.org 11.0-CURRENT FreeBSD 11.0-CURRENT #25
>>>> r286923: Wed Aug 19 09:28:53 PDT 2015
>>>> dl@zipper.catspoiler.org:/usr/obj/usr/src/sys/GENERIC amd64
>>>> 
>>>> My zfs pool has one mirrored vdev. Scrub doesn't find any problems.
>>>> 
>>>> %zpool status
>>>>   pool: zroot
>>>>  state: ONLINE
>>>>   scan: scrub repaired 0 in 2h58m with 0 errors on Fri Aug 21 00:44:52 2015
>>>> config:
>>>> 
>>>> 	NAME        STATE     READ WRITE CKSUM
>>>> 	zroot       ONLINE       0     0     0
>>>> 	  mirror-0  ONLINE       0     0     0
>>>> 	    ada0p3  ONLINE       0     0     0
>>>> 	    ada1p3  ONLINE       0     0     0
>>>> 
>>>> This panic is reproducible and happens every time I use poudriere to
>>>> build ports using my 9.3-RELEASE amd64 jail, and occurs at the end of the
>>>> poudriere run when it is unmounting filesystems.
>>>> 
>>>> [00:10:43] ====>> Stopping 4 builders
>>>> 93amd64-default-job-01: removed
>>>> 93amd64-default-job-01-n: removed
>>>> 93amd64-default-job-02: removed
>>>> 93amd64-default-job-02-n: removed
>>>> 93amd64-default-job-03: removed
>>>> 93amd64-default-job-03-n: removed
>>>> 93amd64-default-job-04: removed
>>>> 93amd64-default-job-04-n: removed
>>>> [00:10:46] ====>> Creating pkgng repository
>>>> Creating repository in /tmp/packages: 100%
>>>> Packing files for repository: 100%
>>>> [00:10:55] ====>> Committing packages to repository
>>>> [00:10:55] ====>> Removing old packages
>>>> [00:10:55] ====>> Built ports: devel/py-pymtbl net/sie-nmsg net/p5-Net-Nmsg net/axa
>>>> [93amd64-default] [2015-08-21_00h47m41s] [committing:] Queued: 4 Built: 4 Failed: 0 Skipped: 0 Ignored: 0 Tobuild: 0 Time: 00:10:53
>>>> [00:10:55] ====>> Logs: /var/poudriere/data/logs/bulk/93amd64-default/2015-08-21_00h47m41s
>>>> [00:10:55] ====>> Cleaning up
>>>> 93amd64-default: removed
>>>> 93amd64-default-n: removed
>>>> [00:10:55] ====>> Umounting file systems
>>>> Write failed: Broken pipe
>>>> 
>>>> Prior to that, I ran poudriere a number of times with a 10.2-STABLE
>>>> amd64 jail without incident.
>>>> 
>>>> I've kicked off a bunch of poudriere runs for other jails and
>>>> will check on it in the morning.
>>> 
>>> Died the same way after building ports on the first jail,
>>> 10.1-RELEASE amd64.
>>> 
>>> Since there have been some zfs commits since r286923, I upgraded to
>>> r286998 this morning and tried again with no better luck. I got the
>>> same panic again.
>>> 
>>> This machine has mirrored swap, and even though I've done what
>>> gmirror(8) says to do in order to capture crash dumps, I've had no luck
>>> with that. The dump is getting written, but savecore is unable to find
>>> it.
>> 
>> Not sure what is happening with savecore during boot, but I was able to
>> run it manually and collect the crash dump.
>> 
>> 
>> Unread portion of the kernel message buffer:
>> panic: solaris assert: avl_is_empty(&dn->dn_dbufs), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c, line: 495
>> cpuid = 1
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0859e4e4e0
>> vpanic() at vpanic+0x189/frame 0xfffffe0859e4e560
>> panic() at panic+0x43/frame 0xfffffe0859e4e5c0
>> assfail() at assfail+0x1a/frame 0xfffffe0859e4e5d0
>> dnode_sync() at dnode_sync+0x6c8/frame 0xfffffe0859e4e6b0
>> dmu_objset_sync_dnodes() at dmu_objset_sync_dnodes+0x2b/frame 0xfffffe0859e4e6e0
>> dmu_objset_sync() at dmu_objset_sync+0x29e/frame 0xfffffe0859e4e7b0
>> dsl_pool_sync() at dsl_pool_sync+0x348/frame 0xfffffe0859e4e820
>> spa_sync() at spa_sync+0x442/frame 0xfffffe0859e4e910
>> txg_sync_thread() at txg_sync_thread+0x23d/frame 0xfffffe0859e4e9f0
>> fork_exit() at fork_exit+0x84/frame 0xfffffe0859e4ea30
>> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0859e4ea30
>> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>> KDB: enter: panic
>> 
>> 
>> 
>> (kgdb) bt
>> #0  doadump (textdump=0) at pcpu.h:221
>> #1  0xffffffff8037bb86 in db_fncall (dummy1=<value optimized out>,
>>     dummy2=<value optimized out>, dummy3=<value optimized out>,
>>     dummy4=<value optimized out>) at /usr/src/sys/ddb/db_command.c:568
>> #2  0xffffffff8037b941 in db_command (cmd_table=0x0)
>>     at /usr/src/sys/ddb/db_command.c:440
>> #3  0xffffffff8037b5d4 in db_command_loop ()
>>     at /usr/src/sys/ddb/db_command.c:493
>> #4  0xffffffff8037e18b in db_trap (type=<value optimized out>, code=0)
>>     at /usr/src/sys/ddb/db_main.c:251
>> #5  0xffffffff80a5b294 in kdb_trap (type=3, code=0, tf=<value optimized out>)
>>     at /usr/src/sys/kern/subr_kdb.c:654
>> #6  0xffffffff80e6a4b1 in trap (frame=0xfffffe0859e4e410)
>>     at /usr/src/sys/amd64/amd64/trap.c:540
>> #7  0xffffffff80e49f22 in calltrap ()
>>     at /usr/src/sys/amd64/amd64/exception.S:235
>> #8  0xffffffff80a5a96e in kdb_enter (why=0xffffffff81379010 "panic",
>>     msg=0xffffffff80a60b60 "UH\211\ufffdAWAVATSH\203\ufffdPI\211\ufffdA\211\ufffdH\213\004%Py\ufffd\201H\211E\ufffd\201<%\ufffd\210\ufffd\201") at cpufunc.h:63
>> #9  0xffffffff80a1e2c9 in vpanic (fmt=<value optimized out>,
>>     ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:619
>> #10 0xffffffff80a1e333 in panic (fmt=0xffffffff81aafa90 "\004")
>>     at /usr/src/sys/kern/kern_shutdown.c:557
>> ---Type <return> to continue, or q <return> to quit---
>> #11 0xffffffff8240922a in assfail (a=<value optimized out>,
>>     f=<value optimized out>, l=<value optimized out>)
>>     at /usr/src/sys/cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:81
>> #12 0xffffffff820d4f78 in dnode_sync (dn=0xfffff8040b72d3d0,
>>     tx=0xfffff8001598ec00)
>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c:495
>> #13 0xffffffff820c922b in dmu_objset_sync_dnodes (list=0xfffff80007712b90,
>>     newlist=<value optimized out>, tx=<value optimized out>)
>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1045
>> #14 0xffffffff820c8ede in dmu_objset_sync (os=0xfffff80007712800,
>>     pio=<value optimized out>, tx=0xfffff8001598ec00)
>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1163
>> #15 0xffffffff820e8e78 in dsl_pool_sync (dp=0xfffff80015676000, txg=2660975)
>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:536
>> #16 0xffffffff8210dca2 in spa_sync (spa=0xfffffe00089c6000, txg=2660975)
>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:6641
>> #17 0xffffffff8211843d in txg_sync_thread (arg=0xfffff80015676000)
>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c:517
>> #18 0xffffffff809e47c4 in fork_exit (
>>     callout=0xffffffff82118200 <txg_sync_thread>, arg=0xfffff80015676000,
>>     frame=0xfffffe0859e4ea40) at /usr/src/sys/kern/kern_fork.c:1006
>> ---Type <return> to continue, or q <return> to quit---
>> #19 0xffffffff80e4a45e in fork_trampoline ()
>>     at /usr/src/sys/amd64/amd64/exception.S:610
>> #20 0x0000000000000000 in ?? ()
>> Current language:  auto; currently minimal
>> 
>> _______________________________________________
>> freebsd-fs@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>> 
> 
> 
> -- 
> Xin LI <delphij@delphij.net>    https://www.delphij.net/
> FreeBSD - The Power to Serve!   Live free or die
> 
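
P.S. For anyone reading along without the source handy, the loop Xin Li is
asking about would look roughly like the sketch below. This is only an
illustration of the idea (wait for dn->dn_dbufs to drain, yielding with the
mutex dropped between checks), not the actual pre-5056 code and not a
proposed patch; the helper name dnode_wait_for_dbufs() is made up for the
example, and it assumes the usual ZFS internal headers (sys/dnode.h, etc.).

/*
 * Hypothetical sketch: block until every dbuf has been removed from
 * dn->dn_dbufs.  Dbufs that dnode_evict_dbufs() left in DB_EVICTING
 * state are released asynchronously via dbuf_rele_and_unlock(), so
 * drop dn_dbufs_mtx and yield between checks to let that finish.
 */
static void
dnode_wait_for_dbufs(dnode_t *dn)
{
	mutex_enter(&dn->dn_dbufs_mtx);
	while (!avl_is_empty(&dn->dn_dbufs)) {
		mutex_exit(&dn->dn_dbufs_mtx);
		delay(1);		/* yield briefly */
		mutex_enter(&dn->dn_dbufs_mtx);
	}
	mutex_exit(&dn->dn_dbufs_mtx);
}

Calling something like this just before the assertion at dnode_sync.c:495
would make the panic go away, but it reintroduces exactly the kind of
blocking hack described above. If the straggling dbufs are guaranteed to be
released once their refcounts drop, simply removing (or relaxing) the
assertion is the simpler fix - which is why a core dump showing what is
still on the tree matters.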