Date: Wed, 24 Feb 2010 22:36:14 -0800
From: "Jake Holland" <jholland@fastsoft.com>
To: <freebsd-stable@freebsd.org>
Subject: vfs deadlock during panic?
Message-ID: <D13CB108B048BD47B69C0CA1E0B5C0320111335C@hq-es.FASTSOFT.COM>
Hello,

I'm running today's stable/8 at r204294 on a Dell R610 (2 processors, 4 cores per processor). I have turned on WITNESS and DEBUG_VFS_LOCKS because I've been trying to troubleshoot an occasional hang during panic, and I have reason to believe a vfs deadlock is occurring. Details below.

A little backstory, long story short: I first encountered a hang on panic in a heavily hacked version of 7.0 with a bunch of backports from various other FreeBSD branches. I wrote a module that panics on request in a cpu-pinned thread, and over the course of many runs it hung roughly every 20 panics when pinned to cpu 0. Some research turned up a very promising-looking patch, stable/8 r196198. I backported that, and it helped, but I still got a hang roughly every 50 panics.

So I pulled down stable/8 and found I'm still getting hangs during panic, though I'm not certain it's the same problem. With WITNESS and DEBUG_VFS_LOCKS turned on, I very frequently see lock order reversals under vfs, with both r203674 and r204294, which is from today (Feb 24, 2010). I saw very similar lock order reversals with release/8.0.0. (I changed vfs_badlock_ddb to 0 in kern/vfs_subr.c because I hit LORs early on basically every boot.)

So much for the backstory. Here are three backtraces I see on the serial port not long after most boots, from r204294 of stable/8:

1.
lock order reversal:
 1st 0xffffff015641d578 ufs (ufs) @ ufs/ffs/ffs_snapshot.c:423
 2nd 0xffffff819b4c28a8 bufwait (bufwait) @ kern/vfs_bio.c:2559
 3rd 0xffffff0010433098 ufs (ufs) @ ufs/ffs/ffs_snapshot.c:544
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
__lockmgr_args() at __lockmgr_args+0xd31
ffs_lock() at ffs_lock+0x9c
VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
_vn_lock() at _vn_lock+0x57
ffs_snapshot() at ffs_snapshot+0x1bb0
ffs_mount() at ffs_mount+0x666
vfs_donmount() at vfs_donmount+0xcde
nmount() at nmount+0x63
syscall() at syscall+0x102
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (378, FreeBSD ELF64, nmount), rip = 0x8007affdc, rsp = 0x7fffffffe9b8, rbp = 0x800a04d30 ---

2.

lock order reversal:
 1st 0xffffff819b4c28a8 bufwait (bufwait) @ kern/vfs_bio.c:2559
 2nd 0xffffff0156305230 snaplk (snaplk) @ ufs/ffs/ffs_snapshot.c:793
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
__lockmgr_args() at __lockmgr_args+0xd31
ffs_lock() at ffs_lock+0x9c
VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
_vn_lock() at _vn_lock+0x57
ffs_snapshot() at ffs_snapshot+0x1a7d
ffs_mount() at ffs_mount+0x666
vfs_donmount() at vfs_donmount+0xcde
nmount() at nmount+0x63
syscall() at syscall+0x102
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (378, FreeBSD ELF64, nmount), rip = 0x8007affdc, rsp = 0x7fffffffe9b8, rbp = 0x800a04d30 ---

3.
lock order reversal:
 1st 0xffffff0156305230 snaplk (snaplk) @ kern/vfs_vnops.c:296
 2nd 0xffffff015641d578 ufs (ufs) @ ufs/ffs/ffs_snapshot.c:1587
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
__lockmgr_args() at __lockmgr_args+0xd31
ffs_snapremove() at ffs_snapremove+0xe7
softdep_releasefile() at softdep_releasefile+0x139
ufs_inactive() at ufs_inactive+0x1a5
VOP_INACTIVE_APV() at VOP_INACTIVE_APV+0xb5
vinactive() at vinactive+0x90
vputx() at vputx+0x2de
vn_close() at vn_close+0x118
vn_closefile() at vn_closefile+0x5a
_fdrop() at _fdrop+0x23
closef() at closef+0x5b
kern_close() at kern_close+0x110
syscall() at syscall+0x102
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (6, FreeBSD ELF64, close), rip = 0x80084e94c, rsp = 0x7fffffffe9b8, rbp = 0 ---

And another interesting observation. During one of my runs with r203674 which hung on panic, I saw this:

login: retry 5: 2010-02-23.15:35:35 (had 3 in crash)
panic: /usr/src/sys/modules/jpanic/jpanic.c:168 panic requested prio 16, cpu 0, delay 500, curcpu 0
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
panic() at panic+0x182
jpanic_thread() at jpanic_thread+0x191
fork_exit() at fork_exit+0x12a
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff825964ad30, rbp = 0 ---
Uptime: 3m15s
Physical memory: 24554 MB
Dumping 2049 MB:VOP_STRATEGY: bp is not locked but should be
VOP_STRATEGY: bp is not locked but should be
 2034 2018 2002 1986 1970 1954 1938 1922 1906 1890 1874 1858 1842 1826 1810
 1794 1778 1762 1746 1730 1714 1698 1682 1666 1650 1634 1618 1602 1586 1570
 1554 1538 1522 1506 1490 1474 1458 1442 1426 1410 1394 1378 1362 1346 1330
 1314 1298 1282 1266 1250 1234 1218 1202 1186 1170 1154 1138 1122 1106 1090
 1074 1058 1042 1026 1010 994 978 962 946 930 914 898 882 866 850 834 818
 802 786 770 754 738 722 706 690 674 658
 642 626 610 594 578 562 546 530 514 498 482 466 450 434 418 402 386 370
 354 338 322 306 290 274 258 242 226 210 194 178 162 146 130 114 98 82 66
 50 34 18 2
Attempt to write outside dump device boundaries.
** DUMP FAILED (ERROR 6) **

(I get this error fairly often, by the way; maybe 25% of the time, as a rough estimate. I don't think the hangs only occur alongside this error, however.)

It was hung here for something like 25 minutes, until I pressed the power button, at which point it dumped again, this time successfully, and gave me a core that showed the following backtrace:

#0  doadump () at pcpu.h:223
#1  0xffffffff803debf3 in boot (howto=16644) at ../../../kern/kern_shutdown.c:416
#2  0xffffffff803df07c in panic (fmt=Variable "fmt" is not available.) at ../../../kern/kern_shutdown.c:579
#3  0xffffffff80454ed8 in bremfree (bp=0x0) at ../../../kern/vfs_bio.c:675
#4  0xffffffff8045afe5 in getblk (vp=0xffffff000be35000, blkno=753600, size=16384, slpflag=0, slptimeo=0, flags=0) at ../../../kern/vfs_bio.c:2581
#5  0xffffffff8045b6cf in breadn (vp=0xffffff000be35000, blkno=Variable "blkno" is not available.) at ../../../kern/vfs_bio.c:800
#6  0xffffffff8045b83e in bread (vp=Variable "vp" is not available.) at ../../../kern/vfs_bio.c:748
#7  0xffffffff80501c52 in ffs_update (vp=0xffffff0013268270, waitfor=0) at ../../../ufs/ffs/ffs_inode.c:104
#8  0xffffffff8051eafe in ufs_inactive (ap=Variable "ap" is not available.) at ../../../ufs/ufs/ufs_inode.c:158
#9  0xffffffff8060ad95 in VOP_INACTIVE_APV (vop=0xffffffff808a94a0, a=0xffffff80000878b0) at vnode_if.c:1863
#10 0xffffffff8046fc80 in vinactive (vp=0xffffff0013268270, td=0xffffff000b3b8ae0) at vnode_if.h:807
#11 0xffffffff8047101e in vputx (vp=0xffffff0013268270, func=2) at ../../../kern/vfs_subr.c:2214
#12 0xffffffff8047d2c8 in vn_close (vp=0xffffff0013268270, flags=Variable "flags" is not available.) at ../../../kern/vfs_vnops.c:303
#13 0xffffffff8047d3ba in vn_closefile (fp=0xffffff00132c38c0, td=0xffffff000b3b8ae0) at ../../../kern/vfs_vnops.c:938
#14 0xffffffff803a95e3 in _fdrop (fp=0xffffff00132c38c0, td=Variable "td" is not available.) at file.h:293
#15 0xffffffff803ab04b in closef (fp=0xffffff00132c38c0, td=0xffffff000b3b8ae0) at ../../../kern/kern_descrip.c:2117
#16 0xffffffff803ab630 in kern_close (td=0xffffff000b3b8ae0, fd=0) at ../../../kern/kern_descrip.c:1162
#17 0xffffffff805d0cd2 in syscall (frame=0xffffff8000087c80) at ../../../amd64/amd64/trap.c:1025
#18 0xffffffff805b6f21 in Xfast_syscall () at ../../../amd64/amd64/exception.S:373
#19 0x000000000045b64c in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb)

That particular sequence of events is extremely rare; ordinarily the power button does nothing from the hung state unless you hold it down long enough for a hard power-off.

Based on these observations, I suspect that some kind of vfs thread-safety bug is a key part of this hanging problem. I'm no vfs expert, so I'm a bit stuck. Can anyone help? I'm happy to answer any questions and provide any information I can.

My apologies if I've violated etiquette or if this is the wrong place for this (and if so, please correct me); I've never posted here before.

Thank you,
Jake
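P.S. In case it helps anyone reproduce this: the heart of my panic module is essentially the following. This is a from-memory sketch, not the actual jpanic.c (the real module also wires up the prio/delay/cpu knobs visible in the panic message above, which I've omitted here along with the trigger plumbing); the KPIs used (kproc_create, sched_bind, pause, panic) are the standard 8.x ones.

```c
/* Sketch only: panic on request from a kthread pinned to a given cpu.
 * Error handling and the sysctl/ioctl that triggers it are omitted. */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/kthread.h>
#include <sys/proc.h>
#include <sys/sched.h>

static void
jpanic_thread(void *arg)
{
	int cpu = (intptr_t)arg;

	/* Pin this kthread to the requested cpu before panicking;
	 * sched_bind() requires the thread lock to be held. */
	thread_lock(curthread);
	sched_bind(curthread, cpu);
	thread_unlock(curthread);

	pause("jpanic", hz / 2);	/* brief settle delay */
	panic("panic requested cpu %d, curcpu %d", cpu, curcpu);
}

static int
jpanic_start(int cpu)
{
	struct proc *p;

	/* Spawn the panicking thread as its own kernel process, which
	 * matches the fork_exit/fork_trampoline frames in the trace. */
	return (kproc_create(jpanic_thread, (void *)(intptr_t)cpu,
	    &p, 0, 0, "jpanic"));
}
```

The point of the cpu pinning is to check whether the hang depends on which cpu initiates the panic (for me it seemed tied to cpu 0).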