Date: Sat, 10 Sep 2011 04:48:41 +0900 (JST)
From: Hiroki Sato <hrs@FreeBSD.org>
To: pjd@FreeBSD.org, mm@FreeBSD.org, freebsd-stable@FreeBSD.org
Cc: attilio@FreeBSD.org, kib@FreeBSD.org
Subject: ZFS panic on a RELENG_8 NFS server (Was: panic: spin lock held too long (RELENG_8 from today))
Message-ID: <20110910.044841.232160047547388224.hrs@allbsd.org>
In-Reply-To: <20110907.094717.2272609566853905102.hrs@allbsd.org>
References: <20110903.071908.971549835606878048.hrs@allbsd.org> <CAJ-FndAChGndC=LkZNi7i6mOt%2BSpw3-OftO9rH0%2B5WNnVWzuBw@mail.gmail.com> <20110907.094717.2272609566853905102.hrs@allbsd.org>

Hiroki Sato <hrs@freebsd.org> wrote
  in <20110907.094717.2272609566853905102.hrs@allbsd.org>:

hr>  During this investigation a disk had to be replaced and resilvering
hr>  it is now in progress.  A deadlock and a forced reboot after that
hr>  make recovering of the zfs datasets take a long time (for committing
hr>  logs, I think), so I will try to reproduce the deadlock and get a
hr>  core dump after it finished.

 I think I could reproduce the symptoms.  I cannot tell whether they
 are exactly the same as what occurred on my box before, because the
 kernel was replaced with one built with some debugging options, but
 they are reproducible at least.

 There are two symptoms.  One is a panic.  The DDB output at the time
 of the panic is the following:

----
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x100000040
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8065b926
stack pointer           = 0x28:0xffffff8257b94d70
frame pointer           = 0x28:0xffffff8257b94e10
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 992 (nfsd: service)
[thread pid 992 tid 100586 ]
Stopped at      witness_checkorder+0x246:       movl    0x40(%r13),%ebx
db> bt
Tracing pid 992 tid 100586 td 0xffffff00595d9000
witness_checkorder() at witness_checkorder+0x246
_sx_slock() at _sx_slock+0x35
dmu_bonus_hold() at dmu_bonus_hold+0x57
zfs_zget() at zfs_zget+0x237
zfs_dirent_lock() at zfs_dirent_lock+0x488
zfs_dirlook() at zfs_dirlook+0x69
zfs_lookup() at zfs_lookup+0x26b
zfs_freebsd_lookup() at zfs_freebsd_lookup+0x81
vfs_cache_lookup() at vfs_cache_lookup+0xf0
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x40
lookup() at lookup+0x384
nfsvno_namei() at nfsvno_namei+0x268
nfsrvd_lookup() at nfsrvd_lookup+0xd6
nfsrvd_dorpc() at nfsrvd_dorpc+0x745
nfssvc_program() at nfssvc_program+0x447
svc_run_internal() at svc_run_internal+0x51b
svc_thread_start() at svc_thread_start+0xb
fork_exit() at fork_exit+0x11d
fork_trampoline() at fork_trampoline+0xe
--- trap 0xc, rip = 0x8006a031c, rsp = 0x7fffffffe6c8, rbp = 0x6 ---
----

 The complete output can be found at:

  http://people.allbsd.org/~hrs/zfs_panic_20110909_1/pool-zfs-20110909-1.txt

 The other is a hang on ZFS access.  The kernel keeps running without
 a panic, but any access to ZFS datasets makes the accessing program
 unresponsive.  The DDB output can be found at:

  http://people.allbsd.org/~hrs/zfs_panic_20110909_2/pool-zfs-20110909-2.txt

 The trigger for both was some access to a ZFS dataset from the NFS
 clients.  Because the access pattern was complex I could not narrow
 down the culprit, but it seems timing-dependent, and simply doing
 "rm -rf" locally on the server can sometimes trigger them.

 The crash dump and the kernel can be found at the following URLs:

  panic:
    http://people.allbsd.org/~hrs/zfs_panic_20110909_1/

  no panic but unresponsive:
    http://people.allbsd.org/~hrs/zfs_panic_20110909_2/

  kernel:
    http://people.allbsd.org/~hrs/zfs_panic_20110909_kernel/

-- Hiroki
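
 The fault address and the faulting instruction in the trap output
 above are related by simple arithmetic.  The following is a minimal
 Python sketch, assuming the page fault comes from the load shown at
 the stop address (movl 0x40(%r13),%ebx); the function name is
 illustrative only and does not come from any FreeBSD tool.  It only
 recovers the value %r13 must have held, and says nothing about how
 that value got there.

```python
# Values copied from the DDB trap output in this message.
FAULT_ADDRESS = 0x100000040   # "fault virtual address"
DISPLACEMENT = 0x40           # offset in "movl 0x40(%r13),%ebx"


def base_register_value(fault_address: int, displacement: int) -> int:
    """Return the base register value implied by base+displacement addressing.

    Assumption: the fault was raised by the memory operand of the
    instruction shown at the stop address, so
    fault_address == base + displacement.
    """
    return fault_address - displacement


r13 = base_register_value(FAULT_ADDRESS, DISPLACEMENT)

# Split the implied pointer into its 32-bit halves for inspection.
upper_half = r13 >> 32
lower_half = r13 & 0xFFFFFFFF

print(f"implied %r13 = {r13:#x}")
print(f"upper 32 bits = {upper_half:#x}, lower 32 bits = {lower_half:#x}")
```

 Under that assumption %r13 was 0x100000000, which is far below the
 0xffffff... range of the stack and frame pointers in the same trap
 output.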