From owner-freebsd-stable@FreeBSD.ORG Fri Sep 9 20:10:46 2011
Date: Sat, 10 Sep 2011 04:48:41 +0900 (JST)
Message-Id: <20110910.044841.232160047547388224.hrs@allbsd.org>
To: pjd@FreeBSD.org, mm@FreeBSD.org, freebsd-stable@FreeBSD.org
From: Hiroki Sato
In-Reply-To: <20110907.094717.2272609566853905102.hrs@allbsd.org>
References: <20110903.071908.971549835606878048.hrs@allbsd.org> <20110907.094717.2272609566853905102.hrs@allbsd.org>
Cc: attilio@FreeBSD.org, kib@FreeBSD.org
Subject: ZFS panic on a RELENG_8 NFS server (Was: panic: spin lock held too long (RELENG_8 from today))
List-Id: Production branch of FreeBSD source code

Hiroki Sato wrote
  in <20110907.094717.2272609566853905102.hrs@allbsd.org>:

hr> During this investigation an disk has to be replaced and resilvering
hr> it is now in progress. A deadlock and a forced reboot after that
hr> make recovering of the zfs datasets take a long time (for committing
hr> logs, I think), so I will try to reproduce the deadlock and get a
hr> core dump after it finished.

 I think I could reproduce the symptoms. I cannot tell whether these
 are exactly the same as what occurred on my box before, because the
 kernel was replaced with one built with some debugging options, but
 they are at least reproducible.

 There are two symptoms. One is a panic.
 The DDB output at the time of the panic is as follows:

----
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x100000040
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8065b926
stack pointer           = 0x28:0xffffff8257b94d70
frame pointer           = 0x28:0xffffff8257b94e10
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 992 (nfsd: service)
[thread pid 992 tid 100586 ]
Stopped at      witness_checkorder+0x246:       movl    0x40(%r13),%ebx
db> bt
Tracing pid 992 tid 100586 td 0xffffff00595d9000
witness_checkorder() at witness_checkorder+0x246
_sx_slock() at _sx_slock+0x35
dmu_bonus_hold() at dmu_bonus_hold+0x57
zfs_zget() at zfs_zget+0x237
zfs_dirent_lock() at zfs_dirent_lock+0x488
zfs_dirlook() at zfs_dirlook+0x69
zfs_lookup() at zfs_lookup+0x26b
zfs_freebsd_lookup() at zfs_freebsd_lookup+0x81
vfs_cache_lookup() at vfs_cache_lookup+0xf0
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x40
lookup() at lookup+0x384
nfsvno_namei() at nfsvno_namei+0x268
nfsrvd_lookup() at nfsrvd_lookup+0xd6
nfsrvd_dorpc() at nfsrvd_dorpc+0x745
nfssvc_program() at nfssvc_program+0x447
svc_run_internal() at svc_run_internal+0x51b
svc_thread_start() at svc_thread_start+0xb
fork_exit() at fork_exit+0x11d
fork_trampoline() at fork_trampoline+0xe
--- trap 0xc, rip = 0x8006a031c, rsp = 0x7fffffffe6c8, rbp = 0x6 ---
----

 The complete output can be found at:

  http://people.allbsd.org/~hrs/zfs_panic_20110909_1/pool-zfs-20110909-1.txt

 The other symptom is a hang on ZFS access. The kernel keeps running
 without a panic, but any program that accesses a ZFS dataset becomes
 unresponsive. The DDB output can be found at:

  http://people.allbsd.org/~hrs/zfs_panic_20110909_2/pool-zfs-20110909-2.txt

 The trigger for both was some access to a ZFS dataset from the NFS
 clients.
 Because the access pattern was complex, I could not narrow down the
 culprit, but the problem seems timing-dependent, and simply running
 "rm -rf" locally on the server can sometimes trigger both symptoms.

 The crash dumps and the kernel can be found at the following URLs:

  panic:
    http://people.allbsd.org/~hrs/zfs_panic_20110909_1/

  no panic but unresponsive:
    http://people.allbsd.org/~hrs/zfs_panic_20110909_2/

  kernel:
    http://people.allbsd.org/~hrs/zfs_panic_20110909_kernel/

-- Hiroki