From: Sean Chittenden
To: freebsd-fs@freebsd.org
Date: Wed, 10 Oct 2012 13:57:00 -0700
Subject: ZFS crashing during snapdir lookup for non-existent snapshot...

Using a FreeBSD -STABLE build from 2012-09-17, I now have the ability to crash FreeBSD/ZFS within a few hours of stress testing. It appears as though there's a locking problem when attempting to interrogate stats on a ZFS snapshot that doesn't exist any more. I believe the scenario is as follows:

Background:

*) `zfs set snapdir=visible` /was/ set on a data set
*) Snapshots were being run once an hour for weeks, long enough for zabbix to auto-discover the snapshots as valid file systems.
*) `zfs inherit snapdir` was recently set (about a week ago), but zabbix is still attempting to inquire about snapshots that are no longer visible or no longer exist.

After the snapshots were deleted through the normal process of aging, zabbix kept interrogating the file system, attempting to acquire information about the now-deleted snapshots.

FreeBSD crashes once every few minutes when zabbix is running and pulling ZFS information about the now hidden (or most likely deleted) snapshots. I believe that zabbix is using getfsspec(3) with the now stale snapshot name in rapid succession and is somehow triggering a race when there are two concurrent calls to two different non-existent snapshots. -sc

kernel: Fatal trap 12: page fault while in kernel mode
kernel: cpuid = 0; apic id = 00
kernel: fault virtual address = 0x368
kernel: fault code = supervisor read data, page not present
kernel: instruction pointer = 0x20:0xffffffff80922be2
kernel: stack pointer = 0x28:0xffffff8487d7b0d0
kernel: frame pointer = 0x28:0xffffff8487d7b170
kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
kernel: processor eflags = interrupt enabled, resume, IOPL = 0
kernel: current process = 3536 (zabbix_agentd)
kernel: trap number = 12
kernel: panic: page fault
kernel: cpuid = 0
kernel: KDB: stack backtrace:
kernel: #0 0xffffffff80950800 at kdb_backtrace+0x60
kernel: #1 0xffffffff8091ac2d at panic+0x1fd
kernel: #2 0xffffffff80c21858 at trap_fatal+0x388
kernel: #3 0xffffffff80c21b23 at trap_pfault+0x2b3
kernel: #4 0xffffffff80c212b5 at trap+0x5b5
kernel: #5 0xffffffff80c0ba22 at calltrap+0x8
kernel: #6 0xffffffff8092271e at _sx_xlock+0x5e
kernel: #7 0xffffffff816e9384 at zfsctl_snapdir_lookup+0x124
kernel: #8 0xffffffff80cb385f at VOP_LOOKUP_APV+0x5f
kernel: #9 0xffffffff809a307f at lookup+0x5ef
kernel: #10 0xffffffff809a263d at namei+0x62d
kernel: #11 0xffffffff809b2b39 at kern_statfs+0x89
kernel: #12 0xffffffff809b2a80 at sys_statfs+0x20
kernel: #13 0xffffffff80c22134 at amd64_syscall+0x334

FreeBSD example.com 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #1: Mon Sep 17 04:34:37 UTC 2012 root@example.com:/usr/obj/usr/src/sys/GENERIC amd64

0xffffffff80922be2 is in _sx_xlock_hard (/usr/src/sys/kern/kern_sx.c:546).
541         x = sx->sx_lock;
542         if ((sx->lock_object.lo_flags & SX_NOADAPTIVE) == 0) {
543                 if ((x & SX_LOCK_SHARED) == 0) {
544                         x = SX_OWNER(x);
545                         owner = (struct thread *)x;
546                         if (TD_IS_RUNNING(owner)) {
547                                 if (LOCK_LOG_TEST(&sx->lock_object, 0))
548                                         CTR3(KTR_LOCK,
549                                             "%s: spinning on %p held by %p",
550                                             __func__, sx, owner);

--
Sean Chittenden
sean@chittenden.org
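
For anyone who wants to try this without zabbix, the suspected trigger (two concurrent lookups of two different snapshot names that no longer exist) could be approximated with a small stress program. What follows is only a sketch under assumptions, not a confirmed reproducer: the names `probe_loop` and `probe_pair` are made up for illustration, threads stand in for whatever concurrency zabbix_agentd really uses, and statfs(2) is called directly because that is the syscall shown in the backtrace.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/param.h>
#ifdef __linux__
#include <sys/vfs.h>    /* statfs(2) is declared here on Linux */
#else
#include <sys/mount.h>  /* statfs(2) is declared here on FreeBSD */
#endif

/* Per-thread state for one probing loop (hypothetical helper type). */
struct probe {
    const char *path;   /* path of a snapshot that no longer exists */
    int iterations;     /* how many statfs(2) calls to issue */
    int enoent;         /* calls that failed cleanly with ENOENT */
};

/* Call statfs(2) on the same stale path over and over. */
static void *probe_loop(void *arg)
{
    struct probe *p = arg;
    struct statfs sfs;

    for (int i = 0; i < p->iterations; i++) {
        if (statfs(p->path, &sfs) == -1 && errno == ENOENT)
            p->enoent++;
    }
    return NULL;
}

/*
 * Run two probing loops concurrently, one per path.
 * Returns the total number of ENOENT results from both threads,
 * or -1 if a thread could not be started.
 */
int probe_pair(const char *a, const char *b, int iterations)
{
    struct probe p[2] = { { a, iterations, 0 }, { b, iterations, 0 } };
    pthread_t tid[2];

    if (pthread_create(&tid[0], NULL, probe_loop, &p[0]) != 0)
        return -1;
    if (pthread_create(&tid[1], NULL, probe_loop, &p[1]) != 0) {
        pthread_join(tid[0], NULL);
        return -1;
    }
    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);
    return p[0].enoent + p[1].enoent;
}
```

Calling `probe_pair()` with two paths under `<mountpoint>/.zfs/snapshot/` that name deleted snapshots, and a large iteration count, would mimic the load described above (build with `-pthread`). On a healthy system every call should simply fail with ENOENT, so the return value should equal twice the iteration count.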