Message-ID: <5075FA8E.10200@FreeBSD.org>
Date: Thu, 11 Oct 2012 01:45:34 +0300
From: Andriy Gapon
To: Sean Chittenden, Pawel Jakub Dawidek
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS crashing during snapdir lookup for non-existent snapshot...
References: <5075E3E0.7060706@FreeBSD.org> <0A6567E7-3BA5-4F27-AEB2-1C00EDE00641@chittenden.org> <5075EDDD.4030008@FreeBSD.org>

[restoring mailing list cc]

on 11/10/2012 00:58 Sean Chittenden said the following:
>>> I don't have a dump from this particular system, only the backtrace from the crash. The system is ZFS only and I only have a ZFS swapdir. :-/
>>>
>>> I have the kernel still so I can poke at the code and the compiled kernel (kernel.symbols). What are you looking for? -sc
>>>
>>
>> list *zfsctl_snapdir_lookup+0x124 in kgdb
>
> (kgdb) list *zfsctl_snapdir_lookup+0x124
> 0xffffffff816e9384 is in zfsctl_snapdir_lookup (/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:992).
> 987                     *direntflags = ED_CASE_CONFLICT;
> 988     #endif
> 989             }
> 990
> 991             mutex_enter(&sdp->sd_lock);
> 992             search.se_name = (char *)nm;
> 993             if ((sep = avl_find(&sdp->sd_snaps, &search, &where)) != NULL) {
> 994                     *vpp = sep->se_root;
> 995                     VN_HOLD(*vpp);
> 996                     err = traverse(vpp, LK_EXCLUSIVE | LK_RETRY);

It seems that the problem is a Solaris-ism that remained in the code.
I think that zfsctl_snapdir_inactive should not destroy sdp; that should be the job of vop_reclaim. Otherwise, if the vnode is re-activated, its v_data points to freed memory.

>>> On Oct 10, 2012, at 14:08 PM, Andriy Gapon wrote:
>>>
>>>> on 10/10/2012 23:57 Sean Chittenden said the following:
>>>>> Using a FreeBSD -STABLE build from 2012-09-17, I now have the ability to crash FreeBSD/ZFS within a few hours of stress testing. It appears as though there's a locking problem when attempting to interrogate stats on a ZFS snapshot that doesn't exist any more. I believe the scenario is as follows:
>>>>>
>>>>> Background:
>>>>>
>>>>> *) `zfs set snapdir=visible` /was/ set on a data set
>>>>>
>>>>> *) Snapshots were being run once an hour for weeks, long enough for zabbix to auto-discover the snapshots as valid file systems.
>>>>>
>>>>> *) `zfs inherit snapdir` was recently set (about a week ago), but zabbix is still attempting to inquire about snapshots that are no longer visible or no longer exist.
>>>>>
>>>>>
>>>>> After snapshots were deleted through the normal process of aging, zabbix is still interrogating the file system, attempting to acquire information about the now deleted snapshots.
>>>>>
>>>>> FreeBSD crashes once every few minutes when zabbix is running and pulling ZFS information about the now hidden (or most likely deleted) snapshots. I believe that zabbix is using getfsspec(3) with the now stale snapshot name in rapid succession and is somehow triggering a race when there are two concurrent calls to two different non-existent snapshots.
>>>>>
>>>>> -sc
>>>>>
>>>>>
>>>>> kernel: Fatal trap 12: page fault while in kernel mode
>>>>> kernel: cpuid = 0; apic id = 00
>>>>> kernel: fault virtual address   = 0x368
>>>>> kernel: fault code              = supervisor read data, page not present
>>>>> kernel: instruction pointer     = 0x20:0xffffffff80922be2
>>>>> kernel: stack pointer           = 0x28:0xffffff8487d7b0d0
>>>>> kernel: frame pointer           = 0x28:0xffffff8487d7b170
>>>>> kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
>>>>> kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
>>>>> kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
>>>>> kernel: current process         = 3536 (zabbix_agentd)
>>>>> kernel: trap number             = 12
>>>>> kernel: panic: page fault
>>>>> kernel: cpuid = 0
>>>>> kernel: KDB: stack backtrace:
>>>>> kernel: #0 0xffffffff80950800 at kdb_backtrace+0x60
>>>>> kernel: #1 0xffffffff8091ac2d at panic+0x1fd
>>>>> kernel: #2 0xffffffff80c21858 at trap_fatal+0x388
>>>>> kernel: #3 0xffffffff80c21b23 at trap_pfault+0x2b3
>>>>> kernel: #4 0xffffffff80c212b5 at trap+0x5b5
>>>>> kernel: #5 0xffffffff80c0ba22 at calltrap+0x8
>>>>> kernel: #6 0xffffffff8092271e at _sx_xlock+0x5e
>>>>> kernel: #7 0xffffffff816e9384 at zfsctl_snapdir_lookup+0x124
>>>>> kernel: #8 0xffffffff80cb385f at VOP_LOOKUP_APV+0x5f
>>>>> kernel: #9 0xffffffff809a307f at lookup+0x5ef
>>>>> kernel: #10 0xffffffff809a263d at namei+0x62d
>>>>> kernel: #11 0xffffffff809b2b39 at kern_statfs+0x89
>>>>> kernel: #12 0xffffffff809b2a80 at sys_statfs+0x20
>>>>> kernel: #13 0xffffffff80c22134 at amd64_syscall+0x334
>>>>>
>>>>> FreeBSD example.com 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #1: Mon Sep 17 04:34:37 UTC 2012 root@example.com:/usr/obj/usr/src/sys/GENERIC amd64
>>>>>
>>>>> 0xffffffff80922be2 is in _sx_xlock_hard (/usr/src/sys/kern/kern_sx.c:546).
>>>>> 541             x = sx->sx_lock;
>>>>> 542             if ((sx->lock_object.lo_flags & SX_NOADAPTIVE) == 0) {
>>>>> 543                     if ((x & SX_LOCK_SHARED) == 0) {
>>>>> 544                             x = SX_OWNER(x);
>>>>> 545                             owner = (struct thread *)x;
>>>>> 546                             if (TD_IS_RUNNING(owner)) {
>>>>> 547                                     if (LOCK_LOG_TEST(&sx->lock_object, 0))
>>>>> 548                                             CTR3(KTR_LOCK,
>>>>> 549                                                 "%s: spinning on %p held by %p",
>>>>> 550                                                 __func__, sx, owner);
>>>>>
>>>>
>>>> Could you please rather list frame #7 (zfsctl_snapdir_lookup+0x124)?
>>>>
>>>> --
>>>> Andriy Gapon
>>>>
>>>
>>> --
>>> Sean Chittenden
>>> sean@chittenden.org
>>>
>>
>> --
>> Andriy Gapon
>>
>
> --
> Sean Chittenden
> sean@chittenden.org
>

--
Andriy Gapon
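
For readers following the diagnosis in this message (the inactive routine should leave sdp alone, and reclaim should be the one to free it), the lifecycle can be illustrated with a small userspace sketch in C. This is only a toy model under stated assumptions: toy_vnode, toy_snapdir and every toy_* function are hypothetical names made up for illustration. It is not the code in zfs_ctldir.c and not a proposed patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-snapdir private data ("sdp" in the thread). */
struct toy_snapdir {
    int lock_depth;              /* stands in for sd_lock */
};

/* Hypothetical stand-in for a vnode; only the fields this sketch needs. */
struct toy_vnode {
    void *v_data;                /* filesystem-private data */
    int   usecount;              /* active references */
    bool  reclaimed;             /* true once the vnode has been recycled */
};

/* Allocate a vnode together with its private data; returns NULL on failure. */
static struct toy_vnode *toy_vnode_create(void)
{
    struct toy_vnode *vp = calloc(1, sizeof(*vp));

    if (vp == NULL)
        return NULL;
    vp->v_data = calloc(1, sizeof(struct toy_snapdir));
    if (vp->v_data == NULL) {
        free(vp);
        return NULL;
    }
    vp->usecount = 1;
    return vp;
}

/*
 * The last active reference went away. The vnode may be re-activated
 * later, so v_data is deliberately left intact here.
 */
static void toy_inactive(struct toy_vnode *vp)
{
    vp->usecount = 0;
}

/* A later lookup picks the same vnode up again. */
static void toy_reactivate(struct toy_vnode *vp)
{
    vp->usecount++;
}

/* The vnode is being recycled: this is the one place that frees v_data. */
static void toy_reclaim(struct toy_vnode *vp)
{
    free(vp->v_data);
    vp->v_data = NULL;
    vp->reclaimed = true;
}

/* Takes the "lock" inside v_data; returns 0 on success, -1 if reclaimed. */
static int toy_lookup(struct toy_vnode *vp)
{
    struct toy_snapdir *sdp;

    if (vp->reclaimed || vp->v_data == NULL)
        return -1;
    sdp = vp->v_data;
    sdp->lock_depth++;           /* plays the role of mutex_enter(&sdp->sd_lock) */
    /* ... the snapshot search would happen here ... */
    sdp->lock_depth--;           /* plays the role of mutex_exit(&sdp->sd_lock) */
    return 0;
}

/* Release the toy vnode itself, reclaiming first if that has not happened. */
static void toy_vnode_destroy(struct toy_vnode *vp)
{
    if (!vp->reclaimed)
        toy_reclaim(vp);
    free(vp);
}
```

In this model, calling toy_inactive() followed by toy_reactivate() leaves toy_lookup() safe, because v_data is still valid. Had toy_inactive() freed v_data instead, the same sequence would take a lock inside freed memory, which is the kind of failure frame #7 of the backtrace suggests.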