Date: Thu, 11 Oct 2012 01:45:34 +0300
From: Andriy Gapon <avg@FreeBSD.org>
To: Sean Chittenden <sean@chittenden.org>, Pawel Jakub Dawidek <pjd@FreeBSD.org>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@FreeBSD.org>
Subject: Re: ZFS crashing during snapdir lookup for non-existent snapshot...
Message-ID: <5075FA8E.10200@FreeBSD.org>
In-Reply-To: <A1901AB5-6E83-488E-9D29-EA7C4E3720F3@chittenden.org>
References: <B244C0E9-539D-4F7C-8616-378E8469F4BB@chittenden.org> <5075E3E0.7060706@FreeBSD.org> <0A6567E7-3BA5-4F27-AEB2-1C00EDE00641@chittenden.org> <5075EDDD.4030008@FreeBSD.org> <A1901AB5-6E83-488E-9D29-EA7C4E3720F3@chittenden.org>

[restoring mailing list cc]

on 11/10/2012 00:58 Sean Chittenden said the following:
>>> I don't have a dump from this particular system, only the backtrace from the crash. The system is ZFS only and I only have a ZFS swapdir. :-/
>>>
>>> I have the kernel still so I can poke at the code and the compiled kernel (kernel.symbols). What are you looking for? -sc
>>>
>>
>> list *zfsctl_snapdir_lookup+0x124 in kgdb
>
> (kgdb) list *zfsctl_snapdir_lookup+0x124
> 0xffffffff816e9384 is in zfsctl_snapdir_lookup (/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:992).
> 987                     *direntflags = ED_CASE_CONFLICT;
> 988     #endif
> 989             }
> 990
> 991             mutex_enter(&sdp->sd_lock);
> 992             search.se_name = (char *)nm;
> 993             if ((sep = avl_find(&sdp->sd_snaps, &search, &where)) != NULL) {
> 994                     *vpp = sep->se_root;
> 995                     VN_HOLD(*vpp);
> 996                     err = traverse(vpp, LK_EXCLUSIVE | LK_RETRY);

It seems that the problem is a Solaris-ism that remained in the code.
I think that zfsctl_snapdir_inactive should not destroy sdp; that should be the
job of vop_reclaim.  Otherwise, if the vnode is re-activated, its v_data points
to freed memory.

>>> On Oct 10, 2012, at 14:08 PM, Andriy Gapon <avg@FreeBSD.org> wrote:
>>>
>>>> on 10/10/2012 23:57 Sean Chittenden said the following:
>>>>> Using a FreeBSD -STABLE build from 2012-09-17, I now have the ability to crash FreeBSD/ZFS within a few hours of stress testing. It appears as though there's a locking problem when attempting to interrogate stats on a ZFS snapshot that doesn't exist any more. I believe the scenario is as follows:
>>>>>
>>>>> Background:
>>>>>
>>>>> *) `zfs set snapdir=visible` /was/ set on a data set
>>>>>
>>>>> *) Snapshots were being run once an hour for weeks, long enough for zabbix to auto-discover the snapshots as valid file systems.
>>>>>
>>>>> *) `zfs inherit snapdir` was recently set (about a week ago), but zabbix is still attempting to inquire about snapshots that are no longer visible or no longer exist.
>>>>>
>>>>>
>>>>> After snapshots were deleted through the normal process of aging, zabbix is still interrogating the file system attempting to acquire information about the now deleted snapshots.
>>>>>
>>>>> FreeBSD crashes once every few minutes when zabbix is running and pulling ZFS information about the now hidden (or most likely deleted) snapshots. I believe that zabbix is using getfsspec(3) with the now stale snapshot name in rapid succession and is somehow triggering a race when there are two concurrent calls to two different non-existent snapshots.
>>>>>
>>>>> -sc
>>>>>
>>>>>
>>>>> kernel: Fatal trap 12: page fault while in kernel mode
>>>>> kernel: cpuid = 0; apic id = 00
>>>>> kernel: fault virtual address   = 0x368
>>>>> kernel: fault code              = supervisor read data, page not present
>>>>> kernel: instruction pointer     = 0x20:0xffffffff80922be2
>>>>> kernel: stack pointer           = 0x28:0xffffff8487d7b0d0
>>>>> kernel: frame pointer           = 0x28:0xffffff8487d7b170
>>>>> kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
>>>>> kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
>>>>> kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
>>>>> kernel: current process         = 3536 (zabbix_agentd)
>>>>> kernel: trap number             = 12
>>>>> kernel: panic: page fault
>>>>> kernel: cpuid = 0
>>>>> kernel: KDB: stack backtrace:
>>>>> kernel: #0 0xffffffff80950800 at kdb_backtrace+0x60
>>>>> kernel: #1 0xffffffff8091ac2d at panic+0x1fd
>>>>> kernel: #2 0xffffffff80c21858 at trap_fatal+0x388
>>>>> kernel: #3 0xffffffff80c21b23 at trap_pfault+0x2b3
>>>>> kernel: #4 0xffffffff80c212b5 at trap+0x5b5
>>>>> kernel: #5 0xffffffff80c0ba22 at calltrap+0x8
>>>>> kernel: #6 0xffffffff8092271e at _sx_xlock+0x5e
>>>>> kernel: #7 0xffffffff816e9384 at zfsctl_snapdir_lookup+0x124
>>>>> kernel: #8 0xffffffff80cb385f at VOP_LOOKUP_APV+0x5f
>>>>> kernel: #9 0xffffffff809a307f at lookup+0x5ef
>>>>> kernel: #10 0xffffffff809a263d at namei+0x62d
>>>>> kernel: #11 0xffffffff809b2b39 at kern_statfs+0x89
>>>>> kernel: #12 0xffffffff809b2a80 at sys_statfs+0x20
>>>>> kernel: #13 0xffffffff80c22134 at amd64_syscall+0x334
>>>>>
>>>>> FreeBSD example.com 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #1: Mon Sep 17 04:34:37 UTC 2012 root@example.com:/usr/obj/usr/src/sys/GENERIC amd64
>>>>>
>>>>> 0xffffffff80922be2 is in _sx_xlock_hard (/usr/src/sys/kern/kern_sx.c:546).
>>>>> 541             x = sx->sx_lock;
>>>>> 542             if ((sx->lock_object.lo_flags & SX_NOADAPTIVE) == 0) {
>>>>> 543                     if ((x & SX_LOCK_SHARED) == 0) {
>>>>> 544                             x = SX_OWNER(x);
>>>>> 545                             owner = (struct thread *)x;
>>>>> 546                             if (TD_IS_RUNNING(owner)) {
>>>>> 547                                     if (LOCK_LOG_TEST(&sx->lock_object, 0))
>>>>> 548                                             CTR3(KTR_LOCK,
>>>>> 549                                                 "%s: spinning on %p held by %p",
>>>>> 550                                                 __func__, sx, owner);
>>>>>
>>>>
>>>> Could you please rather list frame #7 (zfsctl_snapdir_lookup+0x124)?
>>>>
>>>> --
>>>> Andriy Gapon
>>>>
>>>
>>>
>>> --
>>> Sean Chittenden
>>> sean@chittenden.org
>>>
>>
>>
>> --
>> Andriy Gapon
>>
>
>
> --
> Sean Chittenden
> sean@chittenden.org
>

--
Andriy Gapon
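
For illustration only, here is a minimal user-space C sketch of the lifecycle split suggested in the reply above: the "inactive" step leaves the private data attached to the vnode, and only the "reclaim" step frees it, so a re-activated vnode never sees a dangling v_data. Every model_* name is hypothetical; this is not the FreeBSD VFS API, not the ZFS structures, and not the actual fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the snapdir private data (not the ZFS type). */
struct model_snapdir {
    int sd_lock;    /* placeholder for the lock protecting the snapshot list */
};

/* Hypothetical stand-in for a vnode (not the FreeBSD struct vnode). */
struct model_vnode {
    int   usecount; /* number of active references */
    void *v_data;   /* filesystem-private data, here a model_snapdir */
};

/* Allocate a vnode with private data attached and one active reference. */
static struct model_vnode *model_vnode_create(void)
{
    struct model_vnode *vp = calloc(1, sizeof(*vp));
    if (vp == NULL)
        return NULL;
    vp->v_data = calloc(1, sizeof(struct model_snapdir));
    if (vp->v_data == NULL) {
        free(vp);
        return NULL;
    }
    vp->usecount = 1;
    return vp;
}

/*
 * Last active reference dropped. The vnode may still be re-activated by a
 * later lookup, so the private data is deliberately left in place.
 */
static void model_inactive(struct model_vnode *vp)
{
    vp->usecount = 0;
}

/* A later lookup re-activates the vnode and uses its private data. */
static struct model_snapdir *model_lookup(struct model_vnode *vp)
{
    vp->usecount++;
    return vp->v_data;
}

/* The vnode is being recycled: the only place the private data is freed. */
static void model_reclaim(struct model_vnode *vp)
{
    free(vp->v_data);
    vp->v_data = NULL;
}
```

In this model, calling model_inactive() followed by model_lookup() still returns valid memory. If model_inactive() freed v_data instead, the same sequence would hand the caller a freed pointer, which matches the failure described above where the lock inside sdp is taken on freed memory.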