Date:      Thu, 11 Oct 2012 01:45:34 +0300
From:      Andriy Gapon <avg@FreeBSD.org>
To:        Sean Chittenden <sean@chittenden.org>, Pawel Jakub Dawidek <pjd@FreeBSD.org>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@FreeBSD.org>
Subject:   Re: ZFS crashing during snapdir lookup for non-existent snapshot...
Message-ID:  <5075FA8E.10200@FreeBSD.org>
In-Reply-To: <A1901AB5-6E83-488E-9D29-EA7C4E3720F3@chittenden.org>
References:  <B244C0E9-539D-4F7C-8616-378E8469F4BB@chittenden.org> <5075E3E0.7060706@FreeBSD.org> <0A6567E7-3BA5-4F27-AEB2-1C00EDE00641@chittenden.org> <5075EDDD.4030008@FreeBSD.org> <A1901AB5-6E83-488E-9D29-EA7C4E3720F3@chittenden.org>

[restoring mailing list cc]

on 11/10/2012 00:58 Sean Chittenden said the following:
>>> I don't have a dump from this particular system, only the backtrace from the crash. The system is ZFS only and I only have a ZFS swapdir. :-/
>>>
>>> I still have the kernel, so I can poke at the code and the compiled kernel (kernel.symbols). What are you looking for? -sc
>>>
>>
>> list *zfsctl_snapdir_lookup+0x124 in kgdb
> 
> (kgdb) list *zfsctl_snapdir_lookup+0x124
> 0xffffffff816e9384 is in zfsctl_snapdir_lookup (/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:992).
> 987				*direntflags = ED_CASE_CONFLICT;
> 988	#endif
> 989		}
> 990	
> 991		mutex_enter(&sdp->sd_lock);
> 992		search.se_name = (char *)nm;
> 993		if ((sep = avl_find(&sdp->sd_snaps, &search, &where)) != NULL) {
> 994			*vpp = sep->se_root;
> 995			VN_HOLD(*vpp);
> 996			err = traverse(vpp, LK_EXCLUSIVE | LK_RETRY);

It seems that the problem is a Solaris-ism that remained in the code.
I think that zfsctl_snapdir_inactive should not destroy sdp; that should be the
job of vop_reclaim.  Otherwise, if the vnode is re-activated, its v_data points
to freed memory.
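
To illustrate the lifecycle point, here is a toy user-space sketch, not the
actual fix.  All names (toy_vnode, toy_snapdir, toy_inactive, toy_reclaim and
the helpers) are hypothetical stand-ins rather than the real kernel structures
or VOP signatures.  It assumes only what is said above: a vnode can be
re-activated after "inactive", so its private data has to survive until
"reclaim".

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not the real definitions. */
struct toy_snapdir {
	int sd_nsnaps;			/* stands in for sd_lock, sd_snaps, ... */
};

struct toy_vnode {
	struct toy_snapdir *v_data;	/* filesystem-private data */
	int v_usecount;			/* active references */
};

/* Create a cached vnode with its private data attached. */
static struct toy_vnode *
toy_vnode_create(void)
{
	struct toy_vnode *vp = calloc(1, sizeof(*vp));

	assert(vp != NULL);
	vp->v_data = calloc(1, sizeof(*vp->v_data));
	assert(vp->v_data != NULL);
	return (vp);
}

/*
 * "inactive": the last active reference went away, but the vnode stays
 * cached and may be looked up again, so v_data must be left alone.
 */
static void
toy_inactive(struct toy_vnode *vp)
{
	(void)vp;	/* deliberately no free(vp->v_data) here */
}

/* "reclaim": the vnode is being recycled; only now release v_data. */
static void
toy_reclaim(struct toy_vnode *vp)
{
	free(vp->v_data);
	vp->v_data = NULL;
}

/* Take an active reference (this is also how a cached vnode is re-activated). */
static void
toy_ref(struct toy_vnode *vp)
{
	vp->v_usecount++;
}

/* Drop an active reference; the last drop triggers "inactive". */
static void
toy_rele(struct toy_vnode *vp)
{
	assert(vp->v_usecount > 0);
	if (--vp->v_usecount == 0)
		toy_inactive(vp);
}

/* Recycle an unused vnode: reclaim private data, then free the vnode. */
static void
toy_recycle(struct toy_vnode *vp)
{
	assert(vp->v_usecount == 0);
	toy_reclaim(vp);
	assert(vp->v_data == NULL);
	free(vp);
}
```

In this model a toy_ref() that follows the last toy_rele() still finds valid
v_data.  If toy_inactive() freed it instead, the next lookup would take a lock
inside freed memory, which is consistent with the fault under _sx_xlock in the
backtrace.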


>>> On Oct 10, 2012, at 14:08 PM, Andriy Gapon <avg@FreeBSD.org> wrote:
>>>
>>>> on 10/10/2012 23:57 Sean Chittenden said the following:
>>>>> Using a FreeBSD -STABLE build from 2012-09-17, I now have the ability to crash FreeBSD/ZFS within a few hours of stress testing. It appears as though there's a locking problem when attempting to interrogate stats on a ZFS snapshot that doesn't exist any more. I believe the scenario is as follows:
>>>>>
>>>>> Background:
>>>>>
>>>>> *) `zfs set snapdir=visible` /was/ set on a data set
>>>>>
>>>>> *) Snapshots were being run once an hour for weeks, long enough for zabbix to auto-discover the snapshots as valid file systems.
>>>>>
>>>>> *) `zfs inherit snapdir` was recently set (about a week ago), but zabbix is still attempting to inquire about snapshots that are no longer visible or no longer exist.
>>>>>
>>>>>
>>>>> After snapshots were deleted through the normal process of aging, zabbix is still interrogating the file system attempting to acquire information about the now deleted snapshots.
>>>>>
>>>>> FreeBSD crashes once every few minutes when zabbix is running and pulling ZFS information about the now hidden (or most likely deleted) snapshots. I believe that zabbix is using getfsspec(3) with the now stale snapshot name in rapid succession and is somehow triggering a race when there are two concurrent calls to two different non-existent snapshots.
>>>>>
>>>>> -sc
>>>>>
>>>>>
>>>>> kernel: Fatal trap 12: page fault while in kernel mode
>>>>> kernel: cpuid = 0; apic id = 00
>>>>> kernel: fault virtual address    = 0x368
>>>>> kernel: fault code               = supervisor read data, page not present
>>>>> kernel: instruction pointer      = 0x20:0xffffffff80922be2
>>>>> kernel: stack pointer            = 0x28:0xffffff8487d7b0d0
>>>>> kernel: frame pointer            = 0x28:0xffffff8487d7b170
>>>>> kernel: code segment             = base 0x0, limit 0xfffff, type 0x1b
>>>>> kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
>>>>> kernel: processor eflags = interrupt enabled, resume, IOPL = 0
>>>>> kernel: current process          = 3536 (zabbix_agentd)
>>>>> kernel: trap number              = 12
>>>>> kernel: panic: page fault
>>>>> kernel: cpuid = 0
>>>>> kernel: KDB: stack backtrace:
>>>>> kernel: #0 0xffffffff80950800 at kdb_backtrace+0x60
>>>>> kernel: #1 0xffffffff8091ac2d at panic+0x1fd
>>>>> kernel: #2 0xffffffff80c21858 at trap_fatal+0x388
>>>>> kernel: #3 0xffffffff80c21b23 at trap_pfault+0x2b3
>>>>> kernel: #4 0xffffffff80c212b5 at trap+0x5b5
>>>>> kernel: #5 0xffffffff80c0ba22 at calltrap+0x8
>>>>> kernel: #6 0xffffffff8092271e at _sx_xlock+0x5e
>>>>> kernel: #7 0xffffffff816e9384 at zfsctl_snapdir_lookup+0x124
>>>>> kernel: #8 0xffffffff80cb385f at VOP_LOOKUP_APV+0x5f
>>>>> kernel: #9 0xffffffff809a307f at lookup+0x5ef
>>>>> kernel: #10 0xffffffff809a263d at namei+0x62d
>>>>> kernel: #11 0xffffffff809b2b39 at kern_statfs+0x89
>>>>> kernel: #12 0xffffffff809b2a80 at sys_statfs+0x20
>>>>> kernel: #13 0xffffffff80c22134 at amd64_syscall+0x334
>>>>>
>>>>> FreeBSD example.com 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #1: Mon Sep 17 04:34:37 UTC 2012     root@example.com:/usr/obj/usr/src/sys/GENERIC  amd64
>>>>>
>>>>> 0xffffffff80922be2 is in _sx_xlock_hard (/usr/src/sys/kern/kern_sx.c:546).
>>>>> 541			x = sx->sx_lock;
>>>>> 542			if ((sx->lock_object.lo_flags & SX_NOADAPTIVE) == 0) {
>>>>> 543				if ((x & SX_LOCK_SHARED) == 0) {
>>>>> 544					x = SX_OWNER(x);
>>>>> 545					owner = (struct thread *)x;
>>>>> 546					if (TD_IS_RUNNING(owner)) {
>>>>> 547						if (LOCK_LOG_TEST(&sx->lock_object, 0))
>>>>> 548							CTR3(KTR_LOCK,
>>>>> 549						    "%s: spinning on %p held by %p",
>>>>> 550							    __func__, sx, owner);
>>>>>
>>>>
>>>> Could you please rather list frame #7 (zfsctl_snapdir_lookup+0x124)?
>>>>
>>>> -- 
>>>> Andriy Gapon
>>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sean Chittenden
>>> sean@chittenden.org
>>>
>>
>>
>> -- 
>> Andriy Gapon
>>
> 
> 
> 
> 
> --
> Sean Chittenden
> sean@chittenden.org
> 


-- 
Andriy Gapon


