From owner-freebsd-fs@FreeBSD.ORG  Wed Oct 10 22:45:42 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id D4419C13;
 Wed, 10 Oct 2012 22:45:42 +0000 (UTC) (envelope-from avg@FreeBSD.org)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
 by mx1.freebsd.org (Postfix) with ESMTP id EDA6F8FC16;
 Wed, 10 Oct 2012 22:45:41 +0000 (UTC)
Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua
 [212.40.38.100])
 by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id BAA25025;
 Thu, 11 Oct 2012 01:45:37 +0300 (EEST)
 (envelope-from avg@FreeBSD.org)
Received: from localhost ([127.0.0.1])
 by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD))
 id 1TM521-000LXO-F2; Thu, 11 Oct 2012 01:45:37 +0300
Message-ID: <5075FA8E.10200@FreeBSD.org>
Date: Thu, 11 Oct 2012 01:45:34 +0300
From: Andriy Gapon <avg@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:15.0) Gecko/20120913 Thunderbird/15.0.1
MIME-Version: 1.0
To: Sean Chittenden <sean@chittenden.org>,
 Pawel Jakub Dawidek <pjd@FreeBSD.org>
Subject: Re: ZFS crashing during snapdir lookup for non-existent snapshot...
References: <B244C0E9-539D-4F7C-8616-378E8469F4BB@chittenden.org>
 <5075E3E0.7060706@FreeBSD.org>
 <0A6567E7-3BA5-4F27-AEB2-1C00EDE00641@chittenden.org>
 <5075EDDD.4030008@FreeBSD.org>
 <A1901AB5-6E83-488E-9D29-EA7C4E3720F3@chittenden.org>
In-Reply-To: <A1901AB5-6E83-488E-9D29-EA7C4E3720F3@chittenden.org>
X-Enigmail-Version: 1.4.3
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@FreeBSD.org>
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 10 Oct 2012 22:45:42 -0000


[restoring mailing list cc]

on 11/10/2012 00:58 Sean Chittenden said the following:
>>> I don't have a dump from this particular system, only the backtrace from the crash. The system is ZFS only and I only have a ZFS swapdir. :-/
>>>
>>> I still have the kernel, so I can poke at the code and the compiled kernel (kernel.symbols). What are you looking for? -sc
>>>
>>
>> list *zfsctl_snapdir_lookup+0x124 in kgdb
> 
> (kgdb) list *zfsctl_snapdir_lookup+0x124
> 0xffffffff816e9384 is in zfsctl_snapdir_lookup (/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c:992).
> 987				*direntflags = ED_CASE_CONFLICT;
> 988	#endif
> 989		}
> 990	
> 991		mutex_enter(&sdp->sd_lock);
> 992		search.se_name = (char *)nm;
> 993		if ((sep = avl_find(&sdp->sd_snaps, &search, &where)) != NULL) {
> 994			*vpp = sep->se_root;
> 995			VN_HOLD(*vpp);
> 996			err = traverse(vpp, LK_EXCLUSIVE | LK_RETRY);

It seems that the problem is a Solaris-ism that remained in the code.
I think that zfsctl_snapdir_inactive should not destroy sdp; that should be the
job of vop_reclaim.  Otherwise, if the vnode is re-activated, its v_data points
to freed memory.
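
A minimal user-space sketch of that lifecycle, not the actual fix and not
kernel code.  Every model_* name below is a hypothetical stand-in invented for
illustration; only the idea (inactive leaves v_data alone, reclaim is the one
place that frees it) comes from the analysis above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the real kernel structures. */
struct model_snapdir {
	int sd_lock_held;	/* stands in for sd_lock */
};

struct model_vnode {
	void *v_data;		/* filesystem-private data, here a model_snapdir */
};

/* Allocate a vnode with its private data attached.  Returns NULL on failure. */
struct model_vnode *
model_vnode_create(void)
{
	struct model_vnode *vp = calloc(1, sizeof(*vp));

	if (vp == NULL)
		return (NULL);
	vp->v_data = calloc(1, sizeof(struct model_snapdir));
	if (vp->v_data == NULL) {
		free(vp);
		return (NULL);
	}
	return (vp);
}

/*
 * Last use reference dropped.  The vnode may be re-activated later,
 * so the private data is deliberately left in place.
 */
void
model_inactive(struct model_vnode *vp)
{
	(void)vp;
}

/* The vnode is being recycled for good: the one place v_data is freed. */
void
model_reclaim(struct model_vnode *vp)
{
	free(vp->v_data);
	vp->v_data = NULL;
}

/* Lookup takes the lock that lives inside the private data. */
int
model_lookup(struct model_vnode *vp)
{
	struct model_snapdir *sdp = vp->v_data;

	if (sdp == NULL)
		return (-1);		/* vnode already reclaimed */
	sdp->sd_lock_held = 1;		/* models mutex_enter(&sdp->sd_lock) */
	sdp->sd_lock_held = 0;		/* models mutex_exit(&sdp->sd_lock) */
	return (0);
}

/* Walk the lifecycle: use, go inactive, get re-activated, then reclaim. */
int
model_demo(void)
{
	struct model_vnode *vp = model_vnode_create();
	int before, after, reclaimed;

	if (vp == NULL)
		return (-1);
	before = model_lookup(vp);
	model_inactive(vp);
	after = model_lookup(vp);	/* re-activation: v_data still valid */
	model_reclaim(vp);
	reclaimed = model_lookup(vp);	/* now refused instead of faulting */
	free(vp);
	return ((before == 0 && after == 0 && reclaimed == -1) ? 0 : 1);
}
```

If model_inactive() freed v_data instead, the second model_lookup() would
dereference freed memory, which is the shape of the crash in the backtrace.
Calling model_demo() from a small main() is enough to exercise the sketch.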


>>> On Oct 10, 2012, at 14:08 PM, Andriy Gapon <avg@FreeBSD.org> wrote:
>>>
>>>> on 10/10/2012 23:57 Sean Chittenden said the following:
>>>>> Using a FreeBSD -STABLE build from 2012-09-17, I now have the ability to crash FreeBSD/ZFS within a few hours of stress testing. It appears as though there's a locking problem when attempting to interrogate stats on a ZFS snapshot that doesn't exist any more. I believe the scenario is as follows:
>>>>>
>>>>> Background:
>>>>>
>>>>> *) `zfs set snapdir=visible` /was/ set on a data set
>>>>>
>>>>> *) Snapshots were being run once an hour for weeks, long enough for zabbix to auto-discover the snapshots as valid file systems.
>>>>>
>>>>> *) `zfs inherit snapdir` was recently set (about a week ago), but zabbix is still attempting to inquire about snapshots that are no longer visible or no longer exist.
>>>>>
>>>>>
>>>>> After snapshots were deleted through the normal process of aging, zabbix is still interrogating the file system attempting to acquire information about the now deleted snapshots.
>>>>>
>>>>> FreeBSD crashes once every few minutes when zabbix is running and pulling ZFS information about the now hidden (or most likely deleted) snapshots. I believe that zabbix is using getfsspec(3) with the now stale snapshot name in rapid succession and is somehow triggering a race when there are two concurrent calls to two different non-existent snapshots.
>>>>>
>>>>> -sc
>>>>>
>>>>>
>>>>> kernel: Fatal trap 12: page fault while in kernel mode
>>>>> kernel: cpuid = 0; apic id = 00
>>>>> kernel: fault virtual address    = 0x368
>>>>> kernel: fault code               = supervisor read data, page not present
>>>>> kernel: instruction pointer      = 0x20:0xffffffff80922be2
>>>>> kernel: stack pointer            = 0x28:0xffffff8487d7b0d0
>>>>> kernel: frame pointer            = 0x28:0xffffff8487d7b170
>>>>> kernel: code segment             = base 0x0, limit 0xfffff, type 0x1b
>>>>> kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
>>>>> kernel: processor eflags = interrupt enabled, resume, IOPL = 0
>>>>> kernel: current process          = 3536 (zabbix_agentd)
>>>>> kernel: trap number              = 12
>>>>> kernel: panic: page fault
>>>>> kernel: cpuid = 0
>>>>> kernel: KDB: stack backtrace:
>>>>> kernel: #0 0xffffffff80950800 at kdb_backtrace+0x60
>>>>> kernel: #1 0xffffffff8091ac2d at panic+0x1fd
>>>>> kernel: #2 0xffffffff80c21858 at trap_fatal+0x388
>>>>> kernel: #3 0xffffffff80c21b23 at trap_pfault+0x2b3
>>>>> kernel: #4 0xffffffff80c212b5 at trap+0x5b5
>>>>> kernel: #5 0xffffffff80c0ba22 at calltrap+0x8
>>>>> kernel: #6 0xffffffff8092271e at _sx_xlock+0x5e
>>>>> kernel: #7 0xffffffff816e9384 at zfsctl_snapdir_lookup+0x124
>>>>> kernel: #8 0xffffffff80cb385f at VOP_LOOKUP_APV+0x5f
>>>>> kernel: #9 0xffffffff809a307f at lookup+0x5ef
>>>>> kernel: #10 0xffffffff809a263d at namei+0x62d
>>>>> kernel: #11 0xffffffff809b2b39 at kern_statfs+0x89
>>>>> kernel: #12 0xffffffff809b2a80 at sys_statfs+0x20
>>>>> kernel: #13 0xffffffff80c22134 at amd64_syscall+0x334
>>>>>
>>>>> FreeBSD example.com 9.1-PRERELEASE FreeBSD 9.1-PRERELEASE #1: Mon Sep 17 04:34:37 UTC 2012     root@example.com:/usr/obj/usr/src/sys/GENERIC  amd64
>>>>>
>>>>> 0xffffffff80922be2 is in _sx_xlock_hard (/usr/src/sys/kern/kern_sx.c:546).
>>>>> 541			x = sx->sx_lock;
>>>>> 542			if ((sx->lock_object.lo_flags & SX_NOADAPTIVE) == 0) {
>>>>> 543				if ((x & SX_LOCK_SHARED) == 0) {
>>>>> 544					x = SX_OWNER(x);
>>>>> 545					owner = (struct thread *)x;
>>>>> 546					if (TD_IS_RUNNING(owner)) {
>>>>> 547						if (LOCK_LOG_TEST(&sx->lock_object, 0))
>>>>> 548							CTR3(KTR_LOCK,
>>>>> 549						    "%s: spinning on %p held by %p",
>>>>> 550							    __func__, sx, owner);
>>>>>
>>>>
>>>> Could you please rather list frame #7 (zfsctl_snapdir_lookup+0x124)?
>>>>
>>>> -- 
>>>> Andriy Gapon
>>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sean Chittenden
>>> sean@chittenden.org
>>>
>>
>>
>> -- 
>> Andriy Gapon
>>
> 
> 
> 
> 
> --
> Sean Chittenden
> sean@chittenden.org
> 


-- 
Andriy Gapon