Date:      Thu, 27 Aug 2015 15:44:26 -0500
From:      Karl Denninger <karl@denninger.net>
To:        Sean Chittenden <sean@chittenden.org>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: Panic in ZFS during zfs recv (while snapshots being destroyed)
Message-ID:  <55DF76AA.3040103@denninger.net>
In-Reply-To: <sig.0681f4fd27.ADD991B6-BCF2-4B11-A5D6-EF1DB585AA33@chittenden.org>
References:  <55BB443E.8040801@denninger.net> <55CF7926.1030901@denninger.net> <55DF7191.2080409@denninger.net> <sig.0681f4fd27.ADD991B6-BCF2-4B11-A5D6-EF1DB585AA33@chittenden.org>

No, but that does sound like it might be involved.....

And yeah, this did start when I moved the root pool from a pair of
spinning-rust WD RE4s to a mirrored pair of Intel 530s.... (The 530s are
darn nice performance-wise, reasonably inexpensive, and thus very
suitable for a root filesystem drive; they also pass the "pull the power
cord" test, incidentally.)

You may be onto something -- I'll try shutting it off. But since I
can't make this happen on demand, and it's an "every week or two" panic
-- ALWAYS while the zfs send | zfs recv is running, and always on the
same filesystem -- it will be a fair while before I know if it's fixed
(likely over a month, given the usual pattern here, as that would be 4
"average" periods without a panic).....

I also wonder if I could tune this out with some of the other TRIM
parameters instead of losing it entirely.

vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
vfs.zfs.trim.enabled: 1
vfs.zfs.vdev.trim_max_pending: 10000
vfs.zfs.vdev.trim_max_active: 64
vfs.zfs.vdev.trim_min_active: 1
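
If I do experiment, roughly what I have in mind is the following (a
sketch only -- vfs.zfs.trim.enabled is, I believe, a boot-time tunable
on this kernel, so it goes in /boot/loader.conf, and if the other two
turn out to be read-only at runtime they'd have to go there as well;
the numbers are picked out of the air):

  # Option 1: turn TRIM off entirely (/boot/loader.conf, then reboot)
  vfs.zfs.trim.enabled=0

  # Option 2: first just back off the TRIM aggressiveness and see if
  # the panic goes away without giving TRIM up altogether
  sysctl vfs.zfs.trim.txg_delay=64        # hold TRIMs for more txgs
  sysctl vfs.zfs.vdev.trim_max_active=8   # fewer concurrent TRIM I/Os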

That it's panicking in mtx_lock_sleep() might point this way.... the
trace shows it coming from zfs_onexit_destroy(), which ends up calling
zfs_unmount_snap(), and then it blows up in dounmount() while executing
mtx_lock_sleep().
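
If DTrace's fbt provider will attach to that function on this kernel
(I haven't verified that on this box), a one-liner like this should
catch the kernel stack every time a snapshot unmount fires, which might
show what it's racing against while the send | recv is running:

  # print the kernel stack on every entry to zfs_unmount_snap
  dtrace -n 'fbt::zfs_unmount_snap:entry { stack(); }'

Leaving that running during the nightly backup window would at least
confirm the path live instead of only from the post-mortem trace.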

I do wonder if I'm begging for new and innovative performance issues if
I run with TRIM off for an extended period of time, however..... :-)
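
(If the zio_trim kstats are present on this kernel -- I think they came
in with the TRIM support, but that's an assumption -- I can at least
quantify that worry before and after turning it off:

  # counters for TRIM activity: bytes, success, failed, unsupported
  sysctl kstat.zfs.misc.zio_trim

which should give some idea of how much TRIM work these pools actually
generate.)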


On 8/27/2015 15:30, Sean Chittenden wrote:
> Have you tried disabling TRIM?  We recently ran into an issue where a `zfs delete` on a large dataset caused the host to panic because TRIM was tripping over the ZFS deadman timer.  Disabling TRIM worked as a valid workaround for us.  You mentioned a recent move to SSDs, so this can happen, esp. after the drive has experienced a little bit of actual work.  -sc
>
>
> --
> Sean Chittenden
> sean@chittenden.org
>
>
>> On Aug 27, 2015, at 13:22, Karl Denninger <karl@denninger.net> wrote:
>>
>> On 8/15/2015 12:38, Karl Denninger wrote:
>>> Update:
>>>
>>> This /appears/ to be related to attempting to send or receive a
>>> /cloned/ snapshot.
>>>
>>> I use /beadm/ to manage boot environments and the crashes have all
>>> come while send/recv-ing the root pool, which is the one where these
>>> clones get created.  It is /not/ consistent within a given snapshot
>>> when it crashes and a second attempt (which does a "recovery"
>>> send/receive) succeeds every time -- I've yet to have it panic twice
>>> sequentially.
>>>
>>> I surmise that the problem comes about when a file in the cloned
>>> snapshot is modified, but this is a guess at this point.
>>>
>>> I'm going to try to force replication of the problem on my test system.
>>>
>>> On 7/31/2015 04:47, Karl Denninger wrote:
>>>> I have an automated script that runs zfs send/recv copies to bring a
>>>> backup data set into congruence with the running copies nightly.  The
>>>> source has automated snapshots running on a fairly frequent basis
>>>> through zfs-auto-snapshot.
>>>>
>>>> Recently I have started having a panic show up about once a week during
>>>> the backup run, but it's inconsistent.  It is in the same place, but I
>>>> cannot force it to repeat.
>>>>
>>>> The trap itself is a page fault in kernel mode in the zfs code at
>>>> zfs_unmount_snap(); here's the traceback from the KVM (sorry for the
>>>> image link but I don't have a better option right now.)
>>>>
>>>> I'll try to get a dump; this is a production machine with encrypted
>>>> swap, so crash dumps are not normally enabled.
>>>>
>>>> Note that the pool that appears to be involved (the backup pool) has
>>>> passed a scrub and thus I would assume the on-disk structure is ok.....
>>>> but that might be an unfair assumption.  It always occurs in the
>>>> same dataset, although there are a half-dozen that are sync'd -- if this
>>>> one (the first one) successfully completes during the run then all the
>>>> rest will as well (that is, whenever I restart the process it has always
>>>> failed here.)  The source pool is also clean and passes a scrub.
>>>>
>>>> Traceback is at http://www.denninger.net/kvmimage.png; apologies for the
>>>> image traceback but this is coming from a remote KVM.
>>>>
>>>> I first saw this on 10.1-STABLE and it is still happening on FreeBSD
>>>> 10.2-PRERELEASE #9 r285890M, which I updated to in an attempt to see if
>>>> the problem was something that had been addressed.
>>>>
>>>>
>>> -- 
>>> Karl Denninger
>>> karl@denninger.net
>>> /The Market Ticker/
>>> /[S/MIME encrypted email preferred]/
>> Second update: I have now taken another panic on 10.2-STABLE, same deal,
>> but without any cloned snapshots in the source image. I had thought that
>> removing cloned snapshots might eliminate the issue; that is now out the
>> window.
>>
>> It ONLY happens on this one filesystem (the root one, incidentally),
>> which is fairly recently created, as I moved this machine from spinning
>> rust to SSDs for the OS and root pool -- and only when it is being
>> backed up using zfs send | zfs recv (with the receive going to a
>> different pool in the same machine.)  I have yet to be able to provoke
>> it when using zfs send to copy to a different machine on the same LAN,
>> but since it cannot be reproduced on demand I can't be certain whether
>> it's timing-related (e.g. performance between the two pools in
>> question) or whether I simply haven't hit the unlucky combination.
>>
>> This looks like some sort of race condition, and I will continue to see
>> if I can craft a case to make it occur "on demand".
>>
>> -- 
>> Karl Denninger
>> karl@denninger.net
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
>

-- 
Karl Denninger
karl@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
