Date:        Thu, 27 Aug 2015 15:44:26 -0500
From:        Karl Denninger <karl@denninger.net>
To:          Sean Chittenden <sean@chittenden.org>
Cc:          freebsd-fs@freebsd.org
Subject:     Re: Panic in ZFS during zfs recv (while snapshots being destroyed)
Message-ID:  <55DF76AA.3040103@denninger.net>
In-Reply-To: <sig.0681f4fd27.ADD991B6-BCF2-4B11-A5D6-EF1DB585AA33@chittenden.org>
References:  <55BB443E.8040801@denninger.net> <55CF7926.1030901@denninger.net> <55DF7191.2080409@denninger.net> <sig.0681f4fd27.ADD991B6-BCF2-4B11-A5D6-EF1DB585AA33@chittenden.org>
[-- Attachment #1 --]

No, but that does sound like it might be involved.....

And yeah, this did start when I moved the root pool to a mirrored pair
of Intel 530s off a pair of spinning-rust WD RE4s.... (The 530s are
darn nice performance-wise, reasonably inexpensive and thus very
suitable for a root filesystem drive, and they also pass the "pull the
power cord" test, incidentally.)

You may be onto something -- I'll try shutting it off.  But because I
can't make this happen on demand, and it's an "every week or two" panic
-- though ALWAYS while the zfs send | zfs recv is running AND always on
the same filesystem -- it will be a fair while before I know whether
it's fixed (over a month, probably, given the usual pattern here, as
that would be four "average" periods without a panic).....

I also wonder if I could tune this out with some of the other TRIM
parameters instead of losing it entirely (a sketch of both options is
in the P.S. below):

vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
vfs.zfs.trim.enabled: 1
vfs.zfs.vdev.trim_max_pending: 10000
vfs.zfs.vdev.trim_max_active: 64
vfs.zfs.vdev.trim_min_active: 1

That it's panicking in mtx_lock_sleep might point this way.... the
trace shows it coming from zfs_onexit_destroy, which ends up calling
zfs_unmount_snap() and then blows up in dounmount() while executing
mtx_lock_sleep().

I do wonder if I'm begging for new and innovative performance issues if
I run with TRIM off for an extended period of time, however..... :-)

On 8/27/2015 15:30, Sean Chittenden wrote:
> Have you tried disabling TRIM?  We recently ran into an issue where a
> `zfs delete` on a large dataset caused the host to panic because TRIM
> was tripping over the ZFS deadman timer.  Disabling TRIM worked as a
> valid workaround for us.  You mentioned a recent move to SSDs, so
> this can happen, esp. after the drive has experienced a little bit of
> actual work.  -sc
>
>
> --
> Sean Chittenden
> sean@chittenden.org
>
>
>> On Aug 27, 2015, at 13:22, Karl Denninger <karl@denninger.net> wrote:
>>
>> On 8/15/2015 12:38, Karl Denninger wrote:
>>> Update:
>>>
>>> This /appears/ to be related to attempting to send or receive a
>>> /cloned/ snapshot.
>>>
>>> I use /beadm/ to manage boot environments, and the crashes have all
>>> come while send/recv-ing the root pool, which is the one where
>>> these clones get created.  It is /not/ consistent within a given
>>> snapshot when it crashes, and a second attempt (which does a
>>> "recovery" send/receive) succeeds every time -- I've yet to have it
>>> panic twice sequentially.
>>>
>>> I surmise that the problem comes about when a file in the cloned
>>> snapshot is modified, but this is a guess at this point.
>>>
>>> I'm going to try to force replication of the problem on my test
>>> system.
>>>
>>> On 7/31/2015 04:47, Karl Denninger wrote:
>>>> I have an automated script that runs zfs send/recv copies to bring
>>>> a backup data set into congruence with the running copies nightly.
>>>> The source has automated snapshots running on a fairly frequent
>>>> basis through zfs-auto-snapshot.
>>>>
>>>> Recently I have started having a panic show up about once a week
>>>> during the backup run, but it's inconsistent.  It is in the same
>>>> place, but I cannot force it to repeat.
>>>>
>>>> The trap itself is a page fault in kernel mode in the zfs code at
>>>> zfs_unmount_snap(); here's the traceback from the KVM (sorry for
>>>> the image link but I don't have a better option right now.)
>>>>
>>>> I'll try to get a dump; this is a production machine with
>>>> encrypted swap so it's not normally turned on.
>>>>
>>>> Note that the pool that appears to be involved (the backup pool)
>>>> has passed a scrub, and thus I would assume the on-disk structure
>>>> is ok..... but that might be an unfair assumption.  It is always
>>>> occurring in the same dataset, although there are a half-dozen
>>>> that are sync'd -- if this one (the first one) successfully
>>>> completes during the run then all the rest will as well (that is,
>>>> whenever I restart the process it has always failed here.)  The
>>>> source pool is also clean and passes a scrub.
>>>>
>>>> The traceback is at http://www.denninger.net/kvmimage.png;
>>>> apologies for the image traceback but this is coming from a remote
>>>> KVM.
>>>>
>>>> I first saw this on 10.1-STABLE and it is still happening on
>>>> FreeBSD 10.2-PRERELEASE #9 r285890M, which I updated to in an
>>>> attempt to see if the problem was something that had been
>>>> addressed.
>>>>
>>>>
>>> --
>>> Karl Denninger
>>> karl@denninger.net <mailto:karl@denninger.net>
>>> /The Market Ticker/
>>> /[S/MIME encrypted email preferred]/
>> Second update: I have now taken another panic on 10.2-STABLE, same
>> deal, but without any cloned snapshots in the source image.  I had
>> thought that removing cloned snapshots might eliminate the issue;
>> that is now out the window.
>>
>> It ONLY happens on this one filesystem (the root one, incidentally),
>> which was fairly recently created when I moved this machine from
>> spinning rust to SSDs for the OS and root pool -- and only when it
>> is being backed up using zfs send | zfs recv (with the receive going
>> to a different pool in the same machine.)  I have yet to be able to
>> provoke it when using zfs send to copy to a different machine on the
>> same LAN, but given that it cannot be reproduced on demand I can't
>> be certain whether it's timing related (e.g. performance between the
>> two pools in question) or I simply haven't hit the unlucky
>> combination.
>>
>> This looks like some sort of race condition, and I will continue to
>> see if I can craft a case to make it occur "on demand".
>>
>> --
>> Karl Denninger
>> karl@denninger.net <mailto:karl@denninger.net>
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
>
>
> %SPAMBLOCK-SYS: Matched [+Sean Chittenden <sean@chittenden.org>], message ok
>

--
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/

[-- Attachment #2: S/MIME cryptographic signature --]
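
P.S. For the record, what I have in mind for "shutting it off" -- on
the assumption (someone correct me if I'm wrong) that
vfs.zfs.trim.enabled is a boot-time tunable on 10.x rather than a
runtime-writable sysctl -- is simply this in /boot/loader.conf,
followed by a reboot:

    # Disable ZFS TRIM entirely; boot-time tunable, takes effect on next boot
    vfs.zfs.trim.enabled="0"

The gentler alternative -- backing off the TRIM batching knobs rather
than losing TRIM outright -- would be something along these lines,
assuming those sysctls are writable at runtime on this kernel; the
values below are purely illustrative, not recommendations:

    # Delay TRIM issuance by more txgs and cut the concurrent TRIM I/Os per vdev
    sysctl vfs.zfs.trim.txg_delay=64
    sysctl vfs.zfs.vdev.trim_max_active=8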
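
P.P.S. For anyone trying to reproduce this: the nightly job is, in
essence, an incremental zfs send of the root pool piped into zfs recv
targeting the second (backup) pool in the same box.  Stripped of the
script's bookkeeping it amounts to something like the following -- the
pool, dataset and snapshot names here are made up for illustration and
are not the actual ones:

    # Incremental replication of the root pool into the local backup pool.
    # -R sends the whole dataset tree, -I includes the intermediate
    # zfs-auto-snapshot snapshots, and -F on the receive side rolls the
    # target back so the incremental applies cleanly.
    zfs send -R -I zroot@nightly-prev zroot@nightly-new | \
        zfs recv -Fduv backup/zroot

When the panic hits, it is partway through that pipeline on the first
(root) dataset; re-running the same transfer afterward has completed
without incident every time.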
