From owner-freebsd-fs@freebsd.org Thu Aug 27 20:30:39 2015
From: Sean Chittenden <sean@chittenden.org>
To: Karl Denninger <karl@denninger.net>
Cc: freebsd-fs@freebsd.org
Date: Thu, 27 Aug 2015 13:30:24 -0700
Subject: Re: Panic in ZFS during zfs recv (while snapshots being destroyed)
In-Reply-To: <55DF7191.2080409@denninger.net>
References: <55BB443E.8040801@denninger.net> <55CF7926.1030901@denninger.net> <55DF7191.2080409@denninger.net>
List-Id: Filesystems <freebsd-fs@freebsd.org>
Have you tried disabling TRIM? We recently ran into an issue where a
`zfs delete` on a large dataset caused the host to panic because TRIM
was tripping over the ZFS deadman timer. Disabling TRIM worked as a
valid workaround for us. You mentioned a recent move to SSDs, so this
can happen, especially after the drive has experienced a little bit of
actual work.

-sc

--
Sean Chittenden
sean@chittenden.org

> On Aug 27, 2015, at 13:22, Karl Denninger wrote:
>
> On 8/15/2015 12:38, Karl Denninger wrote:
>> Update:
>>
>> This /appears/ to be related to attempting to send or receive a
>> /cloned/ snapshot.
>>
>> I use /beadm/ to manage boot environments and the crashes have all
>> come while send/recv-ing the root pool, which is the one where these
>> clones get created. It is /not/ consistent within a given snapshot
>> when it crashes, and a second attempt (which does a "recovery"
>> send/receive) succeeds every time -- I've yet to have it panic twice
>> sequentially.
>>
>> I surmise that the problem comes about when a file in the cloned
>> snapshot is modified, but this is a guess at this point.
>>
>> I'm going to try to force replication of the problem on my test system.
>>
>> On 7/31/2015 04:47, Karl Denninger wrote:
>>> I have an automated script that runs zfs send/recv copies to bring a
>>> backup data set into congruence with the running copies nightly. The
>>> source has automated snapshots running on a fairly frequent basis
>>> through zfs-auto-snapshot.
>>>
>>> Recently I have started having a panic show up about once a week
>>> during the backup run, but it's inconsistent. It is in the same
>>> place, but I cannot force it to repeat.
>>>
>>> The trap itself is a page fault in kernel mode in the ZFS code at
>>> zfs_unmount_snap(); here's the traceback from the KVM (sorry for the
>>> image link, but I don't have a better option right now.)
>>>
>>> I'll try to get a dump; this is a production machine with encrypted
>>> swap, so crash dumps are not normally turned on.
>>>
>>> Note that the pool that appears to be involved (the backup pool) has
>>> passed a scrub, and thus I would assume the on-disk structure is
>>> OK... but that might be an unfair assumption. It is always occurring
>>> in the same dataset, although there are a half-dozen that are sync'd
>>> -- if this one (the first one) successfully completes during the run
>>> then all the rest will as well (that is, whenever I restart the
>>> process it has always failed here.) The source pool is also clean
>>> and passes a scrub.
>>>
>>> The traceback is at http://www.denninger.net/kvmimage.png; apologies
>>> for the image traceback, but this is coming from a remote KVM.
>>>
>>> I first saw this on 10.1-STABLE and it is still happening on FreeBSD
>>> 10.2-PRERELEASE #9 r285890M, which I updated to in an attempt to see
>>> if the problem was something that had already been addressed.
>>>
>>
>> --
>> Karl Denninger
>> karl@denninger.net
>> /The Market Ticker/
>> /[S/MIME encrypted email preferred]/
>
> Second update: I have now taken another panic on 10.2-STABLE, same
> deal, but without any cloned snapshots in the source image. I had
> thought that removing cloned snapshots might eliminate the issue; that
> is now out the window.
>
> It ONLY happens on this one filesystem (the root one, incidentally),
> which was fairly recently created when I moved this machine from
> spinning rust to SSDs for the OS and root pool -- and only when it is
> being backed up by using zfs send | zfs recv (with the receive going
> to a different pool in the same machine.)
> I have yet to be able to provoke it when using zfs send to copy to a
> different machine on the same LAN, but given that it is not
> reproducible on demand I can't be certain whether it's timing-related
> (e.g. performance between the two pools in question) or whether I just
> haven't hit the unlucky combination.
>
> This looks like some sort of race condition, and I will continue to
> see if I can craft a case to make it occur "on demand".
>
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/
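P.S. For anyone on the list wanting to try the TRIM workaround described
above: on the FreeBSD 10.x systems we looked at, ZFS TRIM is controlled
by the vfs.zfs.trim.enabled loader tunable, which is set at boot rather
than at runtime. A minimal sketch, assuming that tunable name applies to
your release (verify against your exact version before relying on it):

```shell
# Inspect the current TRIM setting (1 = enabled, 0 = disabled).
sysctl vfs.zfs.trim.enabled

# Disable ZFS TRIM at boot: add the tunable to /boot/loader.conf,
# then reboot for the change to take effect.
echo 'vfs.zfs.trim.enabled=0' >> /boot/loader.conf
```

If the panics stop with TRIM off, that at least narrows the search even
if it isn't an acceptable long-term setting for SSD health.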
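P.P.S. For readers trying to reproduce this, the replication pattern
Karl describes (a nightly zfs send | zfs recv into a second pool on the
same machine) can be sketched roughly as below. The pool names, snapshot
names, and the exact -R/-I/-F flag combination are illustrative
assumptions on my part, not his actual script:

```shell
#!/bin/sh
# Hypothetical sketch of an incremental send/recv backup run.
# "zroot" (source pool) and "backup" (destination pool) are made-up names.
PREV=zroot@backup-2015-08-26
CURR=zroot@backup-2015-08-27

# Take a new recursive snapshot of the source pool.
zfs snapshot -r "${CURR}"

# Send the replication stream for everything between the previous and
# current snapshots (-R: include descendants and their properties,
# -I: include all intermediate snapshots) and receive it into the
# backup pool, forcing a rollback of the target if needed (-F).
zfs send -R -I "${PREV}" "${CURR}" | zfs recv -F -d backup
```

A -F receive into a pool holding clones/boot environments is exactly the
kind of path where a destroy-vs-unmount race in zfs_unmount_snap() could
plausibly be hit.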