Date: Tue, 17 May 2011 05:23:23 -0700
From: Jeremy Chadwick
To: Andriy Gapon
Cc: Charles Sprickman, stable@FreeBSD.org
Subject: Re: 8.1R possible zfs snapshot livelock?
Message-ID: <20110517122323.GA49650@icarus.home.lan>
In-Reply-To: <4DD2624A.9080708@FreeBSD.org>
References: <20110517073029.GA44359@icarus.home.lan> <4DD25264.8040305@FreeBSD.org> <20110517112952.GA48610@icarus.home.lan> <4DD2624A.9080708@FreeBSD.org>

On Tue, May 17, 2011 at 02:55:54PM +0300, Andriy Gapon wrote:
> on 17/05/2011 14:29 Jeremy Chadwick said the following:
> > On Tue, May 17, 2011 at 01:48:04PM +0300, Andriy Gapon wrote:
> >> on 17/05/2011 10:30 Jeremy Chadwick said the following:
> >>> On Tue, May 17, 2011 at 02:43:44AM -0400, Charles Sprickman wrote:
> >>>> Does this sound familiar to anyone running ZFS with snapshots?
> >>>
> >>> Yes, and is exactly why I don't use them. :-)
> >>
> >> You put a smiley, but is this an attempt at FUD?
> >
> > I wish it were.
>
> The reason I asked is that I could have easily answered "No, that's why I use them
> all the time".  And I am sure many people would join me on this.
> So the way you originally described the issue was sufficiently non-specific and
> strong.

You're absolutely right -- and to me, your answer/experience holds much
more weight than my own.  But if you and I were presenting advocacy of
ZFS snapshots to a person who had experienced problems with it, their
reluctance to believe would be understandable, no?  They'd want some
form of reassurance that the problem they experience was known or had
been fixed in some way.

I guess what I'm saying is that yes my wording was strong -- it was an
opinion based on past experience.  Fact: I don't have any present-day
evidence to validate my opinion, since the ZFS code has changed greatly
between then and now.  But also fact: I did experience something very
similar to what Charles did.  Sympathy is sometimes all we admins/users
have in situations like this.  :-)  But I do understand your point.

> > I experienced similar behaviour to Charles during the
> > early 8.x days (possibly 8.1-RELEASE, I forget; I may be thinking of
> > 8.0?) where ZFS snapshots would occasionally result in the kernel
> > deadlocking on ZFS-bound I/O.  The kernel was alive/responsive to some
> > degree but ZFS I/O would just indefinitely stall at that point,
> > requiring a full system reset.  No disk or controller problems (same
> > hardware I'm using today actually!).
> >
> > I believe there were commits and improvements for snapshotting committed
> > between 8.1-RELEASE and 8.2-RELEASE, but I haven't bothered to test
> > them.  The experience left a very bad taste in my mouth and as such I
> > have avoided ZFS snapshots since.
> >
> > I'd be willing to try them again assuming someone can at least confirm
> > that there were commits done to address snapshot concerns during the
> > past year or so.  But...
> >
> > There are still some outstanding incidents that directly pertain to ZFS
> > snapshots, or are "related" to ZFS snapshots (meaning things like
> > send/recv which are commonly used alongside snapshots), which I remember
> > reading about but really saw no answer to:
> >
> > * ZFS send | ssh zfs recv results in ZFS subsystem hanging; 8.1-RELEASE;
> >   February 2011:
> >   http://lists.freebsd.org/pipermail/freebsd-fs/2011-February/010602.html
> >
> > * Kernel panic during heavy disk I/O while "zfs recv" being used
> >   simultaneously; CURRENT (so ZFS v28?); April 2011:
> >   http://lists.freebsd.org/pipermail/freebsd-fs/2011-April/011155.html
> >
> > * ZFS snapshots taking an extremely long time to be deleted; RELENG_8_1;
> >   February 2011:
> >   http://lists.freebsd.org/pipermail/freebsd-fs/2011-February/010797.html
> >
> > * "zfs destroy -r" not working on filesystem-level snapshots but works
> >   on pool-level snapshots; RELENG_8 with ZFS v28 patch (and is specific
> >   to ZFS v28 given the info); May 2011:
> >   http://lists.freebsd.org/pipermail/freebsd-fs/2011-May/011412.html
> >
> > Sorry to just rattle off a bunch of URLs and issues at once; it's not my
> > intention to slander work on ZFS or anything even remotely like that.
> >
> > I'm just wondering given the number of problem reports that seem to come
> > in about snapshot or snapshot-related ZFS stuff, where we stand on
> > these?  This is mainly for Charles' benefit and not so much mine (our
> > rsnapshot/rsync-based backups work great for us at this time, sans the
> > stomping of atime).
> >
>
> Problem reports are always over-represented on the mailing lists.
> People rarely write that e.g. ZFS snapshot has flawlessly worked for them for the
> millionth time again today.  I am not aware of any known-but-not-fixed issues in
> this area.  Each problem report should be properly investigated individually.

Both absolutely correct and understood.  It just really sucks to be one
of the people who experiences problems.  When you have a system that
you've taken a lot of time to get up and working, it runs reliably for
weeks/months, then suddenly something like the above happens, you have
to start weighing the pros and cons to alternatives (using something
other than snapshot capability, changing filesystems, etc.).

It would help if folks had some guidelines for what information would be
helpful for kernel developers in the case of a ZFS deadlock of this
nature.  I would say the majority of the admin/user community (and this
includes me!), once at a "db>" prompt, have no clue how to proceed.

So for Charles' situation, the next time it happens what would be useful
for him to provide?  The best I could come up with was to induce doadump
then reboot to get the system up/working again, and then use kgdb
after-the-fact.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |