Message-ID: <4A947E57.6050700@acm.poly.edu>
Date: Tue, 25 Aug 2009 20:14:15 -0400
From: Boris Kochergin
To: freebsd-fs@freebsd.org
In-Reply-To: <4A8AA531.2000004@acm.poly.edu>
Subject: Re: ZFS RAID-Z panic on vdev failure + subsequent panics and hangs

Boris Kochergin wrote:
> Boris Kochergin wrote:
>> Pawel Jakub Dawidek wrote:
>>> On Fri, Aug 07, 2009 at 04:00:05PM -0400, Boris Kochergin wrote:
>>>
>>>> Pawel Jakub Dawidek wrote:
>>>>
>>>>> On Fri, Aug 07, 2009 at 03:34:34PM -0400, Boris Kochergin wrote:
>>>>>
>>>>>> Pawel Jakub Dawidek wrote:
>>>>>>
>>>>>>> Yeah, that's strange indeed. Could you try:
>>>>>>>
>>>>>>> print ab->b_arc_node.list_prev
>>>>>>> print ab->b_arc_node.list_next
>>>>>>>
>>>>>> (kgdb) print ab->b_arc_node.list_prev
>>>>>> $1 = (struct list_node *) 0x1
>>>>>>
>>>>> Yeah, list_prev is corrupted. If it panics on you everytime, I could
>>>>> send you a patch which will try to catch where the corruption occurs.
>>>>>
>>>> I eventually get the arc_evict panic every time I successfully
>>>> manage to mount the filesystem, but it usually panics (with the
>>>> other backtrace) as soon as I try to mount it, or mount just hangs.
>>>> I'll gladly try the patch, though--the data on the array is
>>>> important to me. Thanks.
>>>>
>>>
>>> To get the data from there you could also try to 'zfs send' it without
>>> mounting the dataset at all (just in case).
>>>
>> Sorry for the delay. I had to find another machine to move the disks
>> into so that I could continue experimenting. Anyway, the filesystem
>> didn't have any snapshots I could send, so I tried creating one with
>> "zfs snapshot home@1" and the machine hung.
>>
>> FYI, In the new machine, all disks (including the one with the /
>> filesystem) retain their device names.
>>
>> -Boris
>> _______________________________________________
>> freebsd-fs@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
> Some more panics using RELENG_8 sources from yesterday:
> http://acm.poly.edu/~spawk/zfs/. The one in panic3.txt happens much
> more often than the other ones. If any brave soul wants to look into
> it, I can provide NFS/geom_gate/whatever access to the disk images (or
> actual disks, if there's a difference) so that they can recreate the
> problem on a local machine.
>
> -Boris

For the archives: pjd@ took some time to examine the disk images I made
of the RAID-Z pool and found heavy corruption in the metadata. As it
turns out, the machine had bad RAM during the incident, and that is
probably what caused the corruption. Unfortunately, I only started to
suspect the RAM recently, as random userland application crashes and
kernel panics became frequent.

This is good news for ZFS users, as it indicates that ZFS did not
corrupt my pool on its own. I do, however, advise you to be mindful of
the problems bad memory can cause for ZFS. Personally, I will start
shelling out a few more bucks for the ECC stuff from now on.

(Eagerly awaiting the read-only offline recovery functionality described
at http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20092.html).

-Boris