From owner-freebsd-fs@FreeBSD.ORG Sun May 6 06:11:03 2012
From: Artem Belevich
To: Michael Richards
Cc: freebsd-fs@freebsd.org
Date: Sat, 5 May 2012 23:11:01 -0700
Subject: Re: ZFS Kernel Panics with 32 and 64 bit versions of 8.3 and 9.0

I believe I've run into this issue a couple of times. In all cases the
culprit was memory corruption. If I were to guess, the corruption damaged
critical data *before* ZFS calculated the checksum and wrote it to disk.
Once that happened, the kernel would panic every time the pool was in use.
Crashes could happen as early as zpool import, or as late as a few days
into uptime or the next scheduled scrub. I even tried importing/scrubbing
the pool on OpenSolaris without much success -- while Solaris didn't crash
outright, it failed to import the pool with an internal assertion.

On Sat, May 5, 2012 at 7:13 PM, Michael Richards wrote:
> Originally I had an 8.1 server set up on a 32-bit kernel. The OS is on a
> UFS filesystem and (it's a mail server) the business part of the
> operation is on ZFS.
>
> One day it crashed with an odd kernel panic. I assumed it was a memory
> issue so I had more RAM installed. I tried to get a PAE kernel working
> to use this extra RAM but it was crashing every few hours.
>
> Suspecting a hardware issue, all the hardware was replaced.

Bad memory could indeed do that.

> I had some difficulty trying to figure out how to mount my old ZFS
> partition but eventually did so.
...
> zpool import -f -R /altroot 10433152746165646153 olddata
> panics the kernel. Similar panic as seen in all the other kernel versions.
> Gives a bit more info about things I've tried. Whatever it is seems to
> affect a wide variety of kernels.

The kernel is just a messenger here.
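As an aside: if you want to poke at the pool without letting the kernel
touch it at all, zdb(8) can be pointed at an exported pool from userland,
so at worst it dumps core instead of panicking the box. A rough sketch on
my part, not something from the original report -- it assumes the pool is
still exported and that zdb can find its devices under /dev (use -p <dir>
otherwise):

  # display the active uberblock of the exported pool
  zdb -e -u olddata

  # dump the on-disk labels of one of the pool's vdevs
  # (/dev/ada0p3 is a made-up device name -- substitute your own)
  zdb -l /dev/ada0p3

The txg recorded in that active uberblock is what the roll-back described
below works against.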
The root cause is that while ZFS does go an extra mile or two to ensure
data consistency, there's only so much it can do if the RAM is bad. Once
that kind of problem happens, it may leave the pool in a state that ZFS
cannot deal with out of the box.

Not everything may be lost, though. First of all -- make a copy of your
pool if it's feasible. The probability of screwing it up even more is
rather high.

ZFS internally keeps a large number of uberblocks. Each uberblock is sort
of a periodic checkpoint of the pool state, written after ZFS commits the
next transaction group (every 10-40 seconds depending on the
vfs.zfs.txg.timeout sysctl, or more often if there is a lot of ongoing
write activity). Basically, you need to destroy the most recent uberblock
to manually roll back your ZFS pool. Hopefully you'll only need to nuke a
few of the most recent ones to restore the pool to the point before the
corruption ruined it.

Now, ZFS keeps multiple copies of each uberblock. You will need to nuke
*all* instances of the most recent uberblock in order to roll the pool
state backwards.

The Solaris Internals site seems to have a script to do that now (I wish
I had known about it back when I needed it):
http://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script

Good luck!
--Artem
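P.S. Before reaching for the uberblock-nuking script it may be worth
seeing whether the built-in recovery import can do the roll-back for you.
This is only a sketch, not something I tested against your pool; it
assumes your zpool supports the -F/-n recovery options (the v28 ZFS in
8.3/9.0 does):

  # dry run: report whether discarding the last few transaction groups
  # would make the pool importable, without modifying anything on disk
  zpool import -f -F -n -R /altroot 10433152746165646153 olddata

  # if that looks sane, drop -n to actually perform the rewind
  zpool import -f -F -R /altroot 10433152746165646153 olddata

It attempts roughly the same thing as the script -- discarding the newest
txgs -- just with the stock tools instead of manual surgery. The catch is
that -F is ignored if zpool thinks the pool is importable as-is, in which
case you're back to the script.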