Date:      Sat, 5 May 2012 23:11:01 -0700
From:      Artem Belevich <art@freebsd.org>
To:        Michael Richards <hackish@gmail.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: ZFS Kernel Panics with 32 and 64 bit versions of 8.3 and 9.0
Message-ID:  <CAFqOu6gz+Fd-NvPivMz3nfeGCYz0a563yNBOpmsAyHZS_TQybQ@mail.gmail.com>
In-Reply-To: <CAPUouH3zgnGdzbe=0x4M32_1D-9J-E=_y-BP1zhyu-axBxsjwA@mail.gmail.com>
References:  <CAPUouH3zgnGdzbe=0x4M32_1D-9J-E=_y-BP1zhyu-axBxsjwA@mail.gmail.com>

I believe I've run into this issue two or three times. In all cases
the culprit was memory corruption. If I were to guess, the corruption
damaged critical data *before* ZFS calculated the checksum and wrote
it to disk. Once that happened, the kernel would panic every time the
pool was in use. Crashes could happen as early as the zpool import,
or as late as a few days into uptime or the next scheduled scrub. I
even tried importing/scrubbing the pool on OpenSolaris without much
success -- while Solaris didn't crash outright, it failed to import
the pool with an internal assertion.

On Sat, May 5, 2012 at 7:13 PM, Michael Richards <hackish@gmail.com> wrote:
> Originally I had an 8.1 server set up with a 32-bit kernel. The OS is
> on a UFS filesystem and (it's a mail server) the business part of the
> operation is on ZFS.
>
> One day it crashed with an odd kernel panic. I assumed it was a memory
> issue, so I had more RAM installed. I tried to get a PAE kernel working
> to use this extra RAM, but it was crashing every few hours.
>
> Suspecting a hardware issue, all the hardware was replaced.

Bad memory could indeed do that.

> I had some difficulty figuring out how to mount my old ZFS
> partition, but eventually did so.
...
> zpool import -f -R /altroot 10433152746165646153 olddata
> panics the kernel. It's a similar panic to the ones seen on all the
> other kernel versions.


> Gives a bit more info about things I've tried. Whatever it is seems to
> affect a wide variety of kernels.

The kernel is just the messenger here. The root cause is that while
ZFS does go an extra mile or two to ensure data consistency, there's
only so much it can do if the RAM is bad. Once that kind of problem
has happened, it may leave the pool in a state that ZFS cannot deal
with out of the box.

Not everything may be lost, though.

First of all -- make a copy of your pool, if it's feasible. The
probability of screwing it up even more is rather high.
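
For instance, if the whole pool lives on one disk, something along
these lines would capture a raw image you can restore from later (the
device and file names here are placeholders, not from your setup):

  # Raw image of the pool's disk; adjust device/path to match reality.
  dd if=/dev/ada1 of=/backup/olddata-ada1.img bs=1m conv=noerror,sync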

ZFS internally keeps a large number of uberblocks. Each uberblock is
a sort of periodic checkpoint of the pool state, written after ZFS
commits the next transaction group (every 10-40 seconds, depending on
the vfs.zfs.txg.timeout sysctl, and more often if there is a lot of
ongoing write activity). Basically, you need to destroy the most
recent uberblock to manually roll back your ZFS pool. Hopefully
you'll only need to nuke the few most recent ones to restore the pool
to a point before the corruption ruined it.
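
If you want to see what you are up against first, zdb can dump the
vdev labels and uberblocks together with their txg numbers. This is
just a sketch -- the flags differ a bit between ZFS versions, and the
device path and pool name are examples:

  # Dump the four labels of a pool member (each holds uberblocks):
  zdb -l /dev/ada1p3

  # Ask zdb about the uberblocks of the (exported) pool directly;
  # repeat -u for more verbosity:
  zdb -uuu -e olddata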

Now, ZFS keeps multiple copies of the uberblocks -- one in each of
the four labels on every vdev. You will need to nuke *all* instances
of the most recent uberblock in order to roll the pool state
backwards.

The Solaris internals site seems to have a script that does exactly
that now (I wish I had known about it back when I needed it):
http://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script
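
Also, the v28 zpool that 8.3 and 9.0 ship has a recovery mode that
attempts the same kind of rewind for you, so it may be worth trying
before editing labels by hand. I haven't tested it on a pool mangled
like yours, so treat this as a sketch:

  # Dry run: check whether discarding the last few transactions would
  # make the pool importable, without changing anything on disk:
  zpool import -F -n olddata

  # If that looks sane, do the actual rewind:
  zpool import -F -R /altroot olddata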

Good luck!

--Artem


