Date:      Sat, 13 Nov 2010 18:00:29 -0500 (EST)
From:      Terry Kennedy <TERRY@tmk.com>
To:        freebsd-stable@freebsd.org, freebsd-fs@freebsd.org
Subject:   ZFS panic after replacing log device
Message-ID:  <01NU7TBBN3D000BCHX@tmk.com>

I'm posting this to the freebsd-stable and freebsd-fs mailing lists. Followups
should probably happen on freebsd-fs.

I have a ZFS pool configured as:

zpool create data raidz da1 da2 da3 da4 da5 raidz da6 da7 da8 da9 da10 
raidz da11 da12 da13 da14 da15 spare da16 log da0

where da1-16 are WD2003FYYS drives (2TB RE4) and da0 is a 256GB PCI-Express
SSD (name omitted to protect the guilty).

The SSD has been dropping offline randomly - it seems that one or more flash 
modules pop out of their sockets and need to be re-seated frequently for some 
reason.

The most recent time it did that, I replaced the SSD with another one (for some 
reason, the manufacturer ties the flash modules to a particular controller, so 
just moving the modules results in an offline SSD and inability to manage it 
due to "license limits exceeded" or some such nonsense).

ZFS wasn't happy with the log device being changed, and reported it as 
corrupted, with the suggested corrective action being to "zpool clear" it. I 
did that, and then did a "zpool replace data da0 da0" and it claimed to 
successfully resilver it. I then did a "zpool scrub" and the scrub completed 
with no errors. So far, so good.

However, any attempt to write to the array results in a near-immediate panic:

panic: solaris assert: sm->sm_space + size <= sm->sm_size, file: 
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c, 
line: 93 cpuid=2

(Screenshot at http://www.tmk.com/transient/zfs-panic.png in case I mis-typed
something).
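
For readers unfamiliar with that assertion, the following is a toy sketch of
the invariant it appears to check: the space already tracked by a space map,
plus the segment being added, must not exceed the size of the region the map
covers. This is not the ZFS source. The struct, the function name, and the
use of "sm_space" for the running total are assumptions made for
illustration only, and the sketch returns false where the kernel would panic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for space map bookkeeping; NOT the real space_map_t. */
struct toy_space_map {
    uint64_t sm_size;   /* total size of the region the map covers */
    uint64_t sm_space;  /* sum of all segments currently in the map */
};

/*
 * Record a segment of `size` bytes in the toy map.
 * Returns false (and leaves the map unchanged) when the invariant
 *     sm_space + size <= sm_size
 * would be violated, which is the condition the panic reports.
 */
static bool
toy_space_map_add(struct toy_space_map *sm, uint64_t size)
{
    /* An empty segment is rejected in this sketch. */
    if (size == 0)
        return false;

    /* Compare by subtraction so the check cannot overflow. */
    if (sm->sm_space > sm->sm_size || size > sm->sm_size - sm->sm_space)
        return false;

    sm->sm_space += size;
    return true;
}
```

With a map of size 100 that already tracks 90, adding 10 succeeds and adding
one more byte after that fails. A failure of this kind suggests the on-disk
accounting for the device no longer matches what the kernel expects, though
the sketch says nothing about how the pool got into that state.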

This is repeatable across reboot / scrub / test cycles. The system is 8-STABLE 
as of Fri Nov  5 19:08:35 EDT 2010, and the on-disk pool is version 4/15, the 
same as the kernel.

I know that certain operations on log devices aren't supported until pool 
version 19 or thereabouts, but the error messages and zpool command results 
gave the impression that what I was doing was supported and worked (when it 
didn't). If this is truly a "you can't do that in pool version 15", perhaps a 
warning could be added so users don't get fooled into thinking it worked?

I can give a developer remote console / root access to the box if that would 
help. I have a couple days before I will need to nuke the pool and restore it 
from backups.

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA


