Date: Sat, 13 Nov 2010 18:00:29 -0500 (EST)
From: Terry Kennedy <TERRY@tmk.com>
To: freebsd-stable@freebsd.org, freebsd-fs@freebsd.org
Message-id: <01NU7TBBN3D000BCHX@tmk.com>
Subject: ZFS panic after replacing log device

  I'm posting this to the freebsd-stable and freebsd-fs mailing lists.
Followups should probably happen on freebsd-fs.

  I have a ZFS pool configured as:

  zpool create data raidz da1 da2 da3 da4 da5 \
                    raidz da6 da7 da8 da9 da10 \
                    raidz da11 da12 da13 da14 da15 \
                    spare da16 log da0

where da1-da16 are WD2003FYYS drives (2 TB RE4) and da0 is a 256 GB
PCI-Express SSD (name omitted to protect the guilty).

  The SSD has been dropping offline randomly - it seems that one or more
flash modules pop out of their sockets and need to be re-seated frequently,
for some reason. The most recent time it did that, I replaced the SSD with
another one. (For some reason, the manufacturer ties the flash modules to a
particular controller, so just moving the modules over results in an offline
SSD and the inability to manage it, due to "license limits exceeded" or some
such nonsense.)

  ZFS wasn't happy with the log device being changed and reported it as
corrupted, with the suggested corrective action being to "zpool clear" it.
I did that, then did a "zpool replace data da0 da0", and it claimed to
resilver successfully. I then did a "zpool scrub", and the scrub completed
with no errors. So far, so good.

  However, any attempt to write to the array results in a near-immediate
panic:

  panic: solaris assert: sm->sm_space + size <= sm->sm_size, file:
  /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c,
  line: 93
  cpuid = 2

(There's a screenshot at http://www.tmk.com/transient/zfs-panic.png in case
I mis-typed something.) This is repeatable across reboot / scrub / test
cycles.

  The system is 8-STABLE as of Fri Nov 5 19:08:35 EDT 2010, and the on-disk
pool is version 4/15, the same as the kernel. I know that certain operations
on log devices aren't supported until pool version 19 or thereabouts, but
the error messages and zpool command output gave the impression that what I
was doing was supported and had worked (when it hadn't). If this is truly a
"you can't do that in pool version 15" situation, perhaps a warning could be
added so users don't get fooled into thinking it worked?

  I can give a developer remote console / root access to the box if that
would help. I have a couple of days before I will need to nuke the pool and
restore it from backups.

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA
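
PS: In case the exact steps matter, the sequence I ran after swapping the
SSD was approximately the following (from memory, so the precise arguments
to "zpool clear" may have differed slightly):

  # clear the "corrupted log device" error ZFS reported after the swap
  zpool clear data
  # replace the old log device with the new SSD, which came up at the
  # same device node (da0); this kicked off the resilver
  zpool replace data da0 da0
  # after the resilver finished, scrub and check status - both reported
  # no errors
  zpool scrub data
  zpool status -v data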