MIME-version: 1.0
Content-type: text/plain; charset=iso-8859-1
From: Chuck Swiger <cswiger@mac.com>
In-reply-to: <20100126172503.927e1bb5.gerrit@pmp.uni-hannover.de>
Date: Tue, 26 Jan 2010 08:59:27 -0800
Content-transfer-encoding: quoted-printable
Message-id: <5F20B2B6-D75C-4E27-9CC9-85C6E64D13BD@mac.com>
References: <20100126143021.GA47535@icarus.home.lan>
	<20100126160320.6ed67b92.gerrit@pmp.uni-hannover.de>
	<FA0BAC0D-35A7-4296-B52C-9D4D8A6CC609@mac.com>
	<20100126172503.927e1bb5.gerrit@pmp.uni-hannover.de>
To: Gerrit Kühn <gerrit@pmp.uni-hannover.de>
X-Mailer: Apple Mail (2.1077)
Cc: freebsd-stable@freebsd.org
Subject: Re: ZFS "zpool replace" problems

Hi--

On Jan 26, 2010, at 8:25 AM, Gerrit Kühn wrote:
> CS> There's your problem-- the Silicon Image 3112/4 chips are remarkably
> CS> buggy and exhibit data corruption:
>
> Hm, sure?

I'm sure that the SiI 3112 is buggy.
I am not sure that it is the primary or only cause of the problems you
describe.

[ ... ]
> I already thought about replacing the controller to get rid of the
> detach-problem. However, I cannot do this online and I really would prefer
> fixing the disk firmware problem first.
> I could remove the hotspare drive ad14 and use this slot for putting in a
> replacement disk. Is it possible to get ad18 out of zfs' replacing
> process? Maybe by detaching the disk from the pool?

I don't know enough about ZFS to provide specific advice for recovery
attempts (aside from the notion of restoring your data from a backup
instead).

As a general matter of maintaining RAID systems, however, the approach
to upgrading drive firmware on members of a RAID array should be to take
down the entire container and offline the drives, update one drive, test
it (via SMART self-test and read-only checksum comparison or similar),
and then proceed to update all of the drives (preferably doing the SMART
self-test for each, if time allows) before returning them to the RAID
container and onlining them.
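
To illustrate what a read-only checksum comparison could look like, here
is a minimal Python sketch. It is only an illustration under stated
assumptions: the function name device_checksum, the chunk size, and the
optional byte limit are hypothetical choices, not part of any particular
tool. The idea is to hash the drive's contents once before the firmware
update and once after, and to confirm that the two digests match.

```python
import hashlib


def device_checksum(path, chunk_size=1024 * 1024, limit=None):
    """Return the SHA-256 hex digest of the data at `path`.

    The path is opened read-only ("rb"), so nothing is written to it.
    `limit` (hypothetical convenience parameter) caps how many bytes are
    hashed; None means read until end of file or device.
    """
    digest = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as dev:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            block = dev.read(want)
            if not block:
                break  # end of file or device
            digest.update(block)
            if remaining is not None:
                remaining -= len(block)
    return digest.hexdigest()


def contents_unchanged(digest_before, digest_after):
    """True if the digest taken before the update matches the one after."""
    return digest_before == digest_after
```

In use, one would record device_checksum() for the offlined drive before
flashing it, run it again afterwards, and pass both values to
contents_unchanged(). When reading a raw disk device rather than a file,
chunk_size and limit should be multiples of the sector size. Hashing a
whole drive takes a while, so this complements the SMART self-test rather
than replacing it.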

Pulling individual drives from a RAID set while live and updating the
firmware one at a time is not an approach I would take-- running with
mixed firmware versions doesn't thrill me, and I know of multiple cases
where someone made a mistake reconnecting a drive with the wrong SCSI ID
or something like that, taking out a second drive while the RAID was not
redundant, resulting in massive data corruption or even total loss of
the RAID contents.

Regards,
-- 
-Chuck