Date: Tue, 26 Jan 2010 06:30:21 -0800
From: Jeremy Chadwick
To: Gerrit Kühn
Cc: freebsd-stable@freebsd.org
Subject: Re: ZFS "zpool replace" problems
Message-ID: <20100126143021.GA47535@icarus.home.lan>

I'm removing the In-Reply-To mail headers for this thread, as you've now
hijacked it for a different purpose.  Please don't do this; start a new
thread altogether.  :-)

On Tue, Jan 26, 2010 at 02:57:20PM +0100, Gerrit Kühn wrote:
> I am still busy replacing RE2-disks with updated drives. I came across a
> very strange thing with zfs.
> Actually I had the following pool layout:
>
> mclane# zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           ONLINE       0     0     0
>           raidz1       ONLINE       0     0     0
>             ad8        ONLINE       0     0     0
>             ad10       ONLINE       0     0     0
>             ad12       ONLINE       0     0     0
>         spares
>           ad14         AVAIL
>
> errors: No known data errors
>
> All disks still have the firmware bug, so I want to replace them with
> disks that I already fixed. I put in a updated drive as ad18 and
> wanted to replace ad12 to get the drive with the broken firmware out:
>
> mclane# zpool replace tank /dev/ad12 /dev/ad18
> mclane# zpool status
>   pool: tank
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>         continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 0h0m, 0.01% done, 52h51m to go
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           ONLINE       0     0     0
>           raidz1       ONLINE       0     0     0
>             ad8        ONLINE       0     0     0  7.21M resilvered
>             ad10       ONLINE       0     0     0  7.22M resilvered
>             replacing  ONLINE       0     0     0
>               ad12     ONLINE       0     0     0
>               ad18     ONLINE       0     0     0  10.7M resilvered
>         spares
>           ad14         AVAIL
>
> errors: No known data errors
>
> However, something must have gone wrong during the resilvering process and
> it now looks like this:
>
> mclane# zpool status
>   pool: tank
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and
>         clear the errors using 'zpool clear' or replace the device with
>         'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: resilver completed after 2h39m with 0 errors on Tue Jan 26
>         14:00:00 2010
> config:
>
>         NAME           STATE     READ WRITE CKSUM
>         tank           DEGRADED     0     0     0
>           raidz1       DEGRADED     0     0     0
>             ad8        ONLINE       0     0     0  975M resilvered
>             ad10       ONLINE       0     0   142  974M resilvered
>             replacing  DEGRADED     0 7.25M     0
>               ad12     ONLINE       0     0     0
>               ad18     REMOVED      0     1     0  79.4M resilvered
>         spares
>           ad14         AVAIL
>
> errors: No known data errors
>
>
> What is going on here? ad18 obviously detached during the
> process. /var/log/messages just gives me
>
> Jan 26 11:23:33 mclane kernel: ad18: FAILURE - device detached
>
> Additionally ad10 obviously produced chksum errors. What do I do about the
> degraded replacing process? Can I terminate it somehow and maybe replace
> ad10 first? Any other hints?

I'm not sure how the above is supposed to work (I haven't personally
tried it), but:

1) Why didn't you offline the ad10 disk first?  zpool offline tank ad10

2) How did you attach ad18?  Did you tell the system about it using
   atacontrol?  If so, what commands did you use?

3) Can you please provide uname -a output, as well as relevant dmesg
   output to show what kind of SATA controller you have, what's attached
   to what, etc.?

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |