From owner-freebsd-stable@FreeBSD.ORG  Tue Jan 26 13:57:24 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1B2EB1065672
	for <freebsd-stable@freebsd.org>; Tue, 26 Jan 2010 13:57:24 +0000 (UTC)
	(envelope-from gerrit@pmp.uni-hannover.de)
Received: from mrelay1.uni-hannover.de (mrelay1.uni-hannover.de [130.75.2.106])
	by mx1.freebsd.org (Postfix) with ESMTP id 9922E8FC14
	for <freebsd-stable@freebsd.org>; Tue, 26 Jan 2010 13:57:23 +0000 (UTC)
Received: from www.pmp.uni-hannover.de (www.pmp.uni-hannover.de [130.75.117.2])
	by mrelay1.uni-hannover.de (8.14.2/8.14.2) with ESMTP id o0QDvKfJ015286;
	Tue, 26 Jan 2010 14:57:21 +0100
Received: from pmp.uni-hannover.de (arc.pmp.uni-hannover.de [130.75.117.1])
	by www.pmp.uni-hannover.de (Postfix) with SMTP
	id 30D5924; Tue, 26 Jan 2010 14:57:20 +0100 (CET)
Date: Tue, 26 Jan 2010 14:57:20 +0100
From: Gerrit =?ISO-8859-1?Q?K=FChn?= <gerrit@pmp.uni-hannover.de>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Message-Id: <20100126145720.ad9115ff.gerrit@pmp.uni-hannover.de>
In-Reply-To: <20100119112449.GA73052@icarus.home.lan>
References: <4B54C100.9080906@mail.zedat.fu-berlin.de>
	<4B54C5EE.5070305@pp.dyndns.biz>
	<201001191250.23625.doconnor@gsoft.com.au>
	<7346c5c61001181841j3653a7c3m32bc033c8c146a92@mail.gmail.com>
	<4B557B5A.8040902@pp.dyndns.biz>
	<20100119095736.GA71824@icarus.home.lan>
	<20100119110724.ec01a3ed.gerrit@pmp.uni-hannover.de>
	<20100119112449.GA73052@icarus.home.lan>
Organization: Albert-Einstein-Institut (MPI =?ISO-8859-1?Q?f=FCr?=
	Gravitationsphysik & IGP =?ISO-8859-1?Q?Universit=E4t?= Hannover)
X-Mailer: Sylpheed 2.7.1 (GTK+ 2.18.4; i386-portbld-freebsd7.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-PMX-Version: 5.5.9.388399, Antispam-Engine: 2.7.2.376379,
	Antispam-Data: 2010.1.26.134534
Cc: freebsd-stable@freebsd.org
Subject: Re: immense delayed write to file system (ZFS and UFS2),
 performance issues
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 26 Jan 2010 13:57:24 -0000

On Tue, 19 Jan 2010 03:24:49 -0800 Jeremy Chadwick
<freebsd@jdc.parodius.com> wrote about Re: immense delayed write to file
system (ZFS and UFS2), performance issues:

JC> So which drive models above are experiencing a continual increase in
JC> SMART attribute 193 (Load Cycle Count)?  My guess is that some of the
JC> WD Caviar Green models, and possibly all of the RE2-GP and RE4-GP
JC> models are experiencing this problem.

Just to add some more info:
I contacted WD support about the problem with RE4 drives and received a
firmware update by email today which is supposed to fix the problem. I have
not tried it yet, though.


I am still busy replacing RE2 disks with updated drives, and I came across
a very strange thing with ZFS. Originally I had the following pool layout:

mclane# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
        spares
          ad14      AVAIL   

errors: No known data errors

All disks still have the firmware bug, so I want to replace them with
disks that I have already fixed. I put in an updated drive as ad18 and
wanted to replace ad12 to get the drive with the broken firmware out:

mclane# zpool replace tank /dev/ad12 /dev/ad18 
mclane# zpool status
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.01% done, 52h51m to go
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          raidz1       ONLINE       0     0     0
            ad8        ONLINE       0     0     0  7.21M resilvered
            ad10       ONLINE       0     0     0  7.22M resilvered
            replacing  ONLINE       0     0     0
              ad12     ONLINE       0     0     0
              ad18     ONLINE       0     0     0  10.7M resilvered
        spares
          ad14         AVAIL   

errors: No known data errors

However, something must have gone wrong during the resilvering process and
it now looks like this:

mclane# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 2h39m with 0 errors on Tue Jan 26 14:00:00 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            ad8        ONLINE       0     0     0  975M resilvered
            ad10       ONLINE       0     0   142  974M resilvered
            replacing  DEGRADED     0 7.25M     0
              ad12     ONLINE       0     0     0
              ad18     REMOVED      0     1     0  79.4M resilvered
        spares
          ad14         AVAIL   

errors: No known data errors
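
In case someone wants to pick the problem rows out of such output by
script, here is a minimal sketch in Python. It is only an illustration: the
function name suspicious_devices and the regular expression are my own
assumptions, based purely on the column layout shown above (NAME, STATE,
READ, WRITE, CKSUM), and the sample text is copied from that output.

```python
import re

# Sample copied from the config section of the status output above.
SAMPLE = """\
        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            ad8        ONLINE       0     0     0  975M resilvered
            ad10       ONLINE       0     0   142  974M resilvered
            replacing  DEGRADED     0 7.25M     0
              ad12     ONLINE       0     0     0
              ad18     REMOVED      0     1     0  79.4M resilvered
"""

# Assumed row layout: name, state, then three counters. The counters are
# kept as strings because zpool abbreviates large values (e.g. 7.25M).
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)")


def suspicious_devices(config_text):
    """Return (name, state, read, write, cksum) for every row that is
    not ONLINE or that shows a nonzero error counter."""
    hits = []
    for line in config_text.splitlines():
        m = ROW.match(line)
        if m is None or m.group(1) == "NAME":
            continue  # skip the header row and lines without counters
        name, state, read, write, cksum = m.groups()
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            hits.append((name, state, read, write, cksum))
    return hits


for row in suspicious_devices(SAMPLE):
    print(*row)
```

For the output above this flags tank, raidz1, ad10, replacing and ad18,
while ad8 and ad12 pass as clean.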


What is going on here? ad18 apparently detached during the resilver;
/var/log/messages just gives me

Jan 26 11:23:33 mclane kernel: ad18: FAILURE - device detached

Additionally, ad10 produced checksum errors (142 in the CKSUM column). What
do I do about the degraded replacing process? Can I terminate it somehow
and maybe replace ad10 first? Any other hints?


cu
  Gerrit