From owner-freebsd-stable@FreeBSD.ORG Tue Jan 26 13:57:24 2010
Date: Tue, 26 Jan 2010 14:57:20 +0100
From: Gerrit Kühn
To: Jeremy Chadwick
Cc: freebsd-stable@freebsd.org
Subject: Re: immense delayed write to file system (ZFS and UFS2), performance issues
Message-Id: <20100126145720.ad9115ff.gerrit@pmp.uni-hannover.de>
In-Reply-To: <20100119112449.GA73052@icarus.home.lan>
Organization: Albert-Einstein-Institut (MPI für Gravitationsphysik & IGP Universität Hannover)
On Tue, 19 Jan 2010 03:24:49 -0800 Jeremy Chadwick wrote about Re:
immense delayed write to file system (ZFS and UFS2), performance issues:

JC> So which drive models above are experiencing a continual increase in
JC> SMART attribute 193 (Load Cycle Count)? My guess is that some of the
JC> WD Caviar Green models, and possibly all of the RE2-GP and RE4-GP
JC> models are experiencing this problem.

Just to add some more info: I contacted WD support about the problem with
the RE4 drives and received a firmware update by email today, which is
supposed to fix the problem. I have not tried it yet, though. I am still
busy replacing RE2 disks with updated drives.

I came across a very strange thing with ZFS. I had the following pool
layout:

mclane# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
        spares
          ad14      AVAIL

errors: No known data errors

All of these disks still have the firmware bug, so I want to replace them
with disks I have already updated. I put in an updated drive as ad18 and
wanted to replace ad12 with it to get the drive with the broken firmware
out of the pool:

mclane# zpool replace tank /dev/ad12 /dev/ad18
mclane# zpool status
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.01% done, 52h51m to go
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          raidz1       ONLINE       0     0     0
            ad8        ONLINE       0     0     0  7.21M resilvered
            ad10       ONLINE       0     0     0  7.22M resilvered
            replacing  ONLINE       0     0     0
              ad12     ONLINE       0     0     0
              ad18     ONLINE       0     0     0  10.7M resilvered
        spares
          ad14         AVAIL

errors: No known data errors

However, something must have gone wrong during the resilver, because the
pool now looks like this:

mclane# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 2h39m with 0 errors on Tue Jan 26 14:00:00 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            ad8        ONLINE       0     0     0  975M resilvered
            ad10       ONLINE       0     0   142  974M resilvered
            replacing  DEGRADED     0 7.25M     0
              ad12     ONLINE       0     0     0
              ad18     REMOVED      0     1     0  79.4M resilvered
        spares
          ad14         AVAIL

errors: No known data errors

What is going on here? ad18 obviously detached during the process;
/var/log/messages only gives me

Jan 26 11:23:33 mclane kernel: ad18: FAILURE - device detached

Additionally, ad10 apparently produced checksum errors (142 in the CKSUM
column).

What do I do about the degraded replacing process? Can I terminate it
somehow and maybe replace ad10 first? Any other hints?

cu
Gerrit