Message-ID: <507F8EFF.4020609@jrv.org>
Date: Thu, 18 Oct 2012 00:09:19 -0500
From: "James R. Van Artsdalen" <james@jrv.org>
User-Agent: Mozilla/5.0 (Windows NT 5.0;
 rv:12.0) Gecko/20120428 Thunderbird/12.0.1
MIME-Version: 1.0
To: Heikki Suonsivu <heikki@suonsivu.net>
Subject: Re: ZFS raidz2, errors in file?
References: <507EED58.80409@suonsivu.net>
In-Reply-To: <507EED58.80409@suonsivu.net>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: FS@freebsd.org

On 10/17/2012 12:39 PM, Heikki Suonsivu wrote:
> SMART data indicates problems on two other disks, but no indication of
> those are seen in logs (the disks work, but SMART information
> indicates problems).

The problems may be in areas ZFS has not tried to read.

> One disk indeed has pending sector, not unusual and should be survivable:
>
> ------------------------------------------------------------------------
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1

That error means one sector is unreadable and a replacement is pending;
the replacement will happen the next time the sector is overwritten.  The
contents of that sector are lost (unless some future read succeeds).

> In addition, there seems to be ICRC DMA errors on da0.  Looks nasty,
> but only show up in SMART log, not in /var/log/messages.
>
> ------------------------------------------------------------------------
> 199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       112

I believe that both of these messages refer to errors in transfers
between the disk and host, not to errors within the disk.  Test your
cabling and enclosures.
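
To make the distinction concrete, here is a rough sketch (not a real tool)
that reads attribute rows in the layout quoted above and sorts the nonzero
raw counts into "media" problems (sectors the disk cannot read) and "link"
problems (CRC errors on the transfer between disk and host).  The function
name classify() and the two name sets are my own assumptions for
illustration; the sample rows are the ones quoted in this thread.

```python
# Rows copied from the SMART output quoted in this thread.
SAMPLE = """\
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       112
"""

# Assumed grouping for illustration: which attributes point at the platters
# and which point at the cable/enclosure/controller path.
MEDIA_ATTRS = {"Current_Pending_Sector", "Offline_Uncorrectable"}
LINK_ATTRS = {"UDMA_CRC_Error_Count"}


def classify(text):
    """Return nonzero raw counts grouped as 'media' or 'link' problems."""
    result = {"media": {}, "link": {}}
    for line in text.splitlines():
        fields = line.split()
        # Expect: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if not raw.isdigit() or int(raw) == 0:
            continue
        if name in MEDIA_ATTRS:
            result["media"][name] = int(raw)
        elif name in LINK_ATTRS:
            result["link"][name] = int(raw)
    return result


if __name__ == "__main__":
    for kind, attrs in classify(SAMPLE).items():
        for name, count in attrs.items():
            print(f"{kind}: {name} = {count}")
```

A nonzero "link" count is a reason to look at cables and enclosures first;
a nonzero "media" count points at the disk itself.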

> SMART Error Log Version: 1
> ATA Error Count: 112 (device log contains only the most recent five
> errors)

I don't like these at all.  Consider replacing that disk.

> If the da0 ICRC errors would have been seen by ZFS, it should have
> made a) note of that in some log?  b) retried write?  c) Something
> else?  If we assume that the disk firmware is broken and does not
> report these to OS, so da0 might be corrupt.  But that should still be
> ok with raidz2.

These errors should trigger retries in the layers beneath ZFS.

> We do have 3 random SCSI timeouts, which were seen by FreeBSD, and
> thus should have prompted ZFS do handle the errors, and one read error
> on data, which is not reported as read error in any log, other than
> disk's SMART info says so.

The retries may have happened at a layer below ZFS.

ZFS does not call the disk driver directly.  Other layers play a role in
error handling.
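
As a toy model of why a transient transfer error might never show up in
anything ZFS logs, consider the sketch below.  It is not FreeBSD or ZFS
code, and every name in it (FlakyDisk, driver_read, filesystem_read) is
hypothetical.  It only shows the general idea: if a lower layer retries
and eventually succeeds, the layer above sees a normal read and has
nothing to report.

```python
class FlakyDisk:
    """Hypothetical disk that fails the first `failures` transfers."""

    def __init__(self, failures):
        self.failures = failures
        self.attempts = 0

    def read(self, block):
        self.attempts += 1
        if self.failures > 0:
            self.failures -= 1
            raise IOError("transfer CRC error")
        return f"data-{block}"


def driver_read(disk, block, retries=3):
    """Lower layer: retry a failed transfer up to `retries` more times."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return disk.read(block)
        except IOError as err:
            last_error = err
    # Only a persistent failure is passed up to the caller.
    raise last_error


def filesystem_read(disk, block, error_log):
    """Upper layer: it only learns of errors the lower layer gives up on."""
    try:
        return driver_read(disk, block)
    except IOError as err:
        error_log.append(str(err))
        return None


if __name__ == "__main__":
    log = []
    disk = FlakyDisk(failures=2)
    print(filesystem_read(disk, 7, log), disk.attempts, log)
```

With two transient failures the read succeeds on the third attempt and the
upper layer's log stays empty; only when the failures outlast the retries
does the upper layer see an error at all.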