Date:      Thu, 18 Oct 2012 00:09:19 -0500
From:      "James R. Van Artsdalen" <james@jrv.org>
To:        Heikki Suonsivu <heikki@suonsivu.net>
Cc:        FS@freebsd.org
Subject:   Re: ZFS raidz2, errors in file?
Message-ID:  <507F8EFF.4020609@jrv.org>
In-Reply-To: <507EED58.80409@suonsivu.net>
References:  <507EED58.80409@suonsivu.net>

On 10/17/2012 12:39 PM, Heikki Suonsivu wrote:
> SMART data indicates problems on two other disks, but no indication of
> those are seen in logs (the disks work, but SMART information
> indicates problems).

The problems may be in areas ZFS has not tried to read.

> One disk indeed has pending sector, not unusual and should be survivable:
>
> ------------------------------------------------------------------------
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age  Always   -           1
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age  Offline  -           1

That error means one sector is unreadable and a replacement is pending;
the replacement will happen the next time the sector is overwritten.  The
contents of that sector are lost (unless some future read succeeds).

> In addition, there seems to be ICRC DMA errors on da0.  Looks nasty,
> but only show up in SMART log, not in /var/log/messages.
>
> ------------------------------------------------------------------------
> 199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age  Always   -           112

I believe that both of these messages refer to errors in transfers
between the disk and host, not to errors within the disk.  Test your
cabling and enclosures.
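
A rough sketch of that distinction, in Python, for anyone who wants to
sort SMART attributes by where the fault probably lies.  It assumes the
text comes from something like `smartctl -A` with each attribute on a
single line.  The function name classify_smart and the grouping of IDs
197/198 as "media" and 199 as "link" are illustrative assumptions based
on the reading above, not anything smartmontools itself provides.

```python
# Attribute IDs taken from the SMART output quoted in this thread.
# The media/link grouping is an assumption for illustration.
MEDIA_IDS = {197, 198}   # Current_Pending_Sector, Offline_Uncorrectable
LINK_IDS = {199}         # UDMA_CRC_Error_Count


def classify_smart(text):
    """Group nonzero raw values into 'media' (on-disk) and 'link' (transfer)."""
    result = {"media": {}, "link": {}}
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") in front of each row.
        fields = line.lstrip("> ").split()
        # A full row has 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw_value.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if not raw.isdigit() or int(raw) == 0:
            continue
        if attr_id in MEDIA_IDS:
            result["media"][name] = int(raw)
        elif attr_id in LINK_IDS:
            result["link"][name] = int(raw)
    return result


sample = """\
197 Current_Pending_Sector  0x0032   200   200   000    Old_age  Always   -   1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age  Offline  -   1
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age  Always   -   112
"""
print(classify_smart(sample))
```

Anything that lands under "link" points at cabling or the enclosure;
anything under "media" points at the disk surface itself.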

> SMART Error Log Version: 1
> ATA Error Count: 112 (device log contains only the most recent five
> errors)

I don't like these at all.  Consider replacing that disk.

> If the da0 ICRC errors would have been seen by ZFS, it should have
> made a) note of that in some log?  b) retried write?  c) Something
> else?  If we assume that the disk firmware is broken and does not
> report these to OS, so da0 might be corrupt.  But that should still be
> ok with raidz2.

These errors should trigger retries in layers beneath ZFS.

> We do have 3 random SCSI timeouts, which were seen by FreeBSD, and
> thus should have prompted ZFS do handle the errors, and one read error
> on data, which is not reported as read error in any log, other than
> disk's SMART info says so.

The retries may have happened at a layer below ZFS.

ZFS does not call the disk driver directly.  Other layers play a role in
error handling.


