Date: Thu, 18 Oct 2012 00:09:19 -0500
From: "James R. Van Artsdalen" <james@jrv.org>
To: Heikki Suonsivu <heikki@suonsivu.net>
Cc: FS@freebsd.org
Subject: Re: ZFS raidz2, errors in file?
Message-ID: <507F8EFF.4020609@jrv.org>
In-Reply-To: <507EED58.80409@suonsivu.net>
References: <507EED58.80409@suonsivu.net>

On 10/17/2012 12:39 PM, Heikki Suonsivu wrote:
> SMART data indicates problems on two other disks, but no indication of
> those are seen in logs (the disks work, but SMART information
> indicates problems).

The problems may be in areas ZFS has not tried to read.

> One disk indeed has pending sector, not unusual and should be survivable:
>
> ------------------------------------------------------------------------
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1

That error means one sector is unreadable and a replacement is pending;
the replacement will happen the next time the sector is overwritten.
The contents of that sector are lost (unless some future read succeeds).

> In addition, there seems to be ICRC DMA errors on da0. Looks nasty,
> but only show up in SMART log, not in /var/log/messages.
>
> ------------------------------------------------------------------------
> 199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       112

I believe that both of these messages refer to errors in transfers
between the disk and host, not to errors within the disk. Test your
cabling and enclosures.

> SMART Error Log Version: 1
> ATA Error Count: 112 (device log contains only the most recent five
> errors)

I don't like these at all. Consider replacing that disk.

> If the da0 ICRC errors would have been seen by ZFS, it should have
> made a) note of that in some log? b) retried write? c) Something
> else? If we assume that the disk firmware is broken and does not
> report these to OS, so da0 might be corrupt. But that should still be
> ok with raidz2.

These errors should trigger retries in layers beneath ZFS.

> We do have 3 random SCSI timeouts, which were seen by FreeBSD, and
> thus should have prompted ZFS do handle the errors, and one read error
> on data, which is not reported as read error in any log, other than
> disk's SMART info says so.

The retries may have happened at a layer below ZFS. ZFS does not call
the disk driver directly. Other layers play a role in error handling.
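
To make the distinction above concrete, here is a rough sketch that scans
SMART attribute rows and separates nonzero counters into problems inside
the disk (197, 198) and problems on the link between disk and host (199).
It is only an illustration: the function name classify_smart, the two
grouping tables and the advice strings are my own assumptions, and it
expects rows in the ten-column layout quoted above (as printed by
smartctl -A). The sample rows are copied from the two excerpts quoted in
this message and are combined only for the example.

```python
# Sketch only: sort nonzero SMART raw values into "media" vs "link" problems.
# The grouping below is an assumption for illustration, not a smartmontools API.

MEDIA_IDS = {197: "Current_Pending_Sector", 198: "Offline_Uncorrectable"}
LINK_IDS = {199: "UDMA_CRC_Error_Count"}


def classify_smart(text):
    """Return (media, link) dicts mapping attribute ID to nonzero raw value."""
    media, link = {}, {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID;
        # this also skips the "ID# ATTRIBUTE_NAME ..." header row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        try:
            raw = int(fields[9])
        except ValueError:
            continue  # some attributes carry non-integer raw values
        if raw == 0:
            continue
        if attr_id in MEDIA_IDS:
            media[attr_id] = raw
        elif attr_id in LINK_IDS:
            link[attr_id] = raw
    return media, link


# Rows taken from the excerpts quoted in this message, combined for the example.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       112
"""

if __name__ == "__main__":
    media, link = classify_smart(SAMPLE)
    for attr_id, raw in media.items():
        print(f"{MEDIA_IDS[attr_id]}={raw}: sector-level problem inside the disk")
    for attr_id, raw in link.items():
        print(f"{LINK_IDS[attr_id]}={raw}: transfer errors, check cabling and enclosure")
```

A nonzero entry in the link group points at cabling or the enclosure
first; a nonzero entry in the media group points at the disk itself.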