Date:      Tue, 14 Apr 2020 23:03:08 -0700
From:      Xin Li <delphij@delphij.net>
To:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   zpool question -- resilvering doesn't fully check on-disk data for corruption?
Message-ID:  <bc5add51-3094-9e6b-1054-821ac18265a7@delphij.net>

Hi,

One of the drives in my home storage server has recently gone bad.  It
occasionally times out, and eventually the CAM subsystem detaches it,
like this:

(ada1:ahcich11:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich11:0:0:0): CAM status: Command timeout
(ada1:ahcich11:0:0:0): Retrying command, 0 more tries remain
ada1 at ahcich11 bus 0 scbus11 target 0 lun 0
ada1: <WDC WD40EFRX-68WT0N0 80.00A80> s/n WD-WMC4E0090978 detached
(ada1:ahcich11:0:0:0): Periph destroyed

When this happens, running 'camcontrol reset all' followed by
'camcontrol rescan all' brings the drive back, and ZFS starts a resilver
as expected.  After the resilver, zpool reports several checksum errors
(also expected).
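
For reference, the recovery sequence above could be scripted roughly as
follows.  This is only a sketch: 'run' and 'bring_drive_back' are
made-up helper names, 'tank' is a placeholder pool name, and with
DRY_RUN=1 (the default here) the commands are only printed, not
executed.

```sh
#!/bin/sh
# Sketch of the recovery sequence described above.
# Assumptions: POOL is a placeholder pool name; DRY_RUN=1 only prints commands.
POOL="${POOL:-tank}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Reset and rescan all CAM buses so the detached drive reappears,
# then show pool status to watch the resilver that ZFS starts.
bring_drive_back() {
    run camcontrol reset all
    run camcontrol rescan all
    run zpool status -v "$POOL"
}
```

Calling 'bring_drive_back' with DRY_RUN=0 as root would actually issue
the commands.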

As a precaution, I usually start a zpool scrub afterwards to check data
integrity again.  To my surprise, the last few times the drive timed
out, the scrub also found some checksum errors and corrected them (the
drive is in a RAID-Z pool).  A second 'zpool scrub' right after that
found no further checksum errors.
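
To compare the checksum counters between scrub runs, a small filter
over 'zpool status' output along these lines might help.  It is a
sketch only: 'cksum_errors' is a made-up helper name, and it assumes the
usual "NAME STATE READ WRITE CKSUM" column layout of the config table.

```sh
#!/bin/sh
# Read `zpool status` output on stdin and print "<device> <cksum>" for
# every row of the config table whose CKSUM column is not "0".
# Assumption: the table starts at the NAME ... CKSUM header line and
# ends at the next blank line.
cksum_errors() {
    awk '
        $1 == "NAME" && $NF == "CKSUM" { in_cfg = 1; next }  # table header
        in_cfg && NF == 0              { in_cfg = 0 }        # end of table
        in_cfg && NF >= 5 && $5 != "0" { print $1, $5 }      # non-zero CKSUM
    '
}
```

With a pool called 'tank' (a placeholder name), usage would be
'zpool status tank | cksum_errors' after each scrub completes; empty
output means no device shows checksum errors.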

I initially assumed this was caused by bad blocks on the failing drive
and didn't pay much attention, as I had already ordered a replacement.
When the new drive arrived, I started a 'zpool replace' with both the
bad and the new drive attached, which also triggers a resilver.  (I had
not run a zpool scrub after the most recent timeout, because the scrub
was very slow and I feared it might do more damage to the bad drive
before the new one arrived.)  To my surprise, however, the zpool scrub
after the replacement resilver detected new checksum errors on the
newly attached drive.

Is this expected?  My understanding is that both resilver and scrub
read all data from a RAID-Z pool and therefore verify the checksums of
all blocks, and that during a replace the data written to the new drive
has already been checksummed, so checksum errors shouldn't really show
up on the new drive.  The system is equipped with ECC RAM, etc.  I know
the disk controller or the disk itself could still introduce bit flips
if I'm really unlucky, but if that were the case I think I would have
seen errors more often.

Cheers,
