Date: Tue, 14 Apr 2020 23:03:08 -0700
From: Xin Li <delphij@delphij.net>
To: freebsd-fs <freebsd-fs@freebsd.org>
Subject: zpool question -- resilvering doesn't fully check on-disk data for corruption?
Message-ID: <bc5add51-3094-9e6b-1054-821ac18265a7@delphij.net>
Hi,

I have recently seen a bad drive on my home storage server. The bad drive occasionally had timeouts that would eventually cause the CAM subsystem to kick it off, like:

  (ada1:ahcich11:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
  (ada1:ahcich11:0:0:0): CAM status: Command timeout
  (ada1:ahcich11:0:0:0): Retrying command, 0 more tries remain
  ada1 at ahcich11 bus 0 scbus11 target 0 lun 0
  ada1: <WDC WD40EFRX-68WT0N0 80.00A80> s/n WD-WMC4E0090978 detached
  (ada1:ahcich11:0:0:0): Periph destroyed

When this happens, a full 'camcontrol reset all' and 'camcontrol rescan all' would bring it back, and ZFS would correctly start a resilvering process as expected. After the resilvering, zpool would detect several checksum errors (also expected).

As a precautionary measure, I usually start another zpool scrub to check data integrity again when this happens. To my surprise, the last few times that drive timed out, the zpool scrub also found some checksum errors and corrected them (the drive is in a RAID-Z pool). A second run of 'zpool scrub' after that would no longer find any checksum errors.

I initially thought this was probably because there were some bad blocks on the bad hard drive, and I didn't pay much attention since I had already ordered a new hard drive to replace it. When the new drive arrived, I initiated a 'zpool replace' with both the bad and the new drive attached (which starts a resilver too; I didn't run a zpool scrub the last time the timeout happened, because the scrub was very slow and I feared that I might end up causing more damage to the bad drive before the new drive arrived). To my surprise, however, the zpool scrub after the replacement resilver detected new checksum errors on the newly attached drive.

Is this expected? (My understanding is that both resilver and scrub read all data from a RAID-Z pool and therefore check the checksums of all blocks. When replacing a drive, checksum errors shouldn't really happen on the new drive, because the data written to it was already checksummed. The system is equipped with ECC RAM, etc.; I know there is a possibility that the disk controller or the disk itself may still introduce bit flips, etc. if I'm really unlucky, but if that's the case I think I should have seen errors more often...)

Cheers,