From: Xin Li
Reply-To: d@delphij.net
To: freebsd-fs
Date: Tue, 14 Apr 2020 23:03:08 -0700
Subject: zpool question -- resilvering doesn't fully check on-disk data for corruption?

Hi,

I recently had a drive go bad on my home storage server. The drive would occasionally time out, which eventually caused the CAM subsystem to drop it, like this:

(ada1:ahcich11:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich11:0:0:0): CAM status: Command timeout
(ada1:ahcich11:0:0:0): Retrying command, 0 more tries remain
ada1 at ahcich11 bus 0 scbus11 target 0 lun 0
ada1: s/n WD-WMC4E0090978 detached
(ada1:ahcich11:0:0:0): Periph destroyed

When this happens, a full 'camcontrol reset all' followed by 'camcontrol rescan all' brings the drive back, and ZFS correctly starts a resilver, as expected. After the resilver, zpool reports several checksum errors (also expected).

As a precaution, I usually start a zpool scrub afterwards to check data integrity again. To my surprise, the last few times the drive timed out, that zpool scrub also found some checksum errors and corrected them (the drive is in a RAID-Z pool). A second run of 'zpool scrub' after that no longer found any checksum errors.

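A minimal sketch of how the per-device checksum counts could be compared between the resilver and the following scrubs, assuming the usual 'zpool status' column layout (NAME STATE READ WRITE CKSUM). The pool name "tank", the device list and the sample text are made up for illustration; they are not output from the server described here.

```python
def cksum_errors(status_text):
    """Return {device: CKSUM count} parsed from 'zpool status' text.

    Assumes the config table starts at the NAME/STATE/READ/WRITE/CKSUM
    header and ends at the next blank line.
    """
    counts = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config:
            continue
        if not fields:
            # A blank line closes the device table.
            in_config = False
            continue
        # Skip rows without a plain integer in the CKSUM column.
        if len(fields) >= 5 and fields[4].isdigit():
            counts[fields[0]] = int(fields[4])
    return counts


def devices_with_errors(status_text):
    """Return only the devices whose CKSUM count is non-zero."""
    return {dev: n for dev, n in cksum_errors(status_text).items() if n}


# Hypothetical sample, not real output from the pool in this message.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    ada0    ONLINE       0     0     0
	    ada1    ONLINE       0     0     7
	    ada2    ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    print(devices_with_errors(SAMPLE))
```

On a live system the text would come from running 'zpool status' on the pool after each resilver or scrub; saving the parsed counts from each run makes it easy to see which device the new errors were charged to.
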
I initially assumed this was caused by bad blocks on the failing drive, and didn't pay much attention since I had already ordered a replacement. (I skipped the zpool scrub after the last timeout, because the scrub was very slow and I feared it might further damage the bad drive before the new one arrived.) When the new drive arrived, I ran 'zpool replace' with both the bad drive and the new drive attached, which also starts a resilver. To my surprise, the zpool scrub I ran after the replacement resilver detected new checksum errors on the newly attached drive.

Is this expected? (My understanding is that both resilver and scrub read all data in a RAID-Z pool and therefore verify the checksums of all blocks, and that during a replace the data written to the new drive has already been checksummed, so checksum errors shouldn't really happen on the new drive. The system is equipped with ECC RAM, etc. I know the disk controller or the disk itself could still introduce bit flips if I'm really unlucky, but if that were the case I think I would have seen errors more often...)

Cheers,