Date: Sat, 26 Mar 2022 16:45:57 +0000
From: "Bram Van Steenlandt" <bram@diomedia.be>
To: "Freebsd Questions" <freebsd-questions@freebsd.org>
Subject: zfs mirror pool online but drives have read errors
Message-ID: <emf36013e4-0469-47cd-a99d-d06600df1565@winserver>
Hi all,

English is not my native language, so sorry about any errors.

I'm experiencing something which I don't fully understand; maybe someone here can offer some insight.

I have a ZFS mirror of two Samsung 980 Pro 2TB NVMe drives. According to ZFS the pool is online. It did repair 54M on the last scrub; I ran another scrub today and again repairs were needed (only 128K this time).

  pool: zextra
 state: ONLINE
  scan: scrub repaired 54M in 0 days 00:41:42 with 0 errors on Thu Mar 24 09:44:02 2022
config:

        NAME        STATE     READ WRITE CKSUM
        zextra      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nvd2    ONLINE       0     0     0
            nvd3    ONLINE       0     0     0

errors: No known data errors

In dmesg I have messages like this:

nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0
nvme2: READ sqid:8 cid:119 nsid:1 lba:3831589512 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:8 cid:119 cdw0:0
nvme2: READ sqid:2 cid:123 nsid:1 lba:186822304 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:2 cid:123 cdw0:0
nvme2: READ sqid:5 cid:97 nsid:1 lba:186822560 len:256

Also for the other drive:

nvme3: READ sqid:7 cid:84 nsid:1 lba:1543829024 len:256
nvme3: UNRECOVERED READ ERROR (02/81) sqid:7 cid:84 cdw0:0

smartctl does see the errors (but still says "SMART overall-health self-assessment test result: PASSED"):

Media and Data Integrity Errors:    190
Error Information Log Entries:      190
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        190     1  0x006e  0xc502  0x000   3649951416     1     -
  1        189     6  0x0067  0xc502  0x000   2909882960     1     -

And for the other drive:

Media and Data Integrity Errors:    284
Error Information Log Entries:      284

Is the following thinking somewhat correct?
- ZFS doesn't remove the drives because there are no write errors, and I've been lucky so far in that the read errors were repairable.
- Both drives are unreliable. If it were a hardware problem (both sit on a PCIe card, not the motherboard) or a software problem elsewhere, smartctl would not find these errors in the drive logs.

I'll replace one drive and see whether the errors go away for that drive. If this works I'll replace the other one as well. I have the same setup on another machine, and that one is error free.

Could more expensive SSDs have made a difference here? According to smartctl I've now written 50TB; these drives should be good for 1200TBW.

I back up the drives by making a snapshot and then using "zfs send > imgfile" to a hard drive. What would have happened here if more and more read errors had occurred?

I may change this to a separate imgfile for even and odd days, or even one for every day of the week if I have enough room for that.

Thanks for any input,
Bram
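
P.S. A rough sketch of the rotating imgfile idea from the last paragraph, in Python. Only the pool name zextra and the "zfs snapshot" / "zfs send" commands come from the setup described above; the backup directory, the file and snapshot naming, and the helper names (slot_name, image_path, backup) are assumptions made up for illustration. The stream is written to a temporary file first, so an interrupted or failed send should not overwrite the previous image in that slot.

```python
import datetime
import os
import subprocess

POOL = "zextra"          # pool name taken from the zpool status output
BACKUP_DIR = "/backup"   # hypothetical mount point of the backup hard drive

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def slot_name(day, slots=7):
    """Return the rotation slot for a date.

    slots=7 gives one image per weekday; slots=2 alternates between
    'even' and 'odd' on consecutive days (based on the ordinal day
    number, so the alternation also holds across month boundaries).
    """
    if slots == 7:
        return DAYS[day.weekday()]
    if slots == 2:
        return "even" if day.toordinal() % 2 == 0 else "odd"
    raise ValueError("only 2 or 7 slots are handled in this sketch")


def image_path(day, slots=7):
    """Path of the image file that the backup for this date would replace."""
    return os.path.join(BACKUP_DIR, f"{POOL}-{slot_name(day, slots)}.img")


def backup(day=None, slots=7):
    """Snapshot the pool's root dataset and send it into the slot for `day`.

    Note: this snapshots only the named dataset (no child datasets) and
    never destroys old snapshots; both would need handling in real use.
    """
    day = day or datetime.date.today()
    snapshot = f"{POOL}@backup-{day.isoformat()}"
    final = image_path(day, slots)
    partial = final + ".partial"

    subprocess.run(["zfs", "snapshot", snapshot], check=True)
    with open(partial, "wb") as out:
        # check=True raises if zfs send exits non-zero, so the rename
        # below is skipped and the older image in this slot is kept.
        subprocess.run(["zfs", "send", snapshot], stdout=out, check=True)
    os.replace(partial, final)
    return final
```

Calling backup() once a day (from cron, for example) would then keep up to seven images, or two with backup(slots=2), each overwritten only after a newer send has completed.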
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?emf36013e4-0469-47cd-a99d-d06600df1565>