Date:      Sat, 26 Mar 2022 16:45:57 +0000
From:      "Bram Van Steenlandt" <bram@diomedia.be>
To:        "Freebsd Questions" <freebsd-questions@freebsd.org>
Subject:   zfs mirror pool online but drives have read errors
Message-ID:  <emf36013e4-0469-47cd-a99d-d06600df1565@winserver>

Hi all,

English is not my native language, sorry about any errors.

I'm experiencing something which I don't fully understand, maybe someone
here can offer some insight.

I have a ZFS mirror of two Samsung 980 Pro 2TB NVMe drives. According to
ZFS the pool is online.
It repaired 54M on the last scrub. I ran another scrub today and again
repairs were needed (only 128K this time).

   pool: zextra
  state: ONLINE
   scan: scrub repaired 54M in 0 days 00:41:42 with 0 errors on Thu Mar 24 09:44:02 2022
config:

         NAME        STATE     READ WRITE CKSUM
         zextra      ONLINE       0     0     0
           mirror-0  ONLINE       0     0     0
             nvd2    ONLINE       0     0     0
             nvd3    ONLINE       0     0     0

errors: No known data errors

In dmesg I have messages like this:
nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0
nvme2: READ sqid:8 cid:119 nsid:1 lba:3831589512 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:8 cid:119 cdw0:0
nvme2: READ sqid:2 cid:123 nsid:1 lba:186822304 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:2 cid:123 cdw0:0
nvme2: READ sqid:5 cid:97 nsid:1 lba:186822560 len:256
also for the other drive:
nvme3: READ sqid:7 cid:84 nsid:1 lba:1543829024 len:256
nvme3: UNRECOVERED READ ERROR (02/81) sqid:7 cid:84 cdw0:0
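
To make the pattern in these messages easier to see, here is a minimal Python sketch that counts the unrecovered read errors per controller and pairs each error with the READ command that failed (matched on controller, sqid and cid). The regular expressions only assume the line format shown above; the names `summarize`, `DMESG_SAMPLE`, `READ_RE` and `ERROR_RE` are made up for this example.

```python
import re
from collections import Counter

# The dmesg lines quoted above, copied verbatim.
DMESG_SAMPLE = """\
nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0
nvme2: READ sqid:8 cid:119 nsid:1 lba:3831589512 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:8 cid:119 cdw0:0
nvme2: READ sqid:2 cid:123 nsid:1 lba:186822304 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:2 cid:123 cdw0:0
nvme2: READ sqid:5 cid:97 nsid:1 lba:186822560 len:256
nvme3: READ sqid:7 cid:84 nsid:1 lba:1543829024 len:256
nvme3: UNRECOVERED READ ERROR (02/81) sqid:7 cid:84 cdw0:0
"""

READ_RE = re.compile(
    r"^(nvme\d+): READ sqid:(\d+) cid:(\d+) nsid:\d+ lba:(\d+) len:(\d+)"
)
ERROR_RE = re.compile(
    r"^(nvme\d+): UNRECOVERED READ ERROR \(\w+/\w+\) sqid:(\d+) cid:(\d+)"
)


def summarize(text):
    """Return (error count per controller, list of failed (ctrl, lba, len))."""
    errors = Counter()
    pending = {}  # (ctrl, sqid, cid) -> (lba, len) of the last READ seen
    failed = []
    for line in text.splitlines():
        m = READ_RE.match(line)
        if m:
            ctrl, sqid, cid, lba, length = m.groups()
            pending[(ctrl, sqid, cid)] = (int(lba), int(length))
            continue
        m = ERROR_RE.match(line)
        if m:
            ctrl, sqid, cid = m.groups()
            errors[ctrl] += 1
            # An error whose READ line was not captured stays unpaired.
            if (ctrl, sqid, cid) in pending:
                lba, length = pending.pop((ctrl, sqid, cid))
                failed.append((ctrl, lba, length))
    return errors, failed


if __name__ == "__main__":
    errors, failed = summarize(DMESG_SAMPLE)
    for ctrl, count in sorted(errors.items()):
        print(ctrl, count)
    for ctrl, lba, length in failed:
        print(f"{ctrl}: failed read at lba {lba}, len {length}")
```

The first nvme2 error and the last nvme2 READ have no partner in the excerpt, so they are counted but not paired.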

smartctl does see the errors (but still reports "SMART overall-health
self-assessment test result: PASSED"):
Media and Data Integrity Errors:    190
Error Information Log Entries:      190
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
   0        190     1  0x006e  0xc502  0x000   3649951416     1     -
   1        189     6  0x0067  0xc502  0x000   2909882960     1     -

and for the other drive:
Media and Data Integrity Errors:    284
Error Information Log Entries:      284
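
The Status value 0xc502 in the error log can be checked against the "(02/81)" in dmesg. The sketch below assumes smartctl prints the raw 16-bit status word from the error log entry, with the phase tag in bit 0 and the status field in the bits above it, as laid out in the NVMe specification. The function name `decode_nvme_status` is made up for this example.

```python
def decode_nvme_status(status):
    """Split a 16-bit NVMe status word (phase tag in bit 0) into its fields."""
    return {
        "phase": status & 0x1,        # phase tag
        "sc": (status >> 1) & 0xFF,   # status code
        "sct": (status >> 9) & 0x7,   # status code type
        "crd": (status >> 12) & 0x3,  # command retry delay
        "more": (status >> 14) & 0x1, # more info in the error log
        "dnr": (status >> 15) & 0x1,  # do not retry
    }


if __name__ == "__main__":
    fields = decode_nvme_status(0xC502)
    # Same (sct/sc) notation as the dmesg lines.
    print("({sct:02x}/{sc:02x})".format(**fields))  # prints "(02/81)"
    print("do not retry:", bool(fields["dnr"]))
```

Under that assumption the log entries decode to status code type 2 (media and data integrity errors) and status code 0x81 (unrecovered read error), so smartctl and dmesg are reporting the same kind of failure.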

Is the following reasoning roughly correct?
- ZFS doesn't remove the drives because there are no write errors, and
I've been lucky so far in that the read errors were repairable.
- Both drives are unreliable: if this were a hardware problem elsewhere
(both drives sit on a PCIe card, not the motherboard) or a software
problem, smartctl would not find these errors in the drive logs.

I'll replace one drive and see if the errors go away for that drive. If
this works I'll replace the other one as well. I have the same setup on
another machine, and that one is error free.
Could more expensive SSDs have made a difference here? According to
smartctl I've now written 50TB; these drives should be good for 1200TBW.
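
As a quick check of the wear argument, the two figures above (50TB written, 1200TBW rated) work out as follows. This is only arithmetic on the numbers quoted in this mail; the function name `endurance_used` is made up for this example.

```python
def endurance_used(written_tb, rated_tbw):
    """Fraction of the rated write endurance consumed so far."""
    return written_tb / rated_tbw


if __name__ == "__main__":
    used = endurance_used(50, 1200)
    print(f"{used:.1%} of rated endurance used")  # prints "4.2% of rated endurance used"
```

So by the rated figure the drives are nowhere near worn out.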

I back up the drives by making a snapshot and then using "zfs send >
imgfile" to a hard drive. What would have happened here if more and
more read errors had occurred?
I may change this to a separate imgfile for even and odd days, or
even one for every day of the week if I have enough room for that.
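
The rotation I have in mind could be sketched like this in Python. The file names and the pool prefix are placeholders, and the two function names are made up for this example. For the even/odd variant I assume alternating on the running day count rather than the day of the month, so that the 31st followed by the 1st does not overwrite the same file two days in a row.

```python
from datetime import date, timedelta

WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def weekly_image_name(day, pool="zextra"):
    """One image file per day of the week (7 files in rotation)."""
    return f"{pool}-{WEEKDAYS[day.weekday()]}.img"


def alternating_image_name(day, pool="zextra"):
    """Two image files, alternating strictly from one day to the next."""
    slot = "even" if day.toordinal() % 2 == 0 else "odd"
    return f"{pool}-{slot}.img"


if __name__ == "__main__":
    today = date(2022, 3, 26)
    print(weekly_image_name(today))  # prints "zextra-sat.img"
    print(alternating_image_name(today))
    print(alternating_image_name(today + timedelta(days=1)))
```

The output of "zfs send" would then be redirected to whichever name comes back for the current day.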

thx for any input
Bram




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?emf36013e4-0469-47cd-a99d-d06600df1565>