From: "Bram Van Steenlandt"
To: "Freebsd Questions"
Subject: zfs mirror pool online but drives have read errors
Date: Sat, 26 Mar 2022 16:45:57 +0000

Hi all,

English is not my native language, sorry about any errors.

I'm experiencing something which I don't fully understand, maybe someone here can offer some insight.

I have a ZFS mirror of two Samsung 980 Pro 2TB NVMe drives. According to ZFS the pool is online.
It repaired 54M on the last scrub. I ran another scrub today and repairs were needed again (only 128K this time).

  pool: zextra
 state: ONLINE
  scan: scrub repaired 54M in 0 days 00:41:42 with 0 errors on Thu Mar 24 09:44:02 2022
config:

        NAME        STATE     READ WRITE CKSUM
        zextra      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nvd2    ONLINE       0     0     0
            nvd3    ONLINE       0     0     0

errors: No known data errors

In dmesg I have messages like this:
nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0
nvme2: READ sqid:8 cid:119 nsid:1 lba:3831589512 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:8 cid:119 cdw0:0
nvme2: READ sqid:2 cid:123 nsid:1 lba:186822304 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:2 cid:123 cdw0:0
nvme2: READ sqid:5 cid:97 nsid:1 lba:186822560 len:256
also for the other drive:
nvme3: READ sqid:7 cid:84 nsid:1 lba:1543829024 len:256
nvme3: UNRECOVERED READ ERROR (02/81) sqid:7 cid:84 cdw0:0
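
To see which reads actually failed, the READ lines can be paired with the error lines that follow them. The following is only a minimal sketch: it assumes that each error line carries the same sqid and cid as the READ command it belongs to, and that the message format is exactly as in the excerpt above. The function name failed_reads is made up for this illustration.

```python
import re
from collections import defaultdict

# Sample lines copied from the dmesg excerpt above.
DMESG = """\
nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0
nvme2: READ sqid:8 cid:119 nsid:1 lba:3831589512 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:8 cid:119 cdw0:0
nvme2: READ sqid:2 cid:123 nsid:1 lba:186822304 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:2 cid:123 cdw0:0
nvme2: READ sqid:5 cid:97 nsid:1 lba:186822560 len:256
nvme3: READ sqid:7 cid:84 nsid:1 lba:1543829024 len:256
nvme3: UNRECOVERED READ ERROR (02/81) sqid:7 cid:84 cdw0:0
"""

READ_RE = re.compile(
    r"^(nvme\d+): READ sqid:(\d+) cid:(\d+) nsid:\d+ lba:(\d+) len:(\d+)$"
)
ERR_RE = re.compile(
    r"^(nvme\d+): UNRECOVERED READ ERROR \(\w+/\w+\) sqid:(\d+) cid:(\d+)"
)


def failed_reads(text):
    """Return {controller: [(lba, len), ...]} for READs followed by an error.

    Assumption: an error line belongs to the most recent READ with the
    same controller, sqid and cid. Error lines without a matching READ
    (for example because the log excerpt was cut) are ignored.
    """
    pending = {}                 # (controller, sqid, cid) -> (lba, len)
    failed = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        m = READ_RE.match(line)
        if m:
            ctrl, sqid, cid, lba, length = m.groups()
            pending[(ctrl, sqid, cid)] = (int(lba), int(length))
            continue
        m = ERR_RE.match(line)
        if m:
            key = m.groups()
            if key in pending:
                failed[key[0]].append(pending.pop(key))
    return dict(failed)


if __name__ == "__main__":
    for ctrl, reads in failed_reads(DMESG).items():
        for lba, length in reads:
            print(f"{ctrl}: failed read at lba {lba} len {length}")
```

On a real system the text would come from the full dmesg output instead of the pasted sample, which would give a complete list of affected LBAs per drive.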

smartctl does see the errors (but still says "SMART overall-health self-assessment test result: PASSED"):
Media and Data Integrity Errors:    190
Error Information Log Entries:      190
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        190     1  0x006e  0xc502  0x000   3649951416     1     -
  1        189     6  0x0067  0xc502  0x000   2909882960     1     -

and for the other drive:
Media and Data Integrity Errors:    284
Error Information Log Entries:      284
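
The Status value 0xc502 in the smartctl table appears to be the same error as the (02/81) in dmesg, only packed into one 16-bit word. The sketch below decodes it, assuming smartctl prints the raw status field of the NVMe error log entry and that the bit layout is the one from the NVMe base specification (bit 0 phase tag, bits 1-8 status code, bits 9-11 status code type, bits 12-13 command retry delay, bit 14 more, bit 15 do not retry). The function name decode_status is made up for this illustration.

```python
def decode_status(raw):
    """Split a 16-bit NVMe status word into its fields.

    Assumed layout (NVMe base specification, completion status field
    including the phase tag in bit 0):
      bit 0      phase tag
      bits 1-8   status code (SC)
      bits 9-11  status code type (SCT)
      bits 12-13 command retry delay (CRD, reserved before NVMe 1.4)
      bit 14     more (M)
      bit 15     do not retry (DNR)
    """
    return {
        "phase": raw & 0x1,
        "sc": (raw >> 1) & 0xFF,
        "sct": (raw >> 9) & 0x7,
        "crd": (raw >> 12) & 0x3,
        "more": (raw >> 14) & 0x1,
        "dnr": (raw >> 15) & 0x1,
    }


if __name__ == "__main__":
    fields = decode_status(0xC502)  # Status column from the smartctl table
    # Print in the same SCT/SC form that dmesg uses.
    print(f"{fields['sct']:02x}/{fields['sc']:02x}")  # prints 02/81
    print("do not retry:", bool(fields["dnr"]))
```

Under that assumption both tools report status code type 0x2 (media and data integrity errors) with status code 0x81 (unrecovered read error), so the drive log and the kernel messages agree with each other.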

Is the following thinking somewhat correct?
- ZFS doesn't remove the drives because there are no write errors, and I've been lucky so far in that the read errors were repairable.
- Both drives are unreliable. If it were a hardware problem (both sit on a PCIe card, not the motherboard) or a software problem elsewhere, smartctl would not find these errors in the drive logs.

I'll replace one drive and see if the errors go away for that drive. If this works I'll replace the other one as well. I have the same setup on another machine, and that one is error free.
Could more expensive SSDs have made a difference here? According to smartctl I've now written 50TB, and these drives should be good for 1200 TBW.

I back up the drives by making a snapshot and then using "zfs send > imgfile" to a hard drive. What would have happened here if more and more read errors had occurred?
I may change this to a separate imgfile for even and odd days, or even one for every day of the week if I have enough room for that.
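
The rotation idea could be sketched like this. It is only an illustration: the function name image_name, the file naming and the "zextra" prefix are made up, and "even and odd days" is interpreted here as the parity of the running day count (not the day of the month), so that two consecutive days never share a file, even across a month boundary.

```python
from datetime import date

# Fixed names instead of strftime("%a"), so the result does not depend
# on the system locale.
WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def image_name(day, scheme="weekday", prefix="zextra"):
    """Return the image file name to overwrite on the given day.

    scheme="weekday": seven files, one per day of the week.
    scheme="parity":  two files, alternating every day. The parity is
                      taken from the running day count, an assumption
                      made so that the 31st and the 1st do not collide.
    """
    if scheme == "weekday":
        slot = WEEKDAYS[day.weekday()]
    elif scheme == "parity":
        slot = "even" if day.toordinal() % 2 == 0 else "odd"
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return f"{prefix}-{slot}.img"


if __name__ == "__main__":
    today = date.today()
    target = image_name(today)
    # The snapshot name is hypothetical; the command is only printed,
    # not executed.
    print(f"zfs send zextra@backup-{today.isoformat()} > {target}")
```

With the weekday scheme a stream that turns out to be unusable can be up to a week old before it is overwritten, at the cost of seven times the disk space.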

thx for any input
Bram






