Date: Sat, 26 Mar 2022 16:22:38 -0700
From: David Christensen <dpchrist@holgerdanske.com>
To: questions@freebsd.org
Subject: Re: zfs mirror pool online but drives have read errors
Message-ID: <a263b16b-f9d6-65de-f6eb-b148409cfbc4@holgerdanske.com>
In-Reply-To: <emf36013e4-0469-47cd-a99d-d06600df1565@winserver>
References: <emf36013e4-0469-47cd-a99d-d06600df1565@winserver>

On 3/26/22 09:45, Bram Van Steenlandt wrote:
> Hi all,
>
> English is not my native language,sorry about any errors
>
> I'm experiencing something which I don't fully understand, maybe someone
> here can offer some insight.
>
> I have a zfs mirror of 2 Samsung 980 pro 2TB nvme drives, according to
> zfs the pool is online,
> It did repair 54M on the last scrub, I did another scrub today and again
> repairs are needed (only 128K this time).
>
>   pool: zextra
>  state: ONLINE
>   scan: scrub repaired 54M in 0 days 00:41:42 with 0 errors on Thu Mar
> 24 09:44:02 2022
> config:
>
>     NAME        STATE     READ WRITE CKSUM
>     zextra      ONLINE       0     0     0
>       mirror-0  ONLINE       0     0     0
>         nvd2    ONLINE       0     0     0
>         nvd3    ONLINE       0     0     0
>
> errors: No known data errors
>
> In dmesg I have messages like this:
> nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0
> nvme2: READ sqid:8 cid:119 nsid:1 lba:3831589512 len:256
> nvme2: UNRECOVERED READ ERROR (02/81) sqid:8 cid:119 cdw0:0
> nvme2: READ sqid:2 cid:123 nsid:1 lba:186822304 len:256
> nvme2: UNRECOVERED READ ERROR (02/81) sqid:2 cid:123 cdw0:0
> nvme2: READ sqid:5 cid:97 nsid:1 lba:186822560 len:256
> also for the other drive:
> nvme3: READ sqid:7 cid:84 nsid:1 lba:1543829024 len:256
> nvme3: UNRECOVERED READ ERROR (02/81) sqid:7 cid:84 cdw0:0
>
> smartctl does see the errors (but still says SMART overall-health
> self-assessment test result: PASSED ):
> Media and Data Integrity Errors: 190
> Error Information Log Entries: 190
> Error Information (NVMe Log 0x01, 16 of 64 entries)
> Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
>   0        190     1  0x006e  0xc502  0x000   3649951416     1     -
>   1        189     6  0x0067  0xc502  0x000   2909882960     1     -
>
> and for the other drive:
> Media and Data Integrity Errors: 284
> Error Information Log Entries: 284
>
> Is the following thinking somewhat correct ?
> -zfs doesn't remove the drives because it has no write errors and I've
> been lucky so far in that read errors were repairable.
> -Both drives are unreliable, if it was a hardware (both sit on a pcie
> card, not the motherboard) or software problem elsewhere smartctl would
> not find these errors in the drive logs.
>
> I'll replace one drive and see if any of the errors go away for that
> drive, If this works I'll replace the other one as well, I have this
> same setup on another machine, this one is error free.
> Could more expensive ssd's made a difference here ? according to
> smartctl I've now written 50TB, these drives should be good for 1200TBW
>
> I backup the drives by making a snapshot and then using "zfs send >
> imgfile" to a hard drive, what would have have happened here if more and
> more read errors would occur ?
> I may change this to a separate imgfile for the even and uneven days, or
> even one for every day of the week if I have enough room for that.
>
> thx for any input
> Bram

Posting these commands would be helpful:

# freebsd-version ; uname -a

# zpool list zextra

# zfs list zextra

What is the make and model of the system or motherboard? URL? Is the
firmware current? What processor(s) make and model? How much memory?
ECC? Any problems?

What is the make and model of the PCIe NVMe adapter? URL? Firmware?
Problems?

Please tell us about the other machine. Is it your production machine,
or can you use it and its components for A/B testing against the problem
machine and components?

I am unclear about the details of ZFS faulting drives that are
generating errors.

My experience is that the hardware must be 100% before there is any hope
of the software working correctly. Unfortunately, it takes a while
before new hardware works well on FOSS platforms (if ever). It is
useful to have a supply of older, known-good/supported hardware for A/B
testing.

I find a networked version control system (CVS) to be very helpful for
system administration and trouble-shooting.

I put each OS on its own 2.5" SATA SSD and install trayless mobile racks
in my machines.
This facilitates using multiple OS's on the same hardware, and
completely avoids multi-boot. It also allows A/B testing different OS's
against the same hardware. It would be useful to try Linux or Windows
on the problem hardware.

I experienced a lot of problems when I migrated from Debian, mdadm, and
desktop hardware to FreeBSD, ZFS, and server hardware. Backups were
critical. I started with individual HDD's in rotation, put one ZFS pool
on each, and backed up via snapshots and replication. I soon built
another server with a ZFS mirror and added it as a backup destination.

David

p.s. Complete SMART reports are more useful than snippets. But, I am
unclear if this list considers including SMART reports inside a message
to be rude. Comments? (Posting the reports on the WWW and including a
URL should be okay.)
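
When comparing the two drives during A/B testing, it may help to tally the
"UNRECOVERED READ ERROR" lines from dmesg per controller. The following is
only a rough sketch in Python. The function name count_read_errors and the
SAMPLE string are illustrative assumptions, and the regular expression
matches just the message format shown in the quoted excerpt, not every
message the nvme driver can print.

```python
import re
from collections import Counter

# Matches only the line format seen in the quoted dmesg excerpt, e.g.
# "nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0".
ERROR_RE = re.compile(r"^(nvme\d+): UNRECOVERED READ ERROR\b")


def count_read_errors(dmesg_text):
    """Return a Counter mapping controller name to unrecovered read errors."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input copied from the quoted message (SAMPLE is a hypothetical name).
SAMPLE = """\
nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0
nvme2: READ sqid:8 cid:119 nsid:1 lba:3831589512 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:8 cid:119 cdw0:0
nvme2: READ sqid:2 cid:123 nsid:1 lba:186822304 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:2 cid:123 cdw0:0
nvme2: READ sqid:5 cid:97 nsid:1 lba:186822560 len:256
nvme3: READ sqid:7 cid:84 nsid:1 lba:1543829024 len:256
nvme3: UNRECOVERED READ ERROR (02/81) sqid:7 cid:84 cdw0:0
"""

if __name__ == "__main__":
    # Print one summary line per controller, sorted by name.
    for controller, count in sorted(count_read_errors(SAMPLE).items()):
        print(f"{controller}: {count} unrecovered read error(s)")
```

On a live system the text could come from the output of dmesg instead of
the sample string. The kernel message buffer is finite, so counts taken
this way are a lower bound and the SMART error log remains the better
record.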