Date: Thu, 6 Nov 2025 00:08:36 +0500
From: "Eugene M. Zheganin" <eugene@zheganin.net>
To: FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject: zfs pool keeps losing files after drive failure
Message-ID: <f0883f43-23c6-4139-b844-61a4623a59dc@zheganin.net>
Hello,

I've been using FreeBSD 14.2 for a long time as significant file storage on a server with 256 GB of memory; no ARC limit is in effect. The files are stored on a pool configured as:

===Cut===
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank     135T   125T  9.74T    4.06G         -    60%    92%  1.20x  ONLINE  -
zroot   31.5G  29.6G  1.90G     122M         -    79%    93%  1.00x  ONLINE  -
===Cut===

Drives are attached via <Adaptec 3101-4i 7.41> at scbus0 target 0 lun 0 (pass0) as passthrough ones.

===Cut===
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 16 days 05:58:25 with 1497 errors on Fri Sep 19 22:04:18 2025
checkpoint: created Wed Nov  5 00:00:03 2025, consumes 4.06G
config:

        NAME            STATE     READ WRITE CKSUM
        tank            ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            gpt/data0   ONLINE       0     0     0
            gpt/data1   ONLINE       0     0     0
            gpt/data2   ONLINE       0     0     0
            gpt/data3   ONLINE       0     0     0
            gpt/data4   ONLINE       0     0     0
          raidz1-1      ONLINE       0     0     0
            gpt/data5   ONLINE       0     0    30
            gpt/data6   ONLINE       0     0    30
            gpt/data7   ONLINE       0     0    30
            gpt/data8   ONLINE       0     0    30
            gpt/data9   ONLINE       0     0    30
          raidz1-2      ONLINE       0     0     0
            gpt/data10  ONLINE       0     0     0
            gpt/data11  ONLINE       0     0     0
            gpt/data12  ONLINE       0     0     0
            gpt/data13  ONLINE       0     0     0
            gpt/data14  ONLINE       0     0     0

errors: 229 data errors, use '-v' for a list
===Cut===

(The list of unique files is pretty short for now, since these 229 errors mostly reference snapshots; it contains 11 unique filenames, mostly .mp3 files, plus one <metadata>:<0x0> entry.)

The ZFS dataset configuration looks like this (deduplication is enabled only on tank0; the troubled dataset is tank3):

===Cut===
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              112T  7.12T   141K  /tank
tank/data         111T  7.12T  3.72G  /usr/local/public
tank/data/tank0  19.9T  7.12T  19.9T  /usr/local/public/tank0
tank/data/tank1  16.1T  7.12T  16.1T  /usr/local/public/tank1
tank/data/tank2  38.8T  7.12T  38.8T  /usr/local/public/tank2
tank/data/tank3  36.6T  7.12T  28.9T  /usr/local/public/tank3
tank/obj         17.7G  7.12T  17.7G  /usr/obj
tank/www          197M  7.12T   197M  /var/www
===Cut===

As you can probably see, the pool contains errors. The thing is, I've been desperately fighting these errors for almost a year now. They started to appear when a drive in this very vdev died and was successfully replaced. But the subsequent scrub found that some files were damaged beyond what self-healing could repair; since this entire pool is replicated to a sibling SAN (and its copies of these files were intact), they were all deleted and copied back from it, with no errors (except the metadata error, which persisted).

Since that moment, this very vdev keeps producing these errors, and each time new files are affected. Each time I re-copy them, new files get corrupted, each time different ones. The errors stay on the vdev where the initial errors were discovered, the same vdev where the drive died.

Since then I've done an extended memtest86+ memory check (several full cycles over 14 hours, no errors logged), a firmware upgrade for the Adaptec controller, and of course a FreeBSD upgrade (I'm not sure now which OS version was running when the pool first got struck, but it was probably 13.0).

The questions, obviously, are:

- can these errors be fixed?
- do they appear because the pool metadata is corrupted?
- can they be fixed without destroying the pool and re-receiving it?

and, probably, what's causing them.

Thanks.
Eugene.
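P.S. In case it helps, here is roughly what I've been running to sanity-check the hardware side. Device names are from my box (adjust the da number to your layout), and smartctl comes from the sysutils/smartmontools port:

===Cut===
# confirm the controller really exposes the members as raw passthrough disks
camcontrol devlist

# per-drive SMART health for each member of the suspect raidz1-1 vdev,
# e.g. for the disk behind gpt/data5
smartctl -a /dev/da5
===Cut===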
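This is the cycle I go through after restoring damaged files from the replica; my understanding is that the error list only fully clears once the snapshots still referencing the bad blocks are gone and the pool has been re-scrubbed:

===Cut===
# full list of damaged paths, snapshots included
zpool status -v tank

# after deleting/re-copying the affected files and destroying the
# snapshots that still reference them, reset the counters and verify
zpool clear tank
zpool scrub tank
===Cut===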
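I also wonder whether the checkpoint is pinning old (possibly damaged) blocks; if so, discarding it before the next scrub might be worth trying. That's an assumption on my part, not something I've confirmed:

===Cut===
# discard the current pool checkpoint (frees the 4.06G it consumes)
zpool checkpoint -d tank
===Cut===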
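And regarding re-receiving: if it comes to that, I assume it could be done per dataset rather than by destroying the whole pool. A rough sketch, where "replica" and "@latest" are placeholders for the sibling SAN's hostname and a current snapshot:

===Cut===
# set aside the damaged dataset, pull an intact copy from the sibling
# SAN, then drop the old one
zfs rename tank/data/tank3 tank/data/tank3.bad
ssh replica "zfs send -R tank/data/tank3@latest" | zfs receive tank/data/tank3
zfs destroy -r tank/data/tank3.bad
===Cut===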
