Date:      Thu, 6 Nov 2025 00:08:36 +0500
From:      "Eugene M. Zheganin" <eugene@zheganin.net>
To:        FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject:   zfs pool keeps losing files after drive failure
Message-ID:  <f0883f43-23c6-4139-b844-61a4623a59dc@zheganin.net>

Hello,

I've been using FreeBSD 14.2 for a long time as a significant file store on 
a server with 256 GB of memory; no ARC limit is in effect. The 
files are stored on a pool configured as:


===Cut===

NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank    135T   125T  9.74T    4.06G         -    60%    92%  1.20x    ONLINE  -
zroot  31.5G  29.6G  1.90G     122M         -    79%    93%  1.00x    ONLINE  -

===Cut===

Drives are attached via

<Adaptec 3101-4i 7.41>  at scbus0 target 0 lun 0 (pass0)

as passthrough devices.
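
For reference, the pass-through devices can be listed and checked like this 
(a minimal sketch; the disk entry and device node below are only illustrative, 
and smartctl comes from the smartmontools port):

===Cut===

# List CAM-attached devices; the controller and each pass-through disk
# get their own passN (and daN) nodes:
camcontrol devlist

# Illustrative disk entry (not from this box):
# <ST16000NM001G SN04>  at scbus0 target 1 lun 0 (da5,pass6)

# SMART data is readable through the pass-through devices, which helps
# rule out a quietly failing disk in the suspect vdev:
smartctl -a /dev/da5

===Cut===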

===Cut===

   pool: tank
  state: ONLINE
status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
   scan: scrub repaired 0B in 16 days 05:58:25 with 1497 errors on Fri Sep 19 22:04:18 2025
checkpoint: created Wed Nov  5 00:00:03 2025, consumes 4.06G
config:

         NAME            STATE     READ WRITE CKSUM
         tank            ONLINE       0     0     0
           raidz1-0      ONLINE       0     0     0
             gpt/data0   ONLINE       0     0     0
             gpt/data1   ONLINE       0     0     0
             gpt/data2   ONLINE       0     0     0
             gpt/data3   ONLINE       0     0     0
             gpt/data4   ONLINE       0     0     0
           raidz1-1      ONLINE       0     0     0
             gpt/data5   ONLINE       0     0    30
             gpt/data6   ONLINE       0     0    30
             gpt/data7   ONLINE       0     0    30
             gpt/data8   ONLINE       0     0    30
             gpt/data9   ONLINE       0     0    30
           raidz1-2      ONLINE       0     0     0
             gpt/data10  ONLINE       0     0     0
             gpt/data11  ONLINE       0     0     0
             gpt/data12  ONLINE       0     0     0
             gpt/data13  ONLINE       0     0     0
             gpt/data14  ONLINE       0     0     0

errors: 229 data errors, use '-v' for a list

===Cut===

(The unique file list is pretty short for now: these 229 errors 
mostly reference snapshots, and the list contains 11 unique filenames, 
mostly .mp3 files, plus one <metadata>:<0x0> entry.)
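
For reference, this is roughly how the unique list is pulled out of zpool status 
(a minimal sketch; it assumes snapshot entries contain an '@' and that the 
file names contain no spaces):

===Cut===

# Full list of damaged objects: live files, snapshot references and the
# <metadata>:<0x0> object.
zpool status -v tank

# Only the live-dataset paths, collapsed to unique names; snapshot
# entries (dataset@snap:/path) don't start with '/' and are skipped.
zpool status -v tank | awk '$1 ~ /^\// && $1 !~ /@/' | sort -u

===Cut===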

The ZFS dataset configuration looks like this (deduplication is enabled only 
for tank0; the troubled dataset is tank3):

===Cut===

NAME                        USED  AVAIL  REFER  MOUNTPOINT
tank                        112T  7.12T   141K  /tank
tank/data                   111T  7.12T  3.72G  /usr/local/public
tank/data/tank0            19.9T  7.12T  19.9T  /usr/local/public/tank0
tank/data/tank1            16.1T  7.12T  16.1T  /usr/local/public/tank1
tank/data/tank2            38.8T  7.12T  38.8T  /usr/local/public/tank2
tank/data/tank3            36.6T  7.12T  28.9T  /usr/local/public/tank3
tank/obj                   17.7G  7.12T  17.7G  /usr/obj
tank/www                    197M  7.12T   197M  /var/www

===Cut===
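
(As a side note, the per-dataset dedup and checksum settings can be confirmed 
like this; the dataset names are the ones from the list above:)

===Cut===

# Show which datasets actually have deduplication enabled and confirm
# that checksumming is not disabled anywhere under tank/data:
zfs get -r -t filesystem dedup,checksum tank/data

===Cut===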

As you can probably see, it contains errors. The thing is, I've been 
desperately fighting these errors for almost a year now. They started to 
appear when a drive in this very vdev died and was successfully 
replaced. But a scrub found that some of the files were damaged beyond 
what self-healing could repair; since this entire pool is replicated to a 
sibling SAN (and its copies of these files were intact), they were all 
deleted and copied back from it, with no errors (except the metadata error, 
which persisted). Since that moment, this very vdev keeps on producing 
these errors, and each time new files are affected. Each time I copy them 
back, new files get corrupted, each time different ones. The errors stay 
on the vdev where the initial errors were discovered, the same vdev where 
the drive died. Since then I've done an extended memtest86+ memory check 
(several full cycles, 14 hours, no errors logged), a firmware upgrade of 
the Adaptec controller, and of course a FreeBSD upgrade (I'm not sure now 
what the initial OS version was when the pool got struck, but it was 
probably 13.0).
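
For reference, the cleanup cycle after each incident looks roughly like this 
(a sketch with placeholder file, host and snapshot names; the real paths come 
from zpool status -v, and the copy from the sibling SAN is just illustrated 
with scp):

===Cut===

# 1. Remove the damaged copy and restore it from the replica
#    (placeholder path and host):
rm /usr/local/public/tank3/some/damaged.mp3
scp sibling-san:/usr/local/public/tank3/some/damaged.mp3 \
    /usr/local/public/tank3/some/

# 2. Errors that only reference snapshots stay listed until the
#    snapshots holding the damaged blocks are destroyed:
zfs destroy tank/data/tank3@affected-snapshot

# 3. Clear the counters and rescan; the persistent error list is only
#    cleaned up after a subsequent scrub completes:
zpool clear tank
zpool scrub tank

===Cut===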


The questions obviously are:

- can these errors be fixed?

- do they appear because the pool metadata is corrupted?

- can they be fixed without destroying the pool and re-receiving it?

and, probably, what's causing them.


Thanks.

Eugene.



