From: "Eugene M. Zheganin" <eugene@zheganin.net>
To: FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Date: Thu, 6 Nov 2025 00:08:36 +0500
Subject: zfs pool keeps losing files after drive failure

Hello,

I've been using FreeBSD 14.2 for a long time as a significant file storage server; the machine has 256 GB of memory and no ARC limit is in effect. The files are stored on a pool configured as:

===Cut===
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank    135T   125T  9.74T    4.06G         -    60%    92%  1.20x    ONLINE  -
zroot  31.5G  29.6G  1.90G     122M         -    79%    93%  1.00x    ONLINE  -
===Cut===

Drives are attached via <...> at scbus0 target 0 lun 0 (pass0), i.e. as pass-through devices.
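For illustration, this is roughly how the attachment shows up in camcontrol devlist output (the vendor/model string is elided here, only the bus/target/lun part is shown):

===Cut===
# camcontrol devlist
<...>  at scbus0 target 0 lun 0 (pass0)
...
===Cut===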
===Cut===
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 16 days 05:58:25 with 1497 errors on Fri Sep 19 22:04:18 2025
checkpoint: created Wed Nov  5 00:00:03 2025, consumes 4.06G
config:

        NAME            STATE     READ WRITE CKSUM
        tank            ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            gpt/data0   ONLINE       0     0     0
            gpt/data1   ONLINE       0     0     0
            gpt/data2   ONLINE       0     0     0
            gpt/data3   ONLINE       0     0     0
            gpt/data4   ONLINE       0     0     0
          raidz1-1      ONLINE       0     0     0
            gpt/data5   ONLINE       0     0    30
            gpt/data6   ONLINE       0     0    30
            gpt/data7   ONLINE       0     0    30
            gpt/data8   ONLINE       0     0    30
            gpt/data9   ONLINE       0     0    30
          raidz1-2      ONLINE       0     0     0
            gpt/data10  ONLINE       0     0     0
            gpt/data11  ONLINE       0     0     0
            gpt/data12  ONLINE       0     0     0
            gpt/data13  ONLINE       0     0     0
            gpt/data14  ONLINE       0     0     0

errors: 229 data errors, use '-v' for a list
===Cut===

(The list of unique files is pretty short for now, as these 229 errors mostly reference snapshots: it contains 11 unique filenames, mostly .mp3 files, plus one :<0x0> entry.)

The zfs dataset configuration looks like this (deduplication is enabled only on tank0; the troubled one is tank3):

===Cut===
NAME                        USED  AVAIL  REFER  MOUNTPOINT
tank                        112T  7.12T   141K  /tank
tank/data                   111T  7.12T  3.72G  /usr/local/public
tank/data/tank0            19.9T  7.12T  19.9T  /usr/local/public/tank0
tank/data/tank1            16.1T  7.12T  16.1T  /usr/local/public/tank1
tank/data/tank2            38.8T  7.12T  38.8T  /usr/local/public/tank2
tank/data/tank3            36.6T  7.12T  28.9T  /usr/local/public/tank3
tank/obj                   17.7G  7.12T  17.7G  /usr/obj
tank/www                    197M  7.12T   197M  /var/www
===Cut===

As you can probably see, the pool contains errors. The thing is, I've been desperately fighting these errors for almost a year now. They started to appear when a drive in this very vdev died and was successfully replaced. But a scrub found that some files were damaged beyond what self-healing could repair; since the entire pool is replicated to a sibling SAN (and its copies of those files were intact), they were all deleted and copied back from it, leaving no errors (except a metadata error that persisted).

Since that moment, this very vdev keeps producing such errors, and each time new files are affected. Each time I restore the damaged files, new ones get corrupted, different ones every time. The errors stay on the vdev where the initial errors were discovered, the same vdev where the drive died.

Since then I've done an extended memtest86+ memory check (several full cycles, 14 hours, no errors logged), a firmware upgrade of the Adaptec controller and, of course, a FreeBSD upgrade (I'm not sure now which OS version the pool was on when it was first hit, but it was probably 13.0).

The questions obviously are:
- can these errors be fixed?
- do they appear because the pool metadata is corrupted?
- can they be fixed without destroying the pool and re-receiving it?
and, probably, what's causing them?

Thanks.
Eugene.
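P.S. For reference, the restore cycle I go through after each scrub looks roughly like this (a minimal sketch only: the file name and the replica host/path are placeholders, and the final clear/scrub is simply the generic way to reset and re-check the counters):

===Cut===
# list the files currently flagged as damaged (plus any <0x...> entries)
zpool status -v tank

# for each damaged file: delete it and copy the intact copy back from
# the sibling SAN replica (host and path below are placeholders)
rm /usr/local/public/tank3/some/file.mp3
scp replica-san:/usr/local/public/tank3/some/file.mp3 \
    /usr/local/public/tank3/some/file.mp3

# reset the error counters and scrub again to re-check
zpool clear tank
zpool scrub tank
===Cut===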