Date: Fri, 15 Dec 2023 01:05:05 +0100
From: Miroslav Lachman <000.fbsd@quip.cz>
To: Lexi Winter <lexi@le-fay.org>, "freebsd-fs@freebsd.org" <freebsd-fs@FreeBSD.org>
Subject: Re: unusual ZFS issue
Message-ID: <5d4ceb91-2046-4d2f-92b8-839a330c924a@quip.cz>
In-Reply-To: <787CB64A-1687-49C3-9063-2CE3B6F957EF@le-fay.org>
References: <787CB64A-1687-49C3-9063-2CE3B6F957EF@le-fay.org>
On 14/12/2023 22:17, Lexi Winter wrote:
> hi list,
>
> i’ve just hit this ZFS error:
>
> # zfs list -rt snapshot data/vm/media/disk1
> cannot iterate filesystems: I/O error
> NAME                                                       USED  AVAIL  REFER  MOUNTPOINT
> data/vm/media/disk1@autosnap_2023-12-13_12:00:00_hourly      0B      -  6.42G  -
> data/vm/media/disk1@autosnap_2023-12-14_10:16:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_11:17:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:04:00_monthly     0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:15:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_13:14:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_14:38:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_15:11:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:12:00_hourly     316K      -  6.47G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:29:00_daily     2.70M      -  6.47G  -
>
> the pool itself also reports an error:
>
> # zpool status -v
>   pool: data
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
>   scan: scrub in progress since Thu Dec 14 18:58:21 2023
>         11.5T / 18.8T scanned at 1.46G/s, 6.25T / 18.8T issued at 809M/s
>         0B repaired, 33.29% done, 04:30:20 to go
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           raidz2-0  ONLINE       0     0     0
>             da4p1   ONLINE       0     0     0
>             da6p1   ONLINE       0     0     0
>             da5p1   ONLINE       0     0     0
>             da7p1   ONLINE       0     0     0
>             da1p1   ONLINE       0     0     0
>             da0p1   ONLINE       0     0     0
>             da3p1   ONLINE       0     0     0
>             da2p1   ONLINE       0     0     0
>         logs
>           mirror-2  ONLINE       0     0     0
>             ada0p4  ONLINE       0     0     0
>             ada1p4  ONLINE       0     0     0
>         cache
>           ada1p5    ONLINE       0     0     0
>           ada0p5    ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
> (it doesn’t list any files, the output ends there.)
>
> my assumption is that this indicates some sort of metadata corruption issue, but
> i can’t find anything that might have caused it.  none of the disks report any
> errors, and while all the disks are on the same SAS controller, i would have
> expected controller errors to be flagged as CKSUM errors.
>
> my best guess is that this might be caused by a CPU or memory issue, but the
> system has ECC memory and hasn’t reported any issues.
>
> - has anyone else encountered anything like this?

I've never seen "cannot iterate filesystems: I/O error". Could it be that the
system has too many snapshots / not enough memory to list them?

But I have seen the pool report an error in an unknown file without showing
any READ / WRITE / CKSUM errors. This is from my notes taken 10 years ago:

=============================
# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad2     ONLINE       0     0     0
            ad3     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x2da>:<0x258ab13>
=============================

As you can see, there are no CKSUM errors. There is something that should be
a path to a filename: <0x2da>:<0x258ab13>

Maybe it was an error in a snapshot which was already deleted? Just my guess.
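For what it's worth, if I remember correctly the two hex numbers in such an
entry are the dataset (objset) ID and the object ID, so as long as the dataset
still exists, zdb should be able to map them back to a name. Something along
these lines (0x2da is 730 and 0x258ab13 is 39365395 in decimal; the dataset
name in the second command is only a placeholder for whatever the first
command turns up):

# zdb -d tank | grep 'ID 730,'
# zdb -dddd tank/<dataset-with-ID-730> 39365395

The second command should print a path line if the object is a regular file.
If no dataset has that ID any more (a destroyed snapshot or clone, for
example), the lookup fails, which would fit the "already deleted" guess.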
I ran a scrub on that pool; it finished without any errors and then the status
of the pool was OK. A similar error reappeared after a month and then again
after about 6 months. The machine had ECC RAM. After these 3 incidents, I never
saw it again. I still have this machine in working condition; only the disk
drives were replaced, from 4x 1TB to 4x 4TB and then to 4x 8TB :)

Kind regards
Miroslav Lachman
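PS: to rule out the "too many snapshots / not enough memory" theory above, it
might be worth simply counting them, e.g.:

# zfs list -H -t snapshot -o name -r data/vm/media/disk1 | wc -l
# zfs list -H -t snapshot -o name | wc -l

(-H drops the header so wc counts only the snapshot names; the first command
counts just the dataset from your example, the second one everything on the
system.)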