Date: Fri, 15 Dec 2023 01:05:05 +0100
From: Miroslav Lachman <000.fbsd@quip.cz>
To: Lexi Winter, "freebsd-fs@freebsd.org"
Subject: Re: unusual ZFS issue
Message-ID: <5d4ceb91-2046-4d2f-92b8-839a330c924a@quip.cz>
In-Reply-To: <787CB64A-1687-49C3-9063-2CE3B6F957EF@le-fay.org>

On 14/12/2023 22:17, Lexi Winter wrote:
> hi list,
>
> i’ve just hit this ZFS error:
>
> # zfs list -rt snapshot data/vm/media/disk1
> cannot iterate filesystems: I/O error
> NAME                                                       USED  AVAIL  REFER  MOUNTPOINT
> data/vm/media/disk1@autosnap_2023-12-13_12:00:00_hourly      0B      -  6.42G  -
> data/vm/media/disk1@autosnap_2023-12-14_10:16:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_11:17:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:04:00_monthly     0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:15:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_13:14:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_14:38:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_15:11:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:12:00_hourly    316K      -  6.47G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:29:00_daily    2.70M      -  6.47G  -
>
> the pool itself also reports an error:
>
> # zpool status -v
>   pool: data
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
>   scan: scrub in progress since Thu Dec 14 18:58:21 2023
>         11.5T / 18.8T scanned at 1.46G/s, 6.25T / 18.8T issued at 809M/s
>         0B repaired, 33.29% done, 04:30:20 to go
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         data          ONLINE       0     0     0
>           raidz2-0    ONLINE       0     0     0
>             da4p1     ONLINE       0     0     0
>             da6p1     ONLINE       0     0     0
>             da5p1     ONLINE       0     0     0
>             da7p1     ONLINE       0     0     0
>             da1p1     ONLINE       0     0     0
>             da0p1     ONLINE       0     0     0
>             da3p1     ONLINE       0     0     0
>             da2p1     ONLINE       0     0     0
>         logs
>           mirror-2    ONLINE       0     0     0
>             ada0p4    ONLINE       0     0     0
>             ada1p4    ONLINE       0     0     0
>         cache
>           ada1p5     ONLINE       0     0     0
>           ada0p5     ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
> (it doesn’t list any files, the output ends there.)
>
> my assumption is that this indicates some sort of metadata corruption issue, but i can’t find anything that might have caused it. none of the disks report any errors, and while all the disks are on the same SAS controller, i would have expected controller errors to be flagged as CKSUM errors.
>
> my best guess is that this might be caused by a CPU or memory issue, but the system has ECC memory and hasn’t reported any issues.
>
> - has anyone else encountered anything like this?

I've never seen "cannot iterate filesystems: I/O error". Could it be that
the system has too many snapshots / not enough memory to list them?

But I have seen a pool report an error in an unknown file while not
showing any READ / WRITE / CKSUM errors. This is from my notes taken
10 years ago:

=============================
# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad2     ONLINE       0     0     0
            ad3     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x2da>:<0x258ab13>
=============================

As you can see, there are no CKSUM errors, and where there should be a
path to a filename there is only <0x2da>:<0x258ab13>. Maybe it was an
error in a snapshot which had already been deleted? Just my guess.

I ran a scrub on that pool; it finished without any errors and the
status of the pool was OK afterwards. A similar error reappeared after
a month, and then again after about 6 months. The machine had ECC RAM.
After these 3 incidents, I never saw it again. I still have this machine
in working condition, only the disk drives were replaced, from 4x 1TB to
4x 4TB and then to 4x 8TB :)

Kind regards
Miroslav Lachman
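
PS: Regarding the too-many-snapshots theory, something like the
following could rule it out quickly. This is only a rough sketch, with
the pool and dataset names taken from your output, and the sysctl names
are the FreeBSD ones as far as I remember them:

# how many snapshots exist under the affected dataset and pool-wide
zfs list -H -o name -t snapshot -r data/vm/media/disk1 | wc -l
zfs list -H -o name -t snapshot -r data | wc -l

# current ARC size vs. the configured limit, to see whether the box
# was under memory pressure when the listing failed
sysctl kstat.zfs.misc.arcstats.size
sysctl vfs.zfs.arc_max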
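
About the <0x2da>:<0x258ab13> notation in my old notes: as far as I
understand it, the first number is the object ID of the dataset and the
second is the object inside that dataset; ZFS prints the raw IDs when it
can no longer resolve them into a path, which is why a deleted snapshot
or dataset is the usual suspect. If the dataset still existed, something
like this should map the IDs back to names (0x2da is 730 decimal,
0x258ab13 is 39365395); the dataset name below is just a placeholder:

# list every dataset in the pool together with its object ID
zdb -d tank

# once the dataset with ID 730 is identified, dump the damaged object
zdb -dddd tank/some/dataset 39365395

I never tried this on that machine, so take it as a guess, not a recipe.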
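
And for completeness, what I did back then to make the stale error
report go away was just the ordinary scrub cycle; from memory it was
roughly:

# scrub the pool and re-check once it completes
zpool scrub tank
zpool status -v tank

# the error list also keeps entries from the previous scrub, so a
# second scrub (or clearing the counters) may be needed before the
# report disappears
zpool clear tank

In your case the scrub is still running, so I would wait for it to
finish and look at zpool status -v again before worrying further.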