Date: Fri, 15 Dec 2023 01:05:05 +0100
From: Miroslav Lachman <000.fbsd@quip.cz>
To: Lexi Winter, "freebsd-fs@freebsd.org"
Subject: Re: unusual ZFS issue
Message-ID: <5d4ceb91-2046-4d2f-92b8-839a330c924a@quip.cz>
In-Reply-To: <787CB64A-1687-49C3-9063-2CE3B6F957EF@le-fay.org>

On 14/12/2023 22:17, Lexi Winter wrote:
> hi list,
>
> i’ve just hit this ZFS error:
>
> # zfs list -rt snapshot data/vm/media/disk1
> cannot iterate filesystems: I/O error
> NAME                                                       USED  AVAIL  REFER  MOUNTPOINT
> data/vm/media/disk1@autosnap_2023-12-13_12:00:00_hourly      0B      -  6.42G  -
> data/vm/media/disk1@autosnap_2023-12-14_10:16:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_11:17:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:04:00_monthly     0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_12:15:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_13:14:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_14:38:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_15:11:00_hourly      0B      -  6.46G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:12:00_hourly    316K      -  6.47G  -
> data/vm/media/disk1@autosnap_2023-12-14_17:29:00_daily    2.70M      -  6.47G  -
>
> the pool itself also reports an error:
>
> # zpool status -v
>   pool: data
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
>   scan: scrub in progress since Thu Dec 14 18:58:21 2023
>         11.5T / 18.8T scanned at 1.46G/s, 6.25T / 18.8T issued at 809M/s
>         0B repaired, 33.29% done, 04:30:20 to go
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         data          ONLINE       0     0     0
>           raidz2-0    ONLINE       0     0     0
>             da4p1     ONLINE       0     0     0
>             da6p1     ONLINE       0     0     0
>             da5p1     ONLINE       0     0     0
>             da7p1     ONLINE       0     0     0
>             da1p1     ONLINE       0     0     0
>             da0p1     ONLINE       0     0     0
>             da3p1     ONLINE       0     0     0
>             da2p1     ONLINE       0     0     0
>         logs
>           mirror-2    ONLINE       0     0     0
>             ada0p4    ONLINE       0     0     0
>             ada1p4    ONLINE       0     0     0
>         cache
>           ada1p5     ONLINE       0     0     0
>           ada0p5     ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
> (it doesn’t list any files, the output ends there.)
>
> my assumption is that this indicates some sort of metadata corruption issue, but i can’t find anything that might have caused it. none of the disks report any errors, and while all the disks are on the same SAS controller, i would have expected controller errors to be flagged as CKSUM errors.
>
> my best guess is that this might be caused by a CPU or memory issue, but the system has ECC memory and hasn’t reported any issues.
>
> - has anyone else encountered anything like this?

I've never seen "cannot iterate filesystems: I/O error". Could it be that
the system has too many snapshots / not enough memory to list them?

But I have seen a pool report an error in an unknown file while not
showing any READ / WRITE / CKSUM errors. This is from my notes taken
10 years ago:

=============================
# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad2     ONLINE       0     0     0
            ad3     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x2da>:<0x258ab13>
=============================

As you can see, there are no CKSUM errors, and where there should be a
path to a filename there is only <0x2da>:<0x258ab13>. Maybe it was an
error in a snapshot which had already been deleted? Just my guess.

I ran a scrub on that pool; it finished without any errors and the
status of the pool was OK afterwards. A similar error reappeared after
a month, and then again after about 6 months. The machine had ECC RAM.
After these 3 incidents, I never saw it again. I still have this machine
in working condition, only the disk drives were replaced, from 4x 1TB to
4x 4TB and then to 4x 8TB :)

Kind regards
Miroslav Lachman
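
PS: Regarding the too-many-snapshots theory, something like the
following could rule it out quickly. This is only a rough sketch, with
the pool and dataset names taken from your output, and the sysctl names
are the FreeBSD ones as far as I remember them:

# how many snapshots exist under the affected dataset and pool-wide
zfs list -H -o name -t snapshot -r data/vm/media/disk1 | wc -l
zfs list -H -o name -t snapshot -r data | wc -l

# current ARC size vs. the configured limit, to see whether the box
# was under memory pressure when the listing failed
sysctl kstat.zfs.misc.arcstats.size
sysctl vfs.zfs.arc_max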
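
About the <0x2da>:<0x258ab13> notation in my old notes: as far as I
understand it, the first number is the object ID of the dataset and the
second is the object inside that dataset; ZFS prints the raw IDs when it
can no longer resolve them into a path, which is why a deleted snapshot
or dataset is the usual suspect. If the dataset still existed, something
like this should map the IDs back to names (0x2da is 730 decimal,
0x258ab13 is 39365395); the dataset name below is just a placeholder:

# list every dataset in the pool together with its object ID
zdb -d tank

# once the dataset with ID 730 is identified, dump the damaged object
zdb -dddd tank/some/dataset 39365395

I never tried this on that machine, so take it as a guess, not a recipe.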
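
And for completeness, what I did back then to make the stale error
report go away was just the ordinary scrub cycle; from memory it was
roughly:

# scrub the pool and re-check once it completes
zpool scrub tank
zpool status -v tank

# the error list also keeps entries from the previous scrub, so a
# second scrub (or clearing the counters) may be needed before the
# report disappears
zpool clear tank

In your case the scrub is still running, so I would wait for it to
finish and look at zpool status -v again before worrying further.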