Date: Fri, 15 Dec 2023 01:41:22 -0500
From: Rich <rincebrain@gmail.com>
To: Miroslav Lachman <000.fbsd@quip.cz>
Cc: Lexi Winter <lexi@le-fay.org>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: unusual ZFS issue
Message-ID: <CAOeNLuqNoK_UTjX4w5zSXGVxtQpB9BW7qhYfDY2cqEqu+Mypvg@mail.gmail.com>
In-Reply-To: <5d4ceb91-2046-4d2f-92b8-839a330c924a@quip.cz>
References: <787CB64A-1687-49C3-9063-2CE3B6F957EF@le-fay.org> <5d4ceb91-2046-4d2f-92b8-839a330c924a@quip.cz>
Native encryption decryption errors won't show up as READ/WRITE/CKSUM
errors, but they will show up as "things with errors" in the status
output.

That wouldn't be triggered by scrub noticing them, though, since scrub
doesn't decrypt things.

It's just the only thing I know of offhand where ZFS will decide there
are errors but the counters will be zero...
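One quick way to rule native encryption in or out (an illustrative
example, not from the original message; "encryption" and "keystatus"
are standard OpenZFS dataset properties, and "data" is the pool name
from the report quoted below):

# zfs get -r encryption,keystatus data

If every dataset reports encryption=off, this explanation doesn't apply.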
- Rich

On Thu, Dec 14, 2023 at 7:05 PM Miroslav Lachman <000.fbsd@quip.cz> wrote:
> On 14/12/2023 22:17, Lexi Winter wrote:
> > hi list,
> >
> > i've just hit this ZFS error:
> >
> > # zfs list -rt snapshot data/vm/media/disk1
> > cannot iterate filesystems: I/O error
> > NAME                                                       USED  AVAIL  REFER  MOUNTPOINT
> > data/vm/media/disk1@autosnap_2023-12-13_12:00:00_hourly     0B      -  6.42G  -
> > data/vm/media/disk1@autosnap_2023-12-14_10:16:00_hourly     0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_11:17:00_hourly     0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_12:04:00_monthly    0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_12:15:00_hourly     0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_13:14:00_hourly     0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_14:38:00_hourly     0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_15:11:00_hourly     0B      -  6.46G  -
> > data/vm/media/disk1@autosnap_2023-12-14_17:12:00_hourly   316K      -  6.47G  -
> > data/vm/media/disk1@autosnap_2023-12-14_17:29:00_daily   2.70M      -  6.47G  -
> >
> > the pool itself also reports an error:
> >
> > # zpool status -v
> >   pool: data
> >  state: ONLINE
> > status: One or more devices has experienced an error resulting in data
> >         corruption.  Applications may be affected.
> > action: Restore the file in question if possible.  Otherwise restore the
> >         entire pool from backup.
> >    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
> >   scan: scrub in progress since Thu Dec 14 18:58:21 2023
> >         11.5T / 18.8T scanned at 1.46G/s, 6.25T / 18.8T issued at 809M/s
> >         0B repaired, 33.29% done, 04:30:20 to go
> > config:
> >
> >         NAME        STATE     READ WRITE CKSUM
> >         data        ONLINE       0     0     0
> >           raidz2-0  ONLINE       0     0     0
> >             da4p1   ONLINE       0     0     0
> >             da6p1   ONLINE       0     0     0
> >             da5p1   ONLINE       0     0     0
> >             da7p1   ONLINE       0     0     0
> >             da1p1   ONLINE       0     0     0
> >             da0p1   ONLINE       0     0     0
> >             da3p1   ONLINE       0     0     0
> >             da2p1   ONLINE       0     0     0
> >         logs
> >           mirror-2  ONLINE       0     0     0
> >             ada0p4  ONLINE       0     0     0
> >             ada1p4  ONLINE       0     0     0
> >         cache
> >           ada1p5    ONLINE       0     0     0
> >           ada0p5    ONLINE       0     0     0
> >
> > errors: Permanent errors have been detected in the following files:
> >
> > (it doesn't list any files, the output ends there.)
> >
> > my assumption is that this indicates some sort of metadata corruption
> > issue, but i can't find anything that might have caused it. none of
> > the disks report any errors, and while all the disks are on the same
> > SAS controller, i would have expected controller errors to be flagged
> > as CKSUM errors.
> >
> > my best guess is that this might be caused by a CPU or memory issue,
> > but the system has ECC memory and hasn't reported any issues.
> >
> > - has anyone else encountered anything like this?
>
> I've never seen "cannot iterate filesystems: I/O error". Could it be
> that the system has too many snapshots / not enough memory to list
> them?
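(A quick way to sanity-check the too-many-snapshots theory, added here
as an illustrative aside; this is a standard zfs invocation:

# zfs list -H -t snapshot -o name | wc -l

It counts every snapshot on the system. An unusually large count, or
the zfs process growing until it fails, would point at resource
exhaustion rather than on-disk corruption.)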
> But I have seen a pool report an error in an unknown file without
> showing any READ / WRITE / CKSUM errors. This is from my notes taken
> 10 years ago:
>
> =============================
> # zpool status -v
>   pool: tank
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad0     ONLINE       0     0     0
>             ad1     ONLINE       0     0     0
>             ad2     ONLINE       0     0     0
>             ad3     ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         <0x2da>:<0x258ab13>
> =============================
>
> As you can see, there are no CKSUM errors, but there is something
> where a path to a filename should be: <0x2da>:<0x258ab13> (see the
> note after this mail). Maybe it was an error in a snapshot which was
> already deleted? Just my guess.
> I ran a scrub on that pool, it finished without any errors, and then
> the status of the pool was OK.
> A similar error reappeared after a month, and then again after about
> 6 months. The machine had ECC RAM. After these 3 incidents, I never
> saw it again. I still have this machine in working condition, just the
> disk drives were replaced from 4x 1TB to 4x 4TB and then 4x 8TB :)
>
> Kind regards
> Miroslav Lachman
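A note on the <0x2da>:<0x258ab13> form mentioned above (a hedged
sketch, not part of the original thread): when zpool status can no
longer resolve an error back to a file path, it prints the dataset
(objset) ID and the object ID in hex. Assuming the "tank" pool from the
notes above, zdb can usually map them back:

# zdb -d tank
(lists each dataset with its objset ID; 0x2da is 730 in decimal)

# zdb -dddd tank/SOMEDATASET 39365395
(0x258ab13 is 39365395 in decimal; dumps that object, including its
path, if it still exists)

"SOMEDATASET" is a placeholder for whichever dataset turns out to have
objset ID 730. If the object is gone, for example because the snapshot
holding it was destroyed, zdb will find nothing, which fits the
deleted-snapshot guess.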