Date:      Fri, 14 Jul 2023 17:34:36 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Alan Somers <asomers@freebsd.org>
Cc:        scsi@freebsd.org
Subject:   Re: ASC/ASCQ Review
Message-ID:  <CANCZdfr-y8HYBb6GCFqZ7LAarxUAGb36Y6j+bo+WiDwUT5uR7A@mail.gmail.com>
In-Reply-To: <CAOtMX2iwnpHL6b2-1D4N4Bi4eKoLnGK4=+gUowXGS_gtyDOkig@mail.gmail.com>
References:  <CANCZdfokEoRtNp0en=9pjLQSQ+jtmfwH3OOwz1z09VcwWpE+xg@mail.gmail.com> <CAOtMX2g4+SDWg9WKbwZcqh4GpRan593O6qtNf7feoVejVK0YyQ@mail.gmail.com> <CANCZdfq5qti5uzWLkZaQEpyd5Q255sQeaR_kC_OQinmE9Qcqaw@mail.gmail.com> <CAOtMX2iwnpHL6b2-1D4N4Bi4eKoLnGK4=+gUowXGS_gtyDOkig@mail.gmail.com>

On Fri, Jul 14, 2023 at 12:31 PM Alan Somers <asomers@freebsd.org> wrote:

> On Fri, Jul 14, 2023 at 11:05 AM Warner Losh <imp@bsdimp.com> wrote:
> >
> >
> >
> > On Fri, Jul 14, 2023, 11:12 AM Alan Somers <asomers@freebsd.org> wrote:
> >>
> >> On Thu, Jul 13, 2023 at 12:14 PM Warner Losh <imp@bsdimp.com> wrote:
> >> >
> >> > Greetings,
> >> >
> >> > I've been looking closely at failed drives for $WORK lately. I've
> >> > noticed that a lot of errors that kinda sound like fatal errors
> >> > have SS_RDEF set on them.
> >> >
> >> > What's the process for evaluating whether those error codes are
> >> > worth retrying? There are several errors that we seem to be seeing
> >> > (preliminary read of the data) before the drive gives up the ghost
> >> > altogether. For those cases, I'd like to post more specific lists.
> >> > Should I do that here?
> >> >
> >> > Independent of that, I may want to have a more aggressive 'fail
> >> > fast' policy than is appropriate for my work load (we have a lot
> >> > of data that's a copy of a copy of a copy, so if we lose it, we
> >> > don't care: we'll just delete any files we can't read and get on
> >> > with life, though I know others will have a more conservative
> >> > attitude towards data that might be precious and unique). I can
> >> > set the number of retries lower, I can do some other hacks for
> >> > disks that tell the disk to fail faster, but I think part of the
> >> > solution is going to have to be failing for some
> >> > sense-code/ASC/ASCQ tuples that we don't want to fail in upstream
> >> > or the general case. I was thinking of identifying those and
> >> > creating a 'global quirk table' that gets applied after the
> >> > drive-specific quirk table that would let $WORK override the
> >> > defaults, while letting others keep the current behavior. IMHO, it
> >> > would be better to have these separate rather than in the global
> >> > data for tracking upstream...
> >> >
> >> > Is that clear, or should I give concrete examples?
> >> >
> >> > Comments?
> >> >
> >> > Warner
> >>
> >> Basically, you want to change the retry counts for certain ASC/ASCQ
> >> codes only, on a site-by-site basis?  That sounds reasonable.  Would
> >> it be configurable at runtime or only at build time?
> >
> >
> > I'd like to change the default actions. But maybe we just do that
> > for everyone and assume modern drives...
> >
> >> Also, I've been thinking lately that it would be real nice if READ
> >> UNRECOVERABLE could be translated to EINTEGRITY instead of EIO.  That
> >> would let consumers know that retries are pointless, but that the data
> >> is probably healable.
> >
> >
> > Unlikely, unless you've tuned things to not try for long at recovery...
> >
> > But regardless... do you have a concrete example of a use case?
> > There's a number of places that map any error to EIO. And I'd like a
> > use case before we expand the errors the lower layers return...
> >
> > Warner
>
> My first use-case is a user-space FUSE file system.  It only has
> access to errnos, not ASC/ASCQ codes.  If we do as I suggest, then it
> could heal a READ UNRECOVERABLE by rewriting the sector, whereas other
> EIO errors aren't likely to be healed that way.
>

Yea... but READ UNRECOVERABLE is kinda hit or miss...


> My second use-case is ZFS.  zfsd treats checksum errors differently
> from I/O errors.  A checksum error normally means that a read returned
> wrong data.  But I think that READ UNRECOVERABLE should also count.
> After all, that means that the disk's media returned wrong data which
> was detected by the disk's own EDC/ECC.  I've noticed that zfsd seems
> to fault disks too eagerly when their only problem is READ
> UNRECOVERABLE errors.  Mapping it to EINTEGRITY, or even a new error
> code, would let zfsd be tuned better.
>

EINTEGRITY would then mean two different things. UFS returns it when
checksums fail on critical filesystem metadata. I'm not saying no, per se,
just that it conflates two different errors.

I think both of these use cases would be better served by CAM's publishing
of the errors to devctl today. Here's some example data from a system I'm
looking at:

system=CAM subsystem=periph type=timeout device=da36 serial="12345" cam_status="0x44b" timeout=30000 CDB="28 00 4e b7 cb a3 00 04 cc 00 " timestamp=1634739729.312068
system=CAM subsystem=periph type=timeout device=da36 serial="12345" cam_status="0x44b" timeout=30000 CDB="28 00 20 6b d5 56 00 00 c0 00 " timestamp=1634739729.585541
system=CAM subsystem=periph type=error device=da36 serial="12345" cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" CDB="28 00 ad 1a 35 96 00 00 56 00 " timestamp=1641979267.469064
system=CAM subsystem=periph type=error device=da36 serial="12345" cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" CDB="28 00 ad 1a 35 96 00 01 5e 00 " timestamp=1642252539.693699
system=CAM subsystem=periph type=error device=da39 serial="12346" cam_status="0x4cc" scsi_status=2 scsi_sense="72 04 02 00" CDB="2a 00 01 2b c8 f6 00 07 81 00 " timestamp=1669603144.090835

Here we get the sense key, the ASC, and the ASCQ in the scsi_sense data
(I'm currently looking at expanding this to the entire sense buffer, since
it includes how hard the drive tried to read the data on media and hardware
errors). It doesn't include NVMe data, but does include ATA data (I'll have
to add the NVMe data, now that I've noticed it is missing). With the sense
data and the CDB you know what kind of error you got, plus what block
didn't read/write correctly. With the extended sense data, you can find out
even more details that are sense-key dependent...
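
For example, the scsi_sense="72 03 11 00" events above are descriptor-format
sense (response code 0x72), which carries the sense key in byte 1, the ASC
in byte 2, and the ASCQ in byte 3: sense key 3 (MEDIUM ERROR), ASC/ASCQ
0x11/0x00, an unrecovered read error. A minimal consumer-side sketch of
that decode (just the arithmetic, not code from the tree):

/*
 * Decode the 4-byte scsi_sense attribute from the devctl events above.
 * Descriptor-format sense (0x72/0x73) puts key/ASC/ASCQ in bytes 1-3;
 * fixed-format sense (0x70/0x71) puts them in bytes 2, 12, and 13.
 */
#include <stdio.h>

int
main(void)
{
	const char *scsi_sense = "72 03 11 00";	/* from the event above */
	unsigned int b[4];

	if (sscanf(scsi_sense, "%x %x %x %x", &b[0], &b[1], &b[2], &b[3]) != 4)
		return (1);
	if (b[0] == 0x72 || b[0] == 0x73)
		/* Prints: sense key 0x3 asc 0x11 ascq 0x0 (MEDIUM ERROR,
		 * unrecovered read error). */
		printf("sense key 0x%x asc 0x%x ascq 0x%x\n",
		    b[1] & 0x0f, b[2], b[3]);
	else
		printf("fixed-format sense; need bytes 2, 12 and 13\n");
	return (0);
}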

So I'm unsure that shoehorning our imperfect knowledge of what's retriable,
what's fixable, and what should be written with zeros into the kernel, and
converting that to a separate errno, would give good results; having
daemons that want to make more nuanced calls about disks tap into this
stream might be the better way to go. One of the things I'm planning for
$WORK is to enable the retry time limit in one of the mode pages so that we
fail faster and can just delete the file with the 'bad' block, rather than
getting the error eventually after the full, default error processing runs,
since that 'slow path' processing kills performance for all other users of
the drive... I'm unsure how well that will work out (and I know I'm lucky
that I can always recover any data for my application since it's just a
cache).
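
(For the curious: the page I mean is the Read-Write Error Recovery mode
page, code 0x01; 'camcontrol modepage daN -m 1 -e' lets you poke at it by
hand. A sketch of its layout per SBC, with field names from the spec
rather than from a FreeBSD header, so treat it as illustrative only:

/*
 * Read-Write Error Recovery mode page (0x01).  Lowering read_retry_count
 * and recovery_time_limit (big-endian milliseconds, 0 = drive default)
 * is the 'fail faster' knob mentioned above.
 */
#include <stdint.h>

struct read_write_error_recovery_page {
	uint8_t	page_code;		/* 0x01 */
	uint8_t	page_length;		/* 0x0a */
	uint8_t	error_recovery_flags;	/* AWRE, ARRE, TB, RC, EER, PER, DTE, DCR */
	uint8_t	read_retry_count;
	uint8_t	correction_span;
	uint8_t	head_offset_count;
	uint8_t	data_strobe_offset_count;
	uint8_t	reserved1;
	uint8_t	write_retry_count;
	uint8_t	reserved2;
	uint8_t	recovery_time_limit[2];	/* big-endian, milliseconds */
};
)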

I'd be interested to hear what others have to say here though, since my
focus on this data is through the lens of my rather specialized
application...

Warner

P.S. That was generated with this rule if you wanted to play with it...
You'd have to translate absolute disk blocks to a partition and an offset
into the filesystem, then give the filesystem a chance to tell you what of
its data/metadata that block is used for...

# Disk errors
notify 10 {
        match "system"          "CAM";
        match "subsystem"       "periph";
        match "device"          "[an]?da[0-9]+";
        action "logger -t diskerr -p daemon.info $_ timestamp=3D$timestamp"=
;
};
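
And if logger isn't enough, a daemon can tap the stream directly. A minimal
sketch, assuming devd's default seqpacket socket path (see devd(8)); a real
consumer would parse the key=value pairs rather than just substring-matching:

/*
 * Read raw devctl events from devd and print the CAM periph ones,
 * i.e. the same events the notify rule above matches.
 */
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_un sun;
	char ev[8192];
	ssize_t n;
	int fd;

	fd = socket(PF_LOCAL, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return (1);
	memset(&sun, 0, sizeof(sun));
	sun.sun_family = AF_LOCAL;
	strlcpy(sun.sun_path, "/var/run/devd.seqpacket.pipe",
	    sizeof(sun.sun_path));
	if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) < 0)
		return (1);
	while ((n = recv(fd, ev, sizeof(ev) - 1, 0)) > 0) {
		ev[n] = '\0';
		if (strstr(ev, "system=CAM") != NULL &&
		    strstr(ev, "subsystem=periph") != NULL)
			printf("%s\n", ev);
	}
	close(fd);
	return (0);
}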
