Date:      Wed, 19 Jul 2023 09:41:37 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Alan Somers <asomers@freebsd.org>
Cc:        scsi@freebsd.org
Subject:   Re: ASC/ASCQ Review
Message-ID:  <CANCZdfptEG=+xa3m31Ngre26ZQxZ_Fqsfjmh+tVHgP2XpqhZ7g@mail.gmail.com>
In-Reply-To: <CANCZdfr-y8HYBb6GCFqZ7LAarxUAGb36Y6j+bo+WiDwUT5uR7A@mail.gmail.com>
References:  <CANCZdfokEoRtNp0en=9pjLQSQ+jtmfwH3OOwz1z09VcwWpE+xg@mail.gmail.com> <CAOtMX2g4+SDWg9WKbwZcqh4GpRan593O6qtNf7feoVejVK0YyQ@mail.gmail.com> <CANCZdfq5qti5uzWLkZaQEpyd5Q255sQeaR_kC_OQinmE9Qcqaw@mail.gmail.com> <CAOtMX2iwnpHL6b2-1D4N4Bi4eKoLnGK4=+gUowXGS_gtyDOkig@mail.gmail.com> <CANCZdfr-y8HYBb6GCFqZ7LAarxUAGb36Y6j+bo+WiDwUT5uR7A@mail.gmail.com>

BTW, it also occurs to me that if I do add a 'secondary' table, then you
could use it to generate a unique errno and experiment with that without
affecting the main code until that stuff was mature.

I'm not sure I'll do that now, since I've found maybe 10 asc/ascq pairs
that I'd like to tag as 'if trying harder, retry, otherwise fail'. Retry
needs have changed a lot since CAM was written in the late 90s, and at
least some of the asc/ascq pairs I'm looking at haven't changed since the
initial import, but that's based on a tiny sampling of the data I have and
is preliminary at best. I may just change the defaults to reflect modern
usage.
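
For concreteness, here's a rough sketch of the shape I have in mind.
Nothing like this is in the tree: the SST() initializer, the SS_* action
flags, and the entry layout mirror the existing asc_table in
sys/cam/scsi/scsi_all.c, but the table name and the idea of a separate
override pass (or of mapping entries to a distinct errno to experiment
with) are made up for illustration:

/*
 * Hypothetical site-local override table, consulted after the per-drive
 * quirk entries.  Entries follow the same format as asc_table[] in
 * scsi_all.c; only the handful of asc/ascq pairs listed here would have
 * their default action changed.
 */
static struct asc_table_entry secondary_asc_table[] = {
        /* 11/00 Unrecovered read error: fail fast instead of SS_RDEF retries */
        { SST(0x11, 0x00, SS_FATAL | EIO,
            "Unrecovered read error") },
        /* 0C/02 Write error - auto reallocation failed */
        { SST(0x0C, 0x02, SS_FATAL | EIO,
            "Write error - auto reallocation failed") },
};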

Warner

On Fri, Jul 14, 2023 at 5:34 PM Warner Losh <imp@bsdimp.com> wrote:

>
>
> On Fri, Jul 14, 2023 at 12:31 PM Alan Somers <asomers@freebsd.org> wrote:
>
>> On Fri, Jul 14, 2023 at 11:05 AM Warner Losh <imp@bsdimp.com> wrote:
>> >
>> >
>> >
>> > On Fri, Jul 14, 2023, 11:12 AM Alan Somers <asomers@freebsd.org> wrote:
>> >>
>> >> On Thu, Jul 13, 2023 at 12:14 PM Warner Losh <imp@bsdimp.com> wrote:
>> >> >
>> >> > Greetings,
>> >> >
>> >> > I've been looking closely at failed drives for $WORK lately. I've
>> >> > noticed that a lot of errors that kinda sound like fatal errors have
>> >> > SS_RDEF set on them.
>> >> >
>> >> > What's the process for evaluating whether those error codes are
>> >> > worth retrying? There are several errors that we seem to be seeing
>> >> > (preliminary read of the data) before the drive gives up the ghost
>> >> > altogether. For those cases, I'd like to post more specific lists.
>> >> > Should I do that here?
>> >> >
>> >> > Independent of that, I may want to have a more aggressive 'fail
>> >> > fast' policy than is appropriate for my work load (we have a lot of
>> >> > data that's a copy of a copy of a copy, so if we lose it, we don't
>> >> > care: we'll just delete any files we can't read and get on with
>> >> > life, though I know others will have a more conservative attitude
>> >> > towards data that might be precious and unique). I can set the
>> >> > number of retries lower, and I can do some other hacks for disks
>> >> > that tell the disk to fail faster, but I think part of the solution
>> >> > is going to have to be failing for some sense-code/ASC/ASCQ tuples
>> >> > that we don't want to fail in upstream or the general case. I was
>> >> > thinking of identifying those and creating a 'global quirk table'
>> >> > that gets applied after the drive-specific quirk table and that
>> >> > would let $WORK override the defaults, while letting others keep
>> >> > the current behavior. IMHO, it would be better to have these
>> >> > separate rather than in the global data for tracking upstream...
>> >> >
>> >> > Is that clear, or should I give concrete examples?
>> >> >
>> >> > Comments?
>> >> >
>> >> > Warner
>> >>
>> >> Basically, you want to change the retry counts for certain ASC/ASCQ
>> >> codes only, on a site-by-site basis?  That sounds reasonable.  Would
>> >> it be configurable at runtime or only at build time?
>> >
>> >
>> > I'd like to change the default actions. But maybe we just do that for
>> > everyone and assume modern drives...
>> >
>> >> Also, I've been thinking lately that it would be real nice if READ
>> >> UNRECOVERABLE could be translated to EINTEGRITY instead of EIO.  That
>> >> would let consumers know that retries are pointless, but that the data
>> >> is probably healable.
>> >
>> >
>> > Unlikely, unless you've tuned things to not try for long at recovery...
>> >
>> > But regardless... do you have a concrete example of a use case? There
>> > are a number of places that map any error to EIO. And I'd like a use
>> > case before we expand the errors the lower layers return...
>> >
>> > Warner
>>
>> My first use-case is a user-space FUSE file system.  It only has
>> access to errnos, not ASC/ASCQ codes.  If we do as I suggest, then it
>> could heal a READ UNRECOVERABLE by rewriting the sector, whereas other
>> EIO errors aren't likely to be healed that way.
>>
>
> Yea... but READ UNRECOVERABLE is kinda hit or miss...
>
>
>> My second use-case is ZFS.  zfsd treats checksum errors differently
>> from I/O errors.  A checksum error normally means that a read returned
>> wrong data.  But I think that READ UNRECOVERABLE should also count.
>> After all, that means that the disk's media returned wrong data which
>> was detected by the disk's own EDC/ECC.  I've noticed that zfsd seems
>> to fault disks too eagerly when their only problem is READ
>> UNRECOVERABLE errors.  Mapping it to EINTEGRITY, or even a new error
>> code, would let zfsd be tuned better.
>>
>
> EINTEGRITY would then mean two different things. UFS returns it when
> checksums fail for critical filesystem errors. I'm not saying no, per se,
> just that it conflates two different errors.
>
> I think both of these use cases would be better served by CAM's publishing
> of the errors to devctl today. Here's some example data from a system I'm
> looking at:
>
> system=CAM subsystem=periph type=timeout device=da36 serial="12345"
> cam_status="0x44b" timeout=30000 CDB="28 00 4e b7 cb a3 00 04 cc 00 "
>  timestamp=1634739729.312068
> system=CAM subsystem=periph type=timeout device=da36 serial="12345"
> cam_status="0x44b" timeout=30000 CDB="28 00 20 6b d5 56 00 00 c0 00 "
>  timestamp=1634739729.585541
> system=CAM subsystem=periph type=error device=da36 serial="12345"
> cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" CDB="28 00 ad 1a
> 35 96 00 00 56 00 " timestamp=1641979267.469064
> system=CAM subsystem=periph type=error device=da36 serial="12345"
> cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" CDB="28 00 ad 1a
> 35 96 00 01 5e 00 "  timestamp=1642252539.693699
> system=CAM subsystem=periph type=error device=da39 serial="12346"
> cam_status="0x4cc" scsi_status=2 scsi_sense="72 04 02 00" CDB="2a 00 01 2b
> c8 f6 00 07 81 00 "  timestamp=1669603144.090835
>
> Here we get the sense key, the asc and the ascq in the scsi_sense data
> (I'm currently looking at expanding this to the entire sense buffer, since
> it includes how hard the drive tried to read the data on media and hardware
> errors).  It doesn't include nvme data, but does include ata data (I'll
> have to add that data, now that I've noticed it is missing).  With the
> sense data and the CDB you know what kind of error you got, plus what block
> didn't read/write correctly. With the extended sense data, you can find out
> even more details that are sense-key dependent...
>
> So I'm unsure that trying to shoehorn our imperfect knowledge of what's
> retriable, fixable, or should be written with zeros into the kernel and
> converting that to a separate errno would give good results; letting
> daemons that want to make more nuanced calls about disks tap into this
> stream might be the better way to go. One of the things I'm planning for
> $WORK is to enable the retry time limit of one of the mode pages so that
> we fail faster and can just delete the file with the 'bad' block that we'd
> get eventually if we allowed the full, default error processing to run,
> but that 'slow path' processing kills performance for all other users of
> the drive...  I'm unsure how well that will work out (and I know I'm lucky
> that I can always recover any data for my application since it's just a
> cache).
>
> I'd be interested to hear what others have to say here though, since my
> focus on this data is through the lens of my rather specialized
> application...
>
> Warner
>
> P.S. That was generated with this rule if you wanted to play with it...
> You'd have to translate absolute disk blocks to a partition and an offset
> into the filesystem, then give the filesystem a chance to tell you what of
> its data/metadata that block is used for...
>
> # Disk errors
> notify 10 {
>         match "system"          "CAM";
>         match "subsystem"       "periph";
>         match "device"          "[an]?da[0-9]+";
>         action "logger -t diskerr -p daemon.info $_ timestamp=3D$timestam=
p";
> };
>
>
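
P.S. If anyone wants to play with consuming that event stream, here's a
rough, untested sketch of pulling the sense key / asc / ascq out of the
scsi_sense="..." bytes shown in the events above. The events currently
carry only the first four bytes, which is enough for descriptor-format
sense (response code 0x72/0x73) as in the examples; fixed-format sense
(0x70/0x71) keeps asc/ascq at bytes 12 and 13, which is part of why I'd
like to publish the whole sense buffer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decode sense key / ASC / ASCQ from a space-separated hex string like the
 * scsi_sense value in the devctl events ("72 03 11 00" above is descriptor
 * format, MEDIUM ERROR, 0x11/0x00 unrecovered read error).  Returns 0 on
 * success, -1 if the buffer is too short or the format is unrecognized.
 */
static int
decode_sense(const char *hex, int *key, int *asc, int *ascq)
{
        unsigned int b[32];
        int n = 0;
        char *copy, *p, *tok;

        if ((copy = strdup(hex)) == NULL)
                return (-1);
        for (p = copy; (tok = strsep(&p, " ")) != NULL && n < 32;) {
                if (*tok != '\0')
                        b[n++] = strtoul(tok, NULL, 16);
        }
        free(copy);
        if (n < 4)
                return (-1);
        switch (b[0] & 0x7f) {
        case 0x72:      /* descriptor format, current */
        case 0x73:      /* descriptor format, deferred */
                *key = b[1] & 0x0f;
                *asc = b[2];
                *ascq = b[3];
                return (0);
        case 0x70:      /* fixed format, current */
        case 0x71:      /* fixed format, deferred */
                if (n < 14)
                        return (-1);
                *key = b[2] & 0x0f;
                *asc = b[12];
                *ascq = b[13];
                return (0);
        default:
                return (-1);
        }
}

int
main(void)
{
        int key, asc, ascq;

        /* "72 03 11 00" from the da36 events above */
        if (decode_sense("72 03 11 00", &key, &asc, &ascq) == 0)
                printf("key=%#x asc=%#x ascq=%#x\n", key, asc, ascq);
        return (0);
}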
