Date:      Fri, 14 Jul 2023 11:30:51 -0700
From:      Alan Somers <asomers@freebsd.org>
To:        Warner Losh <imp@bsdimp.com>
Cc:        scsi@freebsd.org
Subject:   Re: ASC/ASCQ Review
Message-ID:  <CAOtMX2iwnpHL6b2-1D4N4Bi4eKoLnGK4=+gUowXGS_gtyDOkig@mail.gmail.com>
In-Reply-To: <CANCZdfq5qti5uzWLkZaQEpyd5Q255sQeaR_kC_OQinmE9Qcqaw@mail.gmail.com>
References:  <CANCZdfokEoRtNp0en=9pjLQSQ+jtmfwH3OOwz1z09VcwWpE+xg@mail.gmail.com> <CAOtMX2g4+SDWg9WKbwZcqh4GpRan593O6qtNf7feoVejVK0YyQ@mail.gmail.com> <CANCZdfq5qti5uzWLkZaQEpyd5Q255sQeaR_kC_OQinmE9Qcqaw@mail.gmail.com>

On Fri, Jul 14, 2023 at 11:05 AM Warner Losh <imp@bsdimp.com> wrote:
>
>
>
> On Fri, Jul 14, 2023, 11:12 AM Alan Somers <asomers@freebsd.org> wrote:
>>
>> On Thu, Jul 13, 2023 at 12:14 PM Warner Losh <imp@bsdimp.com> wrote:
>> >
>> > Greetings,
>> >
>> > I've been looking closely at failed drives for $WORK lately. I've
>> > noticed that a lot of errors that kinda sound like fatal errors have
>> > SS_RDEF set on them.
>> >
>> > What's the process for evaluating whether those error codes are worth
>> > retrying? There are several errors that we seem to be seeing
>> > (preliminary read of the data) before the drive gives up the ghost
>> > altogether. For those cases, I'd like to post more specific lists.
>> > Should I do that here?
>> >
>> > Independent of that, I may want to have a more aggressive 'fail fast'
>> > policy than is appropriate for my work load (we have a lot of data
>> > that's a copy of a copy of a copy, so if we lose it, we don't care:
>> > we'll just delete any files we can't read and get on with life, though
>> > I know others will have a more conservative attitude towards data that
>> > might be precious and unique). I can set the number of retries lower,
>> > and I can do some other hacks that tell the disk to fail faster, but I
>> > think part of the solution is going to have to be failing for some
>> > sense-code/ASC/ASCQ tuples that we don't want to fail in upstream or
>> > the general case. I was thinking of identifying those and creating a
>> > 'global quirk table' that gets applied after the drive-specific quirk
>> > table, which would let $WORK override the defaults while letting
>> > others keep the current behavior. IMHO, it would be better to keep
>> > these separate rather than in the global data for tracking upstream...
>> >
>> > Is that clear, or should I give concrete examples?
>> >
>> > Comments?
>> >
>> > Warner
>>
>> Basically, you want to change the retry counts for certain ASC/ASCQ
>> codes only, on a site-by-site basis?  That sounds reasonable.  Would
>> it be configurable at runtime or only at build time?
>
>
> I'd like to change the default actions. But maybe we just do that for
> everyone and assume modern drives...
>
>> Also, I've been thinking lately that it would be real nice if READ
>> UNRECOVERABLE could be translated to EINTEGRITY instead of EIO.  That
>> would let consumers know that retries are pointless, but that the data
>> is probably healable.
>
>
> Unlikely, unless you've tuned things to not try for long at recovery...
>
> But regardless... do you have a concrete example of a use case? There's
> a number of places that map any error to EIO. And I'd like a use case
> before we expand the errors the lower layers return...
>
> Warner

My first use-case is a user-space FUSE file system.  It only has
access to errnos, not ASC/ASCQ codes.  If we do as I suggest, then it
could heal a READ UNRECOVERABLE by rewriting the sector, whereas other
EIO errors aren't likely to be healed that way.
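
To make that concrete, the server's error handling could look roughly
like this. This is only a sketch of the policy I have in mind, assuming
the proposed EINTEGRITY mapping were in place; the classifier function
and action names are made up for illustration, not anything in-tree:

```c
#include <errno.h>

#ifndef EINTEGRITY
#define EINTEGRITY 97   /* FreeBSD's value; not defined on other OSes */
#endif

/* What the FUSE file system should do after a failed pread(). */
enum read_action {
	RA_FAIL,	/* give up; propagate the error */
	RA_RETRY,	/* transient condition; issue the read again */
	RA_HEAL		/* drive's ECC caught bad media: rewrite the
			   sector from a redundant copy, then re-read */
};

enum read_action
classify_read_error(int err)
{
	switch (err) {
	case EINTEGRITY:
		/* Retries are pointless, but a rewrite can heal it. */
		return (RA_HEAL);
	case EINTR:
	case EAGAIN:
		return (RA_RETRY);
	default:
		/* EIO and friends: opaque failure, nothing smarter. */
		return (RA_FAIL);
	}
}
```

The point is that today EINTEGRITY and EIO collapse into the same
RA_FAIL bucket, so the heal path can never be taken.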

My second use-case is ZFS.  zfsd treats checksum errors differently
from I/O errors.  A checksum error normally means that a read returned
wrong data.  But I think that READ UNRECOVERABLE should also count.
After all, that means that the disk's media returned wrong data which
was detected by the disk's own EDC/ECC.  I've noticed that zfsd seems
to fault disks too eagerly when their only problem is READ
UNRECOVERABLE errors.  Mapping it to EINTEGRITY, or even a new error
code, would let zfsd be tuned better.
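
The translation itself would be tiny. Sketching it as a standalone
function (the function name is made up; in the tree this decision would
live in the CAM periph error handling, and I'm using the standard
MEDIUM ERROR sense key 0x03 with ASC 0x11, UNRECOVERED READ ERROR):

```c
#include <errno.h>

#ifndef EINTEGRITY
#define EINTEGRITY 97		/* FreeBSD's value; absent elsewhere */
#endif

#define SSD_KEY_MEDIUM_ERROR	0x03	/* SCSI sense key: MEDIUM ERROR */
#define ASC_UNRECOVERED_READ	0x11	/* ASC: UNRECOVERED READ ERROR */

/*
 * Hypothetical sense-to-errno mapping: an unrecovered read error means
 * the drive's own EDC/ECC already rejected the data, so report
 * EINTEGRITY (data present but failed its check) rather than the
 * catch-all EIO.  Everything else keeps today's behavior.
 */
int
sense_to_errno(int sense_key, int asc, int ascq)
{
	(void)ascq;	/* the 0x11/xx qualifiers are all the same failure */
	if (sense_key == SSD_KEY_MEDIUM_ERROR && asc == ASC_UNRECOVERED_READ)
		return (EINTEGRITY);
	return (EIO);
}
```

With something like that in place, zfsd could count these as checksum
events instead of generic I/O faults.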


