Date:      Fri, 14 Jul 2023 10:12:34 -0700
From:      Alan Somers <asomers@freebsd.org>
To:        Warner Losh <imp@bsdimp.com>
Cc:        scsi@freebsd.org
Subject:   Re: ASC/ASCQ Review
Message-ID:  <CAOtMX2g4%2BSDWg9WKbwZcqh4GpRan593O6qtNf7feoVejVK0YyQ@mail.gmail.com>
In-Reply-To: <CANCZdfokEoRtNp0en=9pjLQSQ%2BjtmfwH3OOwz1z09VcwWpE%2Bxg@mail.gmail.com>
References:  <CANCZdfokEoRtNp0en=9pjLQSQ%2BjtmfwH3OOwz1z09VcwWpE%2Bxg@mail.gmail.com>

On Thu, Jul 13, 2023 at 12:14 PM Warner Losh <imp@bsdimp.com> wrote:
>
> Greetings,
>
> I've been looking closely at failed drives for $WORK lately. I've noticed that a lot of errors that kinda sound like fatal errors have SS_RDEF set on them.
>
> What's the process for evaluating whether those error codes are worth retrying? There are several errors that we seem to be seeing (preliminary read of the data) before the drive gives up the ghost altogether. For those cases, I'd like to post more specific lists. Should I do that here?
>
> Independent of that, I may want to have a more aggressive 'fail fast' policy than is appropriate for my work load (we have a lot of data that's a copy of a copy of a copy, so if we lose it, we don't care: we'll just delete any files we can't read and get on with life, though I know others will have a more conservative attitude towards data that might be precious and unique). I can set the number of retries lower, I can do some other hacks for disks that tell the disk to fail faster, but I think part of the solution is going to have to be failing for some sense-code/ASC/ASCQ tuples that we don't want to fail in upstream or the general case. I was thinking of identifying those and creating a 'global quirk table' that gets applied after the drive-specific quirk table that would let $WORK override the defaults, while letting others keep the current behavior. IMHO, it would be better to have these separate rather than in the global data for tracking upstream...
>
> Is that clear, or should I give concrete examples?
>
> Comments?
>
> Warner

Basically, you want to change the retry counts for certain ASC/ASCQ
codes only, on a site-by-site basis?  That sounds reasonable.  Would
it be configurable at runtime or only at build time?
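
To make sure I understand the layering, here is a rough sketch of the lookup order as I read it: drive-specific quirks first, then the site-wide overrides, then the stock defaults. Everything in it is hypothetical (the struct names, the ACT_* actions, the 0x44/0x00 placeholder tuple); it is not the existing CAM code, which uses SS_* flags and its own table layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical actions; the real CAM tables use SS_* flags instead. */
enum sense_action { ACT_RETRY, ACT_FAIL_FAST };

struct sense_entry {
	uint8_t asc;
	uint8_t ascq;
	enum sense_action action;
};

struct sense_table {
	const struct sense_entry *entries;
	size_t count;
};

/* Example data only: 0x44/0x00 is a placeholder tuple, not a recommendation. */
static const struct sense_entry default_entries[] = {
	{ 0x44, 0x00, ACT_RETRY },
};
static const struct sense_entry site_entries[] = {
	{ 0x44, 0x00, ACT_FAIL_FAST },
};
static const struct sense_table default_table = { default_entries, 1 };
static const struct sense_table site_table = { site_entries, 1 };

/* Linear search of one table; a NULL table simply has no match. */
static const struct sense_entry *
table_find(const struct sense_table *t, uint8_t asc, uint8_t ascq)
{
	if (t == NULL)
		return NULL;
	for (size_t i = 0; i < t->count; i++) {
		if (t->entries[i].asc == asc && t->entries[i].ascq == ascq)
			return &t->entries[i];
	}
	return NULL;
}

/*
 * First match wins: drive quirks, then site overrides, then defaults.
 * Tuples found in no table fall back to ACT_RETRY in this sketch.
 */
static enum sense_action
lookup_action(const struct sense_table *drive_quirks,
    const struct sense_table *site_overrides,
    const struct sense_table *defaults, uint8_t asc, uint8_t ascq)
{
	const struct sense_table *order[] = {
		drive_quirks, site_overrides, defaults
	};

	for (size_t i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
		const struct sense_entry *e = table_find(order[i], asc, ascq);
		if (e != NULL)
			return e->action;
	}
	return ACT_RETRY;
}
```

With the site table left out (NULL), the lookup falls through to the defaults, so everyone else keeps the current behavior. A build-time version would compile the site table in; a runtime version would need some way to load it, which is the part I'm asking about.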

Also, I've been thinking lately that it would be real nice if READ
UNRECOVERABLE could be translated to EINTEGRITY instead of EIO.  That
would let consumers know that retries are pointless, but that the data
is probably healable.
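
Roughly what I have in mind, as a standalone sketch and not a patch: the function name and KEY_MEDIUM_ERROR are made up for illustration, and the EINTEGRITY fallback define is only there so the snippet builds on systems whose errno.h lacks it. It matches only MEDIUM ERROR (sense key 0x03) with ASC/ASCQ 0x11/0x00, "unrecovered read error"; whether other tuples should get the same treatment is an open question.

```c
#include <errno.h>
#include <stdint.h>

#ifndef EINTEGRITY
#define EINTEGRITY 97	/* assumed fallback for systems without it */
#endif

#define KEY_MEDIUM_ERROR 0x03	/* hypothetical name for sense key 3h */

/*
 * Map a decoded sense triple to an errno.  Only an unrecovered read
 * error (MEDIUM ERROR, ASC 0x11, ASCQ 0x00) becomes EINTEGRITY; every
 * other error handled by this sketch stays EIO.
 */
static int
sense_to_errno(uint8_t sense_key, uint8_t asc, uint8_t ascq)
{
	if (sense_key == KEY_MEDIUM_ERROR && asc == 0x11 && ascq == 0x00)
		return EINTEGRITY;
	return EIO;
}
```

A consumer that sees EINTEGRITY could skip its own retries and go straight to repairing from redundancy, while EIO keeps its current meaning.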


