Date: Fri, 14 Jul 2023 12:05:38 -0600
From: Warner Losh <imp@bsdimp.com>
To: Alan Somers <asomers@freebsd.org>
Cc: scsi@freebsd.org
Subject: Re: ASC/ASCQ Review
Message-ID: <CANCZdfq5qti5uzWLkZaQEpyd5Q255sQeaR_kC_OQinmE9Qcqaw@mail.gmail.com>
In-Reply-To: <CAOtMX2g4%2BSDWg9WKbwZcqh4GpRan593O6qtNf7feoVejVK0YyQ@mail.gmail.com>
References: <CANCZdfokEoRtNp0en=9pjLQSQ%2BjtmfwH3OOwz1z09VcwWpE%2Bxg@mail.gmail.com> <CAOtMX2g4%2BSDWg9WKbwZcqh4GpRan593O6qtNf7feoVejVK0YyQ@mail.gmail.com>
[-- Attachment #1 --]
On Fri, Jul 14, 2023, 11:12 AM Alan Somers <asomers@freebsd.org> wrote:

> On Thu, Jul 13, 2023 at 12:14 PM Warner Losh <imp@bsdimp.com> wrote:
> >
> > Greetings,
> >
> > i've been looking closely at failed drives for $WORK lately. I've noticed that a lot of errors that kinda sound like fatal errors have SS_RDEF set on them.
> >
> > What's the process for evaluating whether those error codes are worth retrying. There are several errors that we seem to be seeing (preliminary read of the data) before the drive gives up the ghost altogether. For those cases, I'd like to post more specific lists. Should I do that here?
> >
> > Independent of that, I may want to have a more aggressive 'fail fast' policy than is appropriate for my work load (we have a lot of data that's a copy of a copy of a copy, so if we lose it, we don't care: we'll just delete any files we can't read and get on with life, though I know others will have a more conservative attitude towards data that might be precious and unique). I can set the number of retries lower, I can do some other hacks for disks that tell the disk to fail faster, but I think part of the solution is going to have to be failing for some sense-code/ASC/ASCQ tuples that we don't want to fail in upstream or the general case. I was thinking of identifying those and creating a 'global quirk table' that gets applied after the drive-specific quirk table that would let $WORK override the defaults, while letting others keep the current behavior. IMHO, it would be better to have these separate rather than in the global data for tracking upstream...
> >
> > Is that clear, or should I give concrete examples?
> >
> > Comments?
> >
> > Warner
>
> Basically, you want to change the retry counts for certain ASC/ASCQ
> codes only, on a site-by-site basis? That sounds reasonable. Would
> it be configurable at runtime or only at build time?

I'd like to change the default actions.
But maybe we just do that for everyone and assume modern drives...

> Also, I've been thinking lately that it would be real nice if READ
> UNRECOVERABLE could be translated to EINTEGRITY instead of EIO. That
> would let consumers know that retries are pointless, but that the data
> is probably healable.

Unlikely, unless you've tuned things to not try for long at recovery...

But regardless... do you have a concrete example of a use case? There's a number of places that map any error to EIO. And I'd like a use case before we expand the errors the lower layers return...

Warner
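
Editorial sketch, not part of the original message: the layered lookup discussed in the thread (drive-specific quirks first, then a site-wide override table, then the upstream defaults) might look roughly like the C below. Every name here (`sense_rule`, `sense_action`, `rule_find`, `sense_lookup`, `site_rules`, `default_rules`) is hypothetical and does not come from the CAM sources, and the table contents are invented for illustration only; they do not reflect the actions FreeBSD actually assigns. The precedence order is one reading of "applied after the drive-specific quirk table" and is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical action codes; these are NOT the real CAM SS_* flags. */
enum sense_action {
    ACT_DEFAULT_RETRY,  /* retry with the default count (loosely like SS_RDEF) */
    ACT_FAIL_FAST       /* give up immediately, no retries */
};

/* One (sense key, ASC, ASCQ) tuple and the action to take for it. */
struct sense_rule {
    uint8_t key;
    uint8_t asc;
    uint8_t ascq;
    enum sense_action action;
};

#define NRULES(t) (sizeof(t) / sizeof((t)[0]))

/* Illustrative site-wide overrides: fail fast on an unrecovered read error. */
static const struct sense_rule site_rules[] = {
    { 0x03, 0x11, 0x00, ACT_FAIL_FAST },
};

/* Illustrative stand-in for the upstream defaults (invented values). */
static const struct sense_rule default_rules[] = {
    { 0x03, 0x11, 0x00, ACT_DEFAULT_RETRY },
    { 0x02, 0x04, 0x01, ACT_DEFAULT_RETRY },
};

/* Return the first rule in tbl matching the tuple, or NULL if none does. */
static const struct sense_rule *
rule_find(const struct sense_rule *tbl, size_t n,
    uint8_t key, uint8_t asc, uint8_t ascq)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].key == key && tbl[i].asc == asc && tbl[i].ascq == ascq)
            return (&tbl[i]);
    }
    return (NULL);
}

/*
 * Consult the tables in precedence order: per-drive quirks, then the
 * site-wide overrides, then the defaults. Unknown tuples fall back to
 * the default retry behavior.
 */
static enum sense_action
sense_lookup(const struct sense_rule *drive, size_t ndrive,
    const struct sense_rule *site, size_t nsite,
    const struct sense_rule *dflt, size_t ndflt,
    uint8_t key, uint8_t asc, uint8_t ascq)
{
    const struct sense_rule *r;

    if ((r = rule_find(drive, ndrive, key, asc, ascq)) != NULL)
        return (r->action);
    if ((r = rule_find(site, nsite, key, asc, ascq)) != NULL)
        return (r->action);
    if ((r = rule_find(dflt, ndflt, key, asc, ascq)) != NULL)
        return (r->action);
    return (ACT_DEFAULT_RETRY);
}
```

With these tables, a lookup of (0x03, 0x11, 0x00) that includes `site_rules` resolves to `ACT_FAIL_FAST`, while the same lookup with an empty site table keeps the default. Keeping the site table as a separate array is what would let a local tree carry its overrides without touching the upstream defaults.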
Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfq5qti5uzWLkZaQEpyd5Q255sQeaR_kC_OQinmE9Qcqaw>
