Date: Wed, 19 Jul 2023 09:41:37 -0600
From: Warner Losh <imp@bsdimp.com>
To: Alan Somers <asomers@freebsd.org>
Cc: scsi@freebsd.org
Subject: Re: ASC/ASCQ Review
Message-ID: <CANCZdfptEG=+xa3m31Ngre26ZQxZ_Fqsfjmh+tVHgP2XpqhZ7g@mail.gmail.com>
In-Reply-To: <CANCZdfr-y8HYBb6GCFqZ7LAarxUAGb36Y6j+bo+WiDwUT5uR7A@mail.gmail.com>
References: <CANCZdfokEoRtNp0en=9pjLQSQ+jtmfwH3OOwz1z09VcwWpE+xg@mail.gmail.com>
 <CAOtMX2g4+SDWg9WKbwZcqh4GpRan593O6qtNf7feoVejVK0YyQ@mail.gmail.com>
 <CANCZdfq5qti5uzWLkZaQEpyd5Q255sQeaR_kC_OQinmE9Qcqaw@mail.gmail.com>
 <CAOtMX2iwnpHL6b2-1D4N4Bi4eKoLnGK4=+gUowXGS_gtyDOkig@mail.gmail.com>
 <CANCZdfr-y8HYBb6GCFqZ7LAarxUAGb36Y6j+bo+WiDwUT5uR7A@mail.gmail.com>

btw, it also occurs to me that if I do add a 'secondary' table, then you
could use it to generate a unique errno and experiment with that w/o
affecting the main code until that stuff was mature.

I'm not sure I'll do that now, since I've found maybe 10 asc/ascq pairs
that I'd like to tag as 'if trying harder, retry, otherwise fail' since
re-retry needs have changed a lot since cam was written in the late 90s and
at least some of the asc/ascq pairs I'm looking at haven't changed since
the initial import, but that's based on a tiny sampling of the data I have
and is preliminary at best. I may just change it to reflect modern usage.

Warner

On Fri, Jul 14, 2023 at 5:34 PM Warner Losh <imp@bsdimp.com> wrote:

>
> On Fri, Jul 14, 2023 at 12:31 PM Alan Somers <asomers@freebsd.org> wrote:
>
>> On Fri, Jul 14, 2023 at 11:05 AM Warner Losh <imp@bsdimp.com> wrote:
>> >
>> > On Fri, Jul 14, 2023, 11:12 AM Alan Somers <asomers@freebsd.org> wrote:
>> >>
>> >> On Thu, Jul 13, 2023 at 12:14 PM Warner Losh <imp@bsdimp.com> wrote:
>> >> >
>> >> > Greetings,
>> >> >
>> >> > I've been looking closely at failed drives for $WORK lately. I've
>> >> > noticed that a lot of errors that kinda sound like fatal errors
>> >> > have SS_RDEF set on them.
>> >> >
>> >> > What's the process for evaluating whether those error codes are
>> >> > worth retrying? There are several errors that we seem to be seeing
>> >> > (preliminary read of the data) before the drive gives up the ghost
>> >> > altogether. For those cases, I'd like to post more specific lists.
>> >> > Should I do that here?
>> >> >
>> >> > Independent of that, I may want to have a more aggressive 'fail
>> >> > fast' policy than is appropriate for my work load (we have a lot
>> >> > of data that's a copy of a copy of a copy, so if we lose it, we
>> >> > don't care: we'll just delete any files we can't read and get on
>> >> > with life, though I know others will have a more conservative
>> >> > attitude towards data that might be precious and unique). I can
>> >> > set the number of retries lower, I can do some other hacks for
>> >> > disks that tell the disk to fail faster, but I think part of the
>> >> > solution is going to have to be failing for some
>> >> > sense-code/ASC/ASCQ tuples that we don't want to fail in upstream
>> >> > or the general case. I was thinking of identifying those and
>> >> > creating a 'global quirk table' that gets applied after the
>> >> > drive-specific quirk table that would let $WORK override the
>> >> > defaults, while letting others keep the current behavior. IMHO, it
>> >> > would be better to have these separate rather than in the global
>> >> > data for tracking upstream...
>> >> >
>> >> > Is that clear, or should I give concrete examples?
>> >> >
>> >> > Comments?
>> >> >
>> >> > Warner
>> >>
>> >> Basically, you want to change the retry counts for certain ASC/ASCQ
>> >> codes only, on a site-by-site basis?  That sounds reasonable.  Would
>> >> it be configurable at runtime or only at build time?
>> >
>> > I'd like to change the default actions. But maybe we just do that for
>> > everyone and assume modern drives...
>> >
>> >> Also, I've been thinking lately that it would be real nice if READ
>> >> UNRECOVERABLE could be translated to EINTEGRITY instead of EIO.  That
>> >> would let consumers know that retries are pointless, but that the data
>> >> is probably healable.
>> >
>> > Unlikely, unless you've tuned things to not try for long at recovery...
>> >
>> > But regardless... do you have a concrete example of a use case?
>> > There's a number of places that map any error to EIO.
>> > And I'd like a use case before we expand the errors the lower layers
>> > return...
>> >
>> > Warner
>>
>> My first use-case is a user-space FUSE file system.  It only has
>> access to errnos, not ASC/ASCQ codes.  If we do as I suggest, then it
>> could heal a READ UNRECOVERABLE by rewriting the sector, whereas other
>> EIO errors aren't likely to be healed that way.
>>
>
> Yea... but READ UNRECOVERABLE is kinda hit or miss...
>
>> My second use-case is ZFS.  zfsd treats checksum errors differently
>> from I/O errors.  A checksum error normally means that a read returned
>> wrong data.  But I think that READ UNRECOVERABLE should also count.
>> After all, that means that the disk's media returned wrong data which
>> was detected by the disk's own EDC/ECC.  I've noticed that zfsd seems
>> to fault disks too eagerly when their only problem is READ
>> UNRECOVERABLE errors.  Mapping it to EINTEGRITY, or even a new error
>> code, would let zfsd be tuned better.
>>
>
> EINTEGRITY would then mean two different things. UFS returns it when
> checksums fail for critical filesystem errors. I'm not saying no, per se,
> just that it conflates two different errors.
>
> I think both of these use cases would be better served by CAM's publishing
> of the errors to devctl today.
> Here's some example data from a system I'm looking at:
>
> system=CAM subsystem=periph type=timeout device=da36 serial="12345" cam_status="0x44b" timeout=30000 CDB="28 00 4e b7 cb a3 00 04 cc 00 " timestamp=1634739729.312068
> system=CAM subsystem=periph type=timeout device=da36 serial="12345" cam_status="0x44b" timeout=30000 CDB="28 00 20 6b d5 56 00 00 c0 00 " timestamp=1634739729.585541
> system=CAM subsystem=periph type=error device=da36 serial="12345" cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" CDB="28 00 ad 1a 35 96 00 00 56 00 " timestamp=1641979267.469064
> system=CAM subsystem=periph type=error device=da36 serial="12345" cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" CDB="28 00 ad 1a 35 96 00 01 5e 00 " timestamp=1642252539.693699
> system=CAM subsystem=periph type=error device=da39 serial="12346" cam_status="0x4cc" scsi_status=2 scsi_sense="72 04 02 00" CDB="2a 00 01 2b c8 f6 00 07 81 00 " timestamp=1669603144.090835
>
> Here we get the sense key, the asc and the ascq in the scsi_sense data
> (I'm currently looking at expanding this to the entire sense buffer, since
> it includes how hard the drive tried to read the data on media and hardware
> errors).  It doesn't include nvme data, but does include ata data (I'll
> have to add that data, now that I've noticed it is missing).  With the
> sense data and the CDB you know what kind of error you got, plus what block
> didn't read/write correctly. With the extended sense data, you can find out
> even more details that are sense-key dependent...
>
> So I'm unsure that trying to shoehorn our imperfect knowledge of what's
> retriable, fixable, or should be written with zeros into the kernel and
> converting that to a separate errno would give good results. Tapping
> into this stream from daemons that want to make more nuanced calls about
> disks might be the better way to go. One of the things I'm planning for
> $WORK is to enable the retry time limit of one of the mode pages so that
> we fail faster and can just delete the file with the 'bad' block that we'd
> get eventually if we allowed the full, default error processing to run, but
> that 'slow path' processing kills performance for all other users of the
> drive...  I'm unsure how well that will work out (and I know I'm lucky that
> I can always recover any data for my application since it's just a cache).
>
> I'd be interested to hear what others have to say here though, since my
> focus on this data is through the lens of my rather specialized
> application...
>
> Warner
>
> P.S. That was generated with this rule if you wanted to play with it...
> You'd have to translate absolute disk blocks to a partition and an offset
> into the filesystem, then give the filesystem a chance to tell you what of
> its data/metadata that block is used for...
>
> # Disk errors
> notify 10 {
>         match "system"          "CAM";
>         match "subsystem"       "periph";
>         match "device"          "[an]?da[0-9]+";
>         action "logger -t diskerr -p daemon.info $_ timestamp=$timestamp";
> };
>
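
For a daemon that wants to tap into the devctl stream described above, the following is a minimal Python sketch of decoding one logged event. It is an illustration only, not code from CAM or from this thread: the function names (`parse_event`, `decode_sense`, `decode_cdb`) are hypothetical, the four `scsi_sense` bytes are assumed to be response code, sense key, ASC and ASCQ in that order (which matches the examples quoted above), and only the 10-byte READ/WRITE opcodes seen in those examples are handled.

```python
import shlex

# Sense-key names for the two keys seen in the quoted examples (subset only).
SENSE_KEYS = {0x03: "MEDIUM ERROR", 0x04: "HARDWARE ERROR"}
# Only the 10-byte read/write opcodes that appear in the quoted examples.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# One event copied from the example data in the message.
SAMPLE = ('system=CAM subsystem=periph type=error device=da36 serial="12345" '
          'cam_status="0x4cc" scsi_status=2 scsi_sense="72 03 11 00" '
          'CDB="28 00 ad 1a 35 96 00 00 56 00 " timestamp=1641979267.469064')


def parse_event(line):
    """Split one devctl line into a dict; shlex keeps quoted values whole."""
    return dict(tok.split("=", 1) for tok in shlex.split(line) if "=" in tok)


def decode_sense(event):
    """Return (sense_key, asc, ascq), or None for events without sense data."""
    if "scsi_sense" not in event:
        return None  # e.g. type=timeout events carry no sense bytes
    raw = bytes.fromhex(event["scsi_sense"])
    # Assumed layout: response code, sense key, ASC, ASCQ.
    return raw[1] & 0x0F, raw[2], raw[3]


def decode_cdb(event):
    """Return (name, lba, blocks) for READ(10)/WRITE(10), else None."""
    if "CDB" not in event:
        return None
    cdb = bytes.fromhex(event["CDB"])
    name = OPCODES.get(cdb[0]) if cdb else None
    if name is None or len(cdb) < 10:
        return None
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return name, lba, blocks


if __name__ == "__main__":
    ev = parse_event(SAMPLE)
    key, asc, ascq = decode_sense(ev)
    print(ev["device"], SENSE_KEYS.get(key, hex(key)),
          f"asc={asc:#04x} ascq={ascq:#04x}")
    print(decode_cdb(ev))
```

The LBA recovered this way is an absolute disk block, so, as the P.S. notes, it still has to be translated to a partition and a filesystem offset before anything can be said about which file or metadata is affected.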