Date: Sat, 25 Nov 2017 19:36:19 +0200
From: Andriy Gapon <avg@FreeBSD.org>
To: Scott Long <scottl@samsco.org>
Cc: Warner Losh <imp@bsdimp.com>, FreeBSD FS <freebsd-fs@freebsd.org>,
    freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
Message-ID: <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org>
In-Reply-To: <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org>
References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org>
 <CANCZdfoE5UWMC6v4bbov6zizvcEMCbrSdGeJ019axCUfS_T_6w@mail.gmail.com>
 <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org>
 <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org>
 <c9a96004-9998-c96d-efd7-d7e510c3c460@FreeBSD.org>
 <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org>
On 25/11/2017 12:54, Scott Long wrote:
>
>
>> On Nov 24, 2017, at 10:17 AM, Andriy Gapon <avg@FreeBSD.org> wrote:
>>
>>
>>>> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first
>>>> read, get the error back sooner and try the other disk sooner.  Only
>>>> if I know that there are no other copies to try, then I would use
>>>> the normal read with all the retrying.
>>>>
>>>
>>> I agree with Warner that what you are proposing is not correct.  It
>>> weakens the contract between the disk layer and the upper layers,
>>> making it less clear who is responsible for retries and less clear
>>> what "EIO" means.  That contract is already weak due to poor design
>>> decisions in VFS-BIO and GEOM, and Warner and I are working on a plan
>>> to fix that.
>>
>> Well... I do realize now that there is some problem in this area, both
>> you and Warner mentioned it.  But knowing that it exists is not the
>> same as knowing what it is :-)
>> I understand that it could be rather complex and not easy to describe
>> in a short email...
>>
>
> There are too many questions to ask, I will do my best to keep the
> conversation logical.  First, how do you propose to distinguish between
> EIO due to a lengthy set of timeouts, vs EIO due to an immediate error
> returned by the disk hardware?

At what layer / component?
If I am the issuer of the request, then I know how I issued that request
and what kind of request it was.  If I am an intermediate layer, then why
would I care?

> CAM has an extensive table-driven error recovery protocol whose purpose
> is to decide whether or not to do retries based on hardware state
> information that is not made available to the upper layers.  Do you
> have a test case that demonstrates the problem that you're trying to
> solve?  Maybe the error recovery table is wrong and you're encountering
> a case that should not be retried.  If that's what's going on, we
> should fix CAM instead of inventing a new work-around.
Let's assume that I am talking about the case of not being able to read
an HDD sector that has gone bad.
Here is a real world example:

Jun 16 10:40:18 trant kernel: ahcich0: NCQ error, slot = 20, port = -1
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun 16 10:40:20 trant kernel: ahcich0: NCQ error, slot = 22, port = -1
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun 16 10:40:22 trant kernel: ahcich0: NCQ error, slot = 24, port = -1
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun 16 10:40:25 trant kernel: ahcich0: NCQ error, slot = 26, port = -1
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun 16 10:40:27 trant kernel: ahcich0: NCQ error, slot = 28, port = -1
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): Error 5, Retries exhausted

I do not see anything wrong in what CAM / ahci / ata_da did here.
They did what I would expect them to do: they tried very hard to get the
data that I told them I need.

Timestamp of the first error is Jun 16 10:40:18.
Timestamp of the last error is Jun 16 10:40:27.
So, it took an additional 9 seconds to finally produce EIO.
That disk is a part of a ZFS mirror.  If the request had failed after the
first attempt, then ZFS would have been able to get the data from the
good disk much sooner.

And don't take me wrong, I do NOT want CAM or GEOM to make that decision
by itself.  I want ZFS to be able to tell the lower layers when they
should try as hard as they normally do and when they should report an
I/O error as soon as it happens, without any retries.

> Second, what about disk subsystems that do retries internally, out of
> the control of the FreeBSD driver?  This would include most hardware
> RAID controllers.  Should what you are proposing only work for a subset
> of the kinds of storage systems that are available and in common use?

Yes.  I do not worry about things that are beyond my control.
Those subsystems would behave as they do now.  So, nothing would get
worse.

> Third, let's say that you run out of alternate copies to try, and as
> you stated originally, that will force you to retry the copies that had
> returned EIO.  How will you know when you can retry?  How will you know
> how many times you will retry?  How will you know that a retry is even
> possible?

I am continuing to use ZFS as an example.  It already has all the logic
built in.  If all vdev zio-s (e.g., the requests to the mirror members)
fail, then their parent zio (the logical read from the mirror) is
retried by ZFS, and when ZFS retries it sets a special flag
(ZIO_FLAG_IO_RETRY) on that zio and on its future child zio-s.
Essentially, my answer is: you have to program it correctly, there is no
magic.

> Should the retries be able to be canceled?

I think that this is an orthogonal question.  I do not have an answer
and I am not ready to discuss it at all.

> Why is overloading EIO so bad?  brelse() will call bdirty() when a
> BIO_WRITE command has failed with EIO.  Calling bdirty() has the effect
> of retrying the I/O.  This disregards the fact that disk drivers only
> return EIO when they've decided that the I/O cannot be retried.  It has
> no termination condition for the retries, and will endlessly retry I/O
> in vain; I've seen this quite frequently.  It also disregards the fact
> that I/O marked as B_PAGING can't be retried in this fashion, and will
> trigger a panic.  Because we pretend that EIO can be retried, we are
> left with a system that is very fragile when I/O actually does fail.
> Instead of adding more special cases and blurred lines, I want to go
> back to enforcing strict contracts between the layers and force the
> core parts of the system to respect those contracts and handle errors
> properly, instead of just retrying and hoping for the best.

So, I suggest that the buffer layer (all the b* functions) not use the
proposed flag.
Any problems that exist in it should be resolved first.  ZFS does not
use that layer.

>> But then, this flag is optional, it's off by default and no one is
>> forced to use it.  If it's used only by ZFS, then it would not be
>> horrible.  Unless it makes things very hard for the infrastructure.
>> But I am circling back to not knowing what problem(s) you and Warner
>> are planning to fix.
>>
>
> Saying that a feature is optional means nothing; while consumers of the
> API might be able to ignore it, the producers of the API cannot ignore
> it.  It is these producers who are sick right now and should be fixed,
> instead of creating new ways to get even more sick.

I completely agree.  But which producers of the API do you mean
specifically?  So far, you have mentioned only the consumer-level
problems with the b-layer.

Having said all of the above, I must admit one thing.  When I proposed
BIO_NORETRY I had only the simplest GEOM topology in mind:
ZFS -> [partition] -> disk.  Now that I have started to think about more
complex topologies, I am not really sure whether the flag can be
implemented reasonably and clearly, or whether it will create a big
mess.  E.g., things like gmirrors on top of gmirrors and so on.

Maybe the flag, if it is ever accepted, should never be propagated
automatically.  Only geom-s that are aware of it should propagate or
request it.  That should be safer.

-- 
Andriy Gapon