Date: Sat, 25 Nov 2017 10:57:38 -0700
From: Warner Losh <imp@bsdimp.com>
To: Andriy Gapon <avg@freebsd.org>
Cc: Scott Long <scottl@samsco.org>, FreeBSD FS <freebsd-fs@freebsd.org>, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
Message-ID: <CANCZdfrZfuKZAMURu-biRMYYDD_=05ODbevsWEF9uZayvdnaQg@mail.gmail.com>
In-Reply-To: <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org>
References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <CANCZdfoE5UWMC6v4bbov6zizvcEMCbrSdGeJ019axCUfS_T_6w@mail.gmail.com> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <c9a96004-9998-c96d-efd7-d7e510c3c460@FreeBSD.org> <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org> <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org>
On Sat, Nov 25, 2017 at 10:36 AM, Andriy Gapon <avg@freebsd.org> wrote:
> Timestamp of the first error is Jun 16 10:40:18.
> Timestamp of the last error is Jun 16 10:40:27.
> So, it took additional 9 seconds to finally produce EIO.
> That disk is a part of a ZFS mirror.  If the request was failed after
> the first attempt, then ZFS would be able to get the data from a good
> disk much sooner.
>
> And don't take me wrong, I do NOT want CAM or GEOM to make that
> decision by itself.  I want ZFS to be able to tell the lower layers
> when they should try as hard as they normally do and when they should
> report an I/O error as soon as it happens without any retries.

Let's walk through this. You see that it takes a long time to fail an I/O. Perfectly reasonable observation. There are two reasons for this. One is that the disks take a while to make an attempt to get the data. The second is that the system has a global policy that's biased towards 'recover the data' over 'fail fast'. These can be fixed by reducing the timeouts, or by lowering the read-retry count for a given drive or globally, as a policy decision made by the system administrator.

It may be perfectly reasonable to ask the lower layers to 'fail fast' and have either a hard or a soft deadline on the I/O for a subset of I/Os. A hard deadline would return ETIMEDOUT or something similar when it's passed and cancel the I/O. This gives better determinism in the system, but some systems can't cancel just one I/O (like SATA drives), so we have to flush the whole queue. If we get a lot of these, performance suffers. However, for some classes of drives, you know that if an I/O doesn't succeed in 1s after you submit it to the drive, it's unlikely to complete successfully, and it's worth the performance hit on a drive that's already acting up.

You could instead have a soft timeout, which says 'don't take any additional recovery action after X time has elapsed and you get word about this I/O'.
This is similar to the hard timeout, but it just stops retrying after the deadline has passed. This scenario is easier on the other users of the drive, assuming the read-recovery operations aren't starving them. It's also easier to implement, but it has worse worst-case performance characteristics.

You aren't really asking to limit retries. You're asking the I/O subsystem to limit, where it can, the amount of time spent on an I/O so that you can try another one. Your means of doing this is to tell it not to retry. That's the wrong means. It shouldn't be phrased in the API as a 'NO RETRY' request. It should be a QoS request flag: fail fast. Part of why I'm being so difficult is that you don't understand this and are proposing a horrible API. It should have a different name.

The other reason is that I absolutely do not want to overload EIO. You must return a different error back up the stack. You've shown no interest in this in the past, which is also a needless argument. We've given good reasons, and you've pooh-poohed them with bad arguments.

Also, this isn't the data I asked for. I know things can fail slowly. I was asking for how it would improve systems running like this. As in "I implemented it, and was able to fail over to this other drive faster" or something like that. Actual drive failure scenarios vary widely, and optimizing for this one failure mode is unwise. It may be the right optimization, but it may not be. There are lots of tricky edges in this space.

Warner