Date:        Tue, 12 Dec 2017 18:36:49 +0200
From:        Andriy Gapon <avg@FreeBSD.org>
To:          Warner Losh <imp@bsdimp.com>
Cc:          FreeBSD FS <freebsd-fs@freebsd.org>, freebsd-geom@freebsd.org, Scott Long <scottl@samsco.org>
Subject:     Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
Message-ID:  <38122ea9-ab20-f18a-90a2-04d2681e2ac9@FreeBSD.org>
In-Reply-To: <CANCZdfpLvMYBzvtU_ez_mOrkxk9LHf0sOvq4eHdDxgHgjf527A@mail.gmail.com>
References:  <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org>
             <CANCZdfoE5UWMC6v4bbov6zizvcEMCbrSdGeJ019axCUfS_T_6w@mail.gmail.com>
             <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org>
             <CANCZdfrBtYm_Jxcb6tXP+dtMq7dhRKmVOzvshG+yB++ARx1qOQ@mail.gmail.com>
             <f18e2760-85b9-2b5e-4269-edfe5468f9db@FreeBSD.org>
             <CANCZdfqo_nq7NQTR0nHELbUp5kKfWLszP_MJZQ1oAiSk8qpEtQ@mail.gmail.com>
             <9f23f97d-3614-e4d2-62fe-99723c5e3879@FreeBSD.org>
             <CANCZdfpLvMYBzvtU_ez_mOrkxk9LHf0sOvq4eHdDxgHgjf527A@mail.gmail.com>
On 26/11/2017 00:17, Warner Losh wrote:
> On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon <avg@freebsd.org
> <mailto:avg@freebsd.org>> wrote:
>
>     Before anything else, I would like to say that I got an impression that we
>     speak from so different angles that we either don't understand each other's
>     words or, even worse, misinterpret them.
>
> I understand what you are suggesting.  Don't take my disagreement with your
> proposal as willful misinterpretation.  You are proposing something that's a
> quick hack.

Very true.

> Maybe a useful one, but it's still problematic because it has the upper layers
> telling the lower layers what to do (don't do your retry), rather than what
> service to provide (I prefer a fast error exit over every effort to recover
> the data).

Also true.

> And it also does it by overloading the meaning of EIO, which has real problems
> that you've not been open to listening to, I assume due to your narrow use case
> apparently blinding you to the bigger picture issues with that route.

Quite likely.

> However, there's a way forward which I think will solve these objections.
> First, designate that I/O that fails due to short-circuiting the normal
> recovery process returns ETIMEDOUT.  The I/O stack currently doesn't use this
> at all (it was introduced for the network side of things).  This is a general
> catch-all for an I/O that we complete before the lower layers have given it
> the maximum amount of effort to recover the data, at the user's request.
> Next, don't use a flag.  Instead add a 32-bit field called bio_qos for quality
> of service hints and another 32-bit field for bio_qos_param.  This allows us
> to pass down specific quality of service desires from the filesystem to the
> lower layers.  The parameter will be unused in your proposal.
> BIO_QOS_FAIL_EARLY may be a good name for a value to set it to (at the moment,
> just use 1).  We'll assign the other QOS values later for other things.  It
> would allow us to implement the other sorts of QoS things I talked about as
> well.

That's a very interesting and workable suggestion.  I will try to work on it.
A rough sketch of how I read the suggestion is further down in this reply.

> As for B_FAILFAST, it's quite unlike what you're proposing, except in one
> incidental detail.  It's a complicated state machine that the sd driver in
> Solaris implemented.  It's an entire protocol.  When the device gets errors,
> it goes into this failfast state machine.  The state machine makes a
> determination that the errors are indicators the device is GONE, at least for
> the moment, and it will fail I/Os in various ways from there.  Any new I/Os
> that are submitted will be failed (there's conditional behavior here:
> depending on a global setting it's either all I/O or just B_FAILFAST I/O).

Yeah, I realized that B_FAILFAST was quite different from the first impression
that I got from its name.  Thank you for doing and sharing your analysis of how
it actually works.

> ZFS appears to set this bit for its discovery code only, when a device not
> being there would significantly delay things.

I think that ZFS sets the bit for all 'first-attempt' I/O.  It's the various
retries / recovery where this bit is not set.

> Anyway, when the device returns (basically an I/O gets through or maybe some
> other event happens), the driver exits this mode and returns to normal
> operation.  It appears to be designed not for the use case that you described,
> but rather for a drive that's failing all over the place, so that any pending
> I/Os get out of the way quickly.
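Coming back to the bio_qos suggestion above: to make sure I read it correctly,
here is a small user-space sketch of how I picture the two fields and the
early-failure path.  The simplified structure and the helper below are entirely
my own placeholders (only bio_qos, bio_qos_param and BIO_QOS_FAIL_EARLY come
from your description), so please treat it as an illustration, not as the
eventual patch:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder QoS hint values; only FAIL_EARLY is needed for my use case. */
#define	BIO_QOS_NONE		0
#define	BIO_QOS_FAIL_EARLY	1	/* prefer a fast error over full recovery */

/* Greatly simplified stand-in for struct bio, just to show the two fields. */
struct bio_sketch {
	int		error;		/* completion status */
	uint32_t	qos;		/* quality of service hint */
	uint32_t	qos_param;	/* per-hint parameter, unused for FAIL_EARLY */
};

/* Stand-in for the driver's error path: decide whether to keep retrying. */
static void
io_failed(struct bio_sketch *bp, int retries_left)
{
	if (bp->qos == BIO_QOS_FAIL_EARLY || retries_left == 0) {
		/*
		 * The request completes before the maximum recovery effort
		 * has been spent, so report ETIMEDOUT rather than EIO,
		 * as you suggested; a plain exhaustion of retries stays EIO.
		 */
		bp->error = (bp->qos == BIO_QOS_FAIL_EARLY) ? ETIMEDOUT : EIO;
		return;
	}
	/* ... otherwise the driver would re-queue the request here ... */
	bp->error = 0;
}

int
main(void)
{
	struct bio_sketch bp = { .qos = BIO_QOS_FAIL_EARLY, .qos_param = 0 };

	io_failed(&bp, 4);	/* first failure, retries still available */
	printf("completed with error %d (%s)\n", bp.error,
	    bp.error == ETIMEDOUT ? "ETIMEDOUT" : "other");
	return (0);
}

If I understand you correctly, the clone routines would then simply copy qos
and qos_param verbatim, so the GEOM classes in the middle do not have to
interpret them at all.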
> Your use case is only superficially similar to that use case, so the Solaris /
> Illumos experiences are mildly interesting, but due to the differences not a
> strong argument for doing this.  This facility in Illumos is interesting, but
> would require significantly more retooling of the lower I/O layers in FreeBSD
> to implement fully.  Plus Illumos (or maybe just Solaris) has a daemon that
> looks at failures to manage them at a higher level, which might make for a
> better user experience for FreeBSD, so that's something that needs to be
> weighed as well.

Okay.

> We've known for some time that HDD retry algorithms take a long time.  The
> same is true of some SSD or NVMe algorithms, but not all.  The other objection
> I have to the 'noretry' naming is that it bakes the currently observed HDD
> behavior and recovery into the API.  This is undesirable as other storage
> technologies have retry mechanisms that happen quite quickly (and sometimes in
> the drive itself).  The cutoff between fast and slow recovery is device
> specific, as are the methods used.  For example, there are new proposals out
> in NVMe (and maybe T10/T13 land) to have new types of READ commands that
> specify the quality of service you expect, including providing some sort of
> deadline hint to clip how much effort is expended in trying to recover the
> data.  It would be nice to design a mechanism that allows us to start using
> these commands when drives are available with them, and possibly using
> timeouts to allow for a faster abort.  Most of your HDD I/O will complete
> within maybe ~150ms, with a long tail out to maybe as long as ~400ms.  It
> might be desirable to set a policy that says 'don't let any I/Os remain in the
> device longer than a second' and use this mechanism to enforce that.  Or don't
> let any I/Os last more than 20x the most recent median I/O time.  A single bit
> is insufficiently expressive to allow these sorts of things, which is another
> reason for my objection to your proposal.  With the QOS fields being
> independent, the clone routines just copy them and make no judgement about
> them.

I now agree with this.  Thank you for the detailed explanation.

> So, those are my problems with your proposal, and also some hopefully useful
> ways to move forward.  I've chatted with others for years about introducing
> QoS things into the I/O stack, so I know most of the above won't be too
> contentious (though ETIMEDOUT I haven't socialized, so that may be an area of
> concern for people).

Thank you!

-- 
Andriy Gapon
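P.S.  On the deadline idea: if bio_qos_param later carries a per-request time
budget, I imagine it being computed along these lines.  This is purely an
illustration with made-up names (only the '20x the recent median' and the
one-second cap come from your examples):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical hint value: fail the I/O if it exceeds the given time budget. */
#define	BIO_QOS_DEADLINE	2

/*
 * Pick a time budget (in milliseconds) for a new request: 20x the most recent
 * median I/O time, clipped to a hard cap such as the 'no I/O stays in the
 * device longer than a second' policy.
 */
static uint32_t
io_budget_ms(uint32_t median_ms, uint32_t cap_ms)
{
	uint64_t budget = (uint64_t)median_ms * 20;

	return (budget > cap_ms ? cap_ms : (uint32_t)budget);
}

int
main(void)
{
	/* A healthy disk with a ~10ms median: budget is 20x the median. */
	printf("qos=%d param=%ums\n", BIO_QOS_DEADLINE,
	    (unsigned)io_budget_ms(10, 1000));
	/* A struggling disk with a ~150ms median: budget hits the 1s cap. */
	printf("qos=%d param=%ums\n", BIO_QOS_DEADLINE,
	    (unsigned)io_budget_ms(150, 1000));
	return (0);
}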