Date:        Sat, 25 Nov 2017 19:40:59 +0200
From:        Andriy Gapon <avg@FreeBSD.org>
To:          Warner Losh <imp@bsdimp.com>
Cc:          FreeBSD FS <freebsd-fs@freebsd.org>, freebsd-geom@freebsd.org, Scott Long <scottl@samsco.org>
Subject:     Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
Message-ID:  <9f23f97d-3614-e4d2-62fe-99723c5e3879@FreeBSD.org>
In-Reply-To: <CANCZdfqo_nq7NQTR0nHELbUp5kKfWLszP_MJZQ1oAiSk8qpEtQ@mail.gmail.com>
References:  <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <CANCZdfoE5UWMC6v4bbov6zizvcEMCbrSdGeJ019axCUfS_T_6w@mail.gmail.com> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <CANCZdfrBtYm_Jxcb6tXP+dtMq7dhRKmVOzvshG+yB++ARx1qOQ@mail.gmail.com> <f18e2760-85b9-2b5e-4269-edfe5468f9db@FreeBSD.org> <CANCZdfqo_nq7NQTR0nHELbUp5kKfWLszP_MJZQ1oAiSk8qpEtQ@mail.gmail.com>
Before anything else, I would like to say that I got the impression that we speak from such different angles that we either don't understand each other's words or, even worse, misinterpret them.

On 25/11/2017 18:36, Warner Losh wrote:
> On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon <avg@freebsd.org> wrote:
> > On 24/11/2017 18:33, Warner Losh wrote:
> > > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon <avg@freebsd.org> wrote:
> > > > On 24/11/2017 15:08, Warner Losh wrote:
> > > > > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon <avg@freebsd.org> wrote:
> > > > > >
> > > > > > https://reviews.freebsd.org/D13224
> > > > > >
> > > > > > Anyone interested is welcome to join the review.
> > > > >
> > > > > I think it's a really bad idea. It introduces a 'one-size-fits-all' notion of QoS that seems misguided. It conflates a shorter timeout with "don't retry". And why is retrying bad? It seems more a notion of 'fail fast' or some other concept. There are so many other ways you'd want to use it. And it uses the same return code (EIO) to mean something new. It has generally meant 'the lower layers have retried this, and it failed, do not submit it again as it will not succeed', and this conflates that with 'I gave it a half-assed attempt, and that failed, but resubmission might work'. This breaks a number of assumptions in the BUF/BIO layer, as well as parts of CAM, even more than they are broken now.
> > > > >
> > > > > So let's step back a bit: what problem is it trying to solve?
> > > >
> > > > A simple example. I have a mirror and I issue a read to one of its members. Let's assume there is some trouble with that particular block on that particular disk. The disk may spend a lot of time trying to read it and would still fail. With the current defaults I would wait 5x that time to finally get the error back. Then I go to another mirror member and get my data from there. IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read, get the error back sooner and try the other disk sooner. Only if I knew that there were no other copies to try would I use the normal read with all the retrying.
> > >
> > > It sounds like you are optimizing the wrong thing and taking an overly simplistic view of quality of service.
> > >
> > > First, failing blocks on a disk is fairly rare. Do you really want to optimize for that case?
> >
> > If it can be done without any harm to the sunny-day scenario, then why not? I think that 'robustness' is the word here, not 'optimization'.
>
> I fail to see how it is a robustness issue. You've not made that case. You want the I/O to fail fast so you can give another disk a shot sooner. That's optimization.

Then you can call protection against denial of service an optimization too. You want to do things faster, right?
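Coming back to the mirror example quoted above, here is a rough sketch of the pattern I have in mind. This is not the actual vdev_geom change from D13224; mirror_read_member() and the flag value are invented for illustration, and only the control flow matters: read with BIO_NORETRY while other copies remain, and fall back to the normal, fully retried read only for the last copy.

#include <sys/types.h>
#include <stddef.h>
#include <errno.h>

#define BIO_NORETRY 0x01    /* placeholder for the flag proposed in D13224 */

/* Hypothetical helper: a single read from one mirror member, returns 0 or errno. */
int mirror_read_member(int member, off_t offset, size_t length, void *buf,
    int bio_flags);

int
mirror_read(int nmembers, off_t offset, size_t length, void *buf)
{
    int error = EIO;

    for (int m = 0; m < nmembers; m++) {
        /*
         * While other copies remain, ask for a quick answer: if this
         * member cannot return the block without retrying, fail fast
         * so that the next copy can be tried sooner.
         */
        int flags = (m < nmembers - 1) ? BIO_NORETRY : 0;

        error = mirror_read_member(m, offset, length, buf, flags);
        if (error == 0)
            return (0);
        /* The last member is read with the normal, fully retried path. */
    }
    return (error);
}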
> > > Second, you're really saying 'if you can't read it fast, fail', since we only control the software side of read retry.
> >
> > Am I? That's not what I wanted to say, really. I just wanted to say: if this I/O fails, don't retry it, leave it to me. This is very simple, simplistic as you say, but I like simple.
>
> Right. Simple doesn't make it right. In fact, simple often makes it wrong.

I agree. The same applies to complex as well. Let's stop at this.

> We have big issues with the nvd device today because it mindlessly queues all the trim requests to the NVMe device w/o collapsing them, resulting in horrible performance.
>
> > > There are new op codes being proposed that say 'read or fail within X ms', which is really what you want: if it's taking too long on disk A you want to move to disk B. The notion here was we'd return EAGAIN (or some other error) if it failed after X ms, and maybe do some emulation in software for drives that don't support this. You'd tweak this number to control performance. You're likely to get a much bigger performance win all the time by scheduling I/O to drives that have the best recent latency.
> >
> > ZFS already does some latency-based decisions. The things that you describe are very interesting, but they are for the future.
> >
> > > Third, do you have numbers that show this is actually a win?
> >
> > I do not have any numbers right now. What kind of numbers would you like? What kind of scenarios?
>
> The usual kind. How is latency for I/O improved when you have a disk with a few failing sectors that take a long time to read (which isn't a given: some sectors fail fast)?

Today I gave an example of how four retries added about 9 seconds of additional delay. I think that that is significant.
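(Illustrative arithmetic only, since the actual measurements are in the other mail: if the drive spends roughly 2.2 seconds before giving up on an unreadable sector, the first attempt reports the error after about 2.2 seconds, while four additional retries delay the EIO to roughly 11 seconds; the retries alone account for those ~9 seconds before the other copy can even be tried.)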
> What happens when you have a failed disk? etc. How does this compare with the current system?

I haven't done such an experiment. I guess it depends on how exactly the disk fails. There is a big difference between a disk dropping a link and a disk turning into a black hole.

> Basically, how do you know this will really make things better and isn't some kind of 'feel good' thing about 'doing something clever' about the problem that may actually make things worse?
>
> > > This is a terrible thing from an architectural view.
> >
> > You have said this several times, but unfortunately you haven't explained it yet.
>
> I have explained it. You weren't listening.

This is the first time I see the below list or anything like it.

> 1. It breaks the EIO contract that's currently in place.

This needs further explanation.

> 2. It presumes to know what kind of retries should be done at the upper layers, where today we have a system that's more black and white.

I don't understand this argument. If your upper-level code does not know how to do retries, then it should not concern itself with that and should not use the flag.

> You don't know the same info the low layers have to know whether to try another drive, or just retry this one.

Eh? Either we have different definitions of upper and lower layers, or I don't understand how lower layers (e.g. CAM) can know about another drive.

> 3. It assumes that retries are the source of latency in the system. They aren't necessarily.

I am not assuming that at all for the general case.

> 4. It assumes retries are necessarily slow: they may be, they might not be. It all depends on the drive (on SSDs repeated I/O is often faster than the actual I/O).

Of course. But X plus epsilon is always greater than X. And we know that in many practical cases epsilon can be rather large.

> 5. It's just one bit when you really need more complex nuances to get good QoE out of the I/O system. Retries are an incidental detail that's not that important, while latency is what you care most about minimizing. You wouldn't care if I tried to read the data 20 times if it got the result faster than going to a different drive.

That's a good point. But then again, it's the upper layers that have a better chance of predicting this kind of thing. That is, if I know that my backup storage is extremely slow, then I will allow the fast primary storage to do all the retries it wants to do. It's not CAM nor scsi_da nor a specific SIM that can make those decisions. It's the issuer of the I/O request [or an intermediate geom that encapsulates that knowledge and effectively acts as an issuer of I/Os to the lower geoms].

> 6. It's putting the wrong kind of specific hints into the mix.

This needs further explanation.

> > > Absent numbers that show it's a big win, I'm very hesitant to say OK.
> > >
> > > Fourth, there's a large number of places in the stack today that need to communicate that their I/O is more urgent, and we don't have any good way to communicate even that simple concept down the stack.
> >
> > That's unfortunate, but my proposal has quite little to do with I/O scheduling, priorities, etc.
>
> Except it does. It dictates error recovery policy, which is I/O scheduling.
>
> > > Finally, the only places that ZFS uses the TRYHARDER flag are for things like the super block, if I'm reading the code right. It doesn't do it for normal I/O.
> >
> > Right. But for normal I/O there is ZIO_FLAG_IO_RETRY, which is honored in the same way as ZIO_FLAG_TRYHARD.
> >
> > > There's no code to cope with what would happen if all the copies of a block couldn't be read with the NORETRY flag. One of them might contain the data.
> >
> > ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above.
>
> Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue when an I/O fails'. It doesn't control retries at all in Solaris. It's a different concept entirely, and one badly thought out.

I think that it does control retries. And it does even more: my understanding is that bio-s with B_FAILFAST can be failed immediately in a situation roughly equivalent to a CAM devq (or simq) being frozen.
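To illustrate what I mean, here is a schematic sketch of my reading of the B_FAILFAST semantics. It is not actual Solaris/illumos code; the softc layout and helper names are invented. The idea is that once the driver has decided the device is misbehaving, which is roughly the situation in which CAM would freeze the devq or simq, buffers marked B_FAILFAST are completed with EIO immediately instead of being queued and retried.

/*
 * Schematic only, not actual illumos code.  failfast_active would be set
 * by the driver after repeated command timeouts/errors, i.e. roughly the
 * state in which CAM would freeze the devq/simq.
 */
#include <sys/types.h>
#include <sys/buf.h>
#include <sys/errno.h>

struct sketch_softc {
    int failfast_active;
};

static void sketch_enqueue(struct sketch_softc *sc, struct buf *bp);

static void
sketch_strategy(struct sketch_softc *sc, struct buf *bp)
{
    if ((bp->b_flags & B_FAILFAST) != 0 && sc->failfast_active) {
        /* Fail immediately instead of queueing and retrying. */
        bioerror(bp, EIO);
        biodone(bp);
        return;
    }
    /* Normal path: queue the request; the usual retry logic applies. */
    sketch_enqueue(sc, bp);
}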
--
Andriy Gapon