Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Warner Losh
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org>
From: Andriy Gapon
Message-ID: <9f23f97d-3614-e4d2-62fe-99723c5e3879@FreeBSD.org>
Date: Sat, 25 Nov 2017 19:40:59 +0200

Before anything else, I would like to say that I got the impression that we
speak from such different angles that we either don't understand each other's
words or, even worse, misinterpret them.
On 25/11/2017 18:36, Warner Losh wrote:
> On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon wrote:
> > On 24/11/2017 18:33, Warner Losh wrote:
> > > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon wrote:
> > > > On 24/11/2017 15:08, Warner Losh wrote:
> > > > > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon wrote:
> > > > > >
> > > > > > https://reviews.freebsd.org/D13224
> > > > > >
> > > > > > Anyone interested is welcome to join the review.
> > > > >
> > > > > I think it's a really bad idea.  It introduces a 'one-size-fits-all'
> > > > > notion of QoS that seems misguided.  It conflates a shorter timeout
> > > > > with don't retry.  And why is retrying bad?  It seems more a notion
> > > > > of 'fail fast' or some other concept.  There are so many other ways
> > > > > you'd want to use it.  And it uses the same return code (EIO) to
> > > > > mean something new.  It has generally meant 'the lower layers have
> > > > > retried this, and it failed, do not submit it again as it will not
> > > > > succeed', not 'I gave it a half-assed attempt, and that failed, but
> > > > > resubmission might work'.  This breaks a number of assumptions in
> > > > > the BUF/BIO layer as well as parts of CAM even more than they are
> > > > > broken now.
> > > > >
> > > > > So let's step back a bit: what problem is it trying to solve?
> > > >
> > > > A simple example.  I have a mirror, I issue a read to one of its
> > > > members.  Let's assume there is some trouble with that particular
> > > > block on that particular disk.  The disk may spend a lot of time
> > > > trying to read it and would still fail.  With the current defaults I
> > > > would wait 5x that time to finally get the error back.  Then I go to
> > > > another mirror member and get my data from there.
> > > > IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first
> > > > read, get the error back sooner and try the other disk sooner.  Only
> > > > if I know that there are no other copies to try would I use the
> > > > normal read with all the retrying.
> > >
> > > It sounds like you are optimizing the wrong thing and taking an overly
> > > simplistic view of quality of service.
> > > First, failing blocks on a disk are fairly rare.  Do you really want
> > > to optimize for that case?
> >
> > If it can be done without any harm to the sunny day scenario, then why
> > not?
> > I think that 'robustness' is the word here, not 'optimization'.
>
> I fail to see how it is a robustness issue.  You've not made that case.
> You want the I/O to fail fast so you can give another disk a shot sooner.
> That's optimization.

Then you can call protection against denial-of-service an optimization too.
You want to do things faster, right?

> > > Second, you're really saying 'if you can't read it fast, fail', since
> > > we only control the software side of read retry.
> >
> > Am I?
> > That's not what I wanted to say, really.  I just wanted to say: if this
> > I/O fails, don't retry it, leave it to me.
> > This is very simple, simplistic as you say, but I like simple.
>
> Right.  Simple doesn't make it right.  In fact, simple often makes it
> wrong.

I agree.  The same applies to complex as well.  Let's stop at this.

> We have big issues with the nvd device today because it mindlessly queues
> all the trim requests to the NVMe device w/o collapsing them, resulting
> in horrible performance.
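Coming back to the mirror example above, here is a rough sketch of the
control flow I have in mind, just to make the intent concrete.  It is only
an illustration: member_t, read_member() and RF_FAILFAST are made-up
stand-ins for the real GEOM consumer, struct bio and the proposed
BIO_NORETRY bit, and this is a userland-style sketch rather than anything
from the review.

/*
 * Sketch of "fail fast on every copy except the last one".  All names
 * here (member_t, read_member(), RF_FAILFAST) are hypothetical; only
 * the shape of the logic matters.
 */
#include <sys/types.h>
#include <errno.h>
#include <stddef.h>

typedef struct member member_t;         /* one mirror member */

#define RF_FAILFAST     0x01            /* "do not retry, fail quickly" */

/* Hypothetical helper: a single read from a single member. */
int     read_member(member_t *m, off_t off, void *buf, size_t len, int flags);

int
mirror_read(member_t **members, int nmembers, off_t off, void *buf,
    size_t len)
{
        int error = EIO;

        for (int i = 0; i < nmembers; i++) {
                /*
                 * Every copy except the last one is asked to fail fast:
                 * if the drive is heading into lengthy error recovery,
                 * we would rather try the next copy.  The last copy gets
                 * the normal read with full retrying, because there is
                 * nothing left to fall back to.
                 */
                int flags = (i == nmembers - 1) ? 0 : RF_FAILFAST;

                error = read_member(members[i], off, buf, len, flags);
                if (error == 0)
                        return (0);
        }
        return (error);
}

The point is simply that only the issuer knows whether another copy exists
at all.  That knowledge cannot live in CAM or in the disk driver, so all
the lower layer is asked to do is skip its own retries when told so.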
> > > There are new op codes being proposed that say 'read or fail within
> > > X ms', which is really what you want: if it's taking too long on disk
> > > A you want to move to disk B.  The notion here was we'd return EAGAIN
> > > (or some other error) if it failed after X ms, and maybe do some
> > > emulation in software for drives that don't support this.  You'd
> > > tweak this number to control performance.  You're likely to get a
> > > much bigger performance win all the time by scheduling I/O to drives
> > > that have the best recent latency.
> >
> > ZFS already does some latency-based decisions.
> > The things that you describe are very interesting, but they are for
> > the future.
> >
> > > Third, do you have numbers that show this is actually a win?
> >
> > I do not have any numbers right now.
> > What kind of numbers would you like?  What kind of scenarios?
>
> The usual kind.  How is latency for I/O improved when you have a disk
> with a few failing sectors that take a long time to read (which isn't a
> given: some sectors fail fast)?

Today I gave an example of how four retries added about 9 seconds of
additional delay.  I think that that is significant.

> What happens when you have a failed disk, etc.?  How does this compare
> with the current system?

I haven't done such an experiment.  I guess it depends on how exactly the
disk fails.  There is a big difference between a disk dropping a link and
a disk turning into a black hole.

> Basically, how do you know this will really make things better and isn't
> some kind of 'feel good' thing about 'doing something clever' about the
> problem that may actually make things worse?
>
> > > This is a terrible thing from an architectural view.
> >
> > You have said this several times, but unfortunately you haven't
> > explained it yet.
>
> I have explained it.  You weren't listening.

This is the first time I see the below list or anything like it.

> 1. It breaks the EIO contract that's currently in place.

This needs further explanation.

> 2. It presumes to know what kind of retries should be done at the upper
> layers, where today we have a system that's more black and white.

I don't understand this argument.  If your upper-level code does not know
how to do retries, then it should not concern itself with that and should
not use the flag.

> You don't know the same info the low layers have to know whether to try
> another drive, or just retry this one.

Eh?  Either we have different definitions of upper and lower layers, or I
don't understand how lower layers (e.g. CAM) can know about another drive.

> 3. It assumes that retries are the source of latency in the system.
> They aren't necessarily.

I am not assuming that at all for the general case.

> 4. It assumes retries are necessarily slow: they may be, they might not
> be.  It all depends on the drive (repeated I/Os on SSDs are often faster
> than the actual I/O).

Of course.  But X plus epsilon is always greater than X.  And we know that
in many practical cases epsilon can be rather large.

> 5. It's just one bit when you really need more complex nuances to get
> good QoE out of the I/O system.  Retries are an incidental detail that's
> not that important, while latency is what you care most about
> minimizing.  You wouldn't care if I tried to read the data 20 times if
> it got the result faster than going to a different drive.

That's a good point.  But then again, it's the upper layers that have a
better chance of predicting this kind of thing.  That is, if I know that
my backup storage is extremely slow, then I will allow the fast primary
storage to do all the retries it wants to do.  It's not CAM, nor scsi_da,
nor a specific SIM that can make those decisions.  It's the issuer of the
I/O request [or an intermediate geom that encapsulates that knowledge and
effectively acts as an issuer of I/Os to the lower geoms].
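To illustrate what I mean by the issuer encapsulating that knowledge, here
is another rough sketch, reusing the made-up member_t / read_member() /
RF_FAILFAST names from the earlier one; is_much_slower() stands in for
whatever latency knowledge the issuer happens to have about its fallback
copy.  Again, none of this is code from the review, just the shape of the
decision.

/*
 * Hypothetical: is falling back to 'a' much more costly than letting
 * 'b' run through its own retries?
 */
int     is_much_slower(member_t *a, member_t *b);

static int
policy_read(member_t *fast, member_t *backup, off_t off, void *buf,
    size_t len)
{
        int error, flags;

        /*
         * Ask the fast copy to fail quickly only if falling back to the
         * other copy is actually cheaper than letting the fast copy do
         * its full retry dance.  If the backup is extremely slow, or
         * there is no backup at all, let the primary retry all it wants.
         */
        if (backup != NULL && !is_much_slower(backup, fast))
                flags = RF_FAILFAST;
        else
                flags = 0;

        error = read_member(fast, off, buf, len, flags);
        if (error != 0 && backup != NULL)
                error = read_member(backup, off, buf, len, 0);
        return (error);
}

The comparison can be as crude or as clever as the issuer likes; the only
thing asked of the lower layers is to honor one bit and not retry when it
is set.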
> 6. It's putting the wrong kind of specific hints into the mix.

This needs further explanation.

> > > Absent numbers that show it's a big win, I'm very hesitant to say OK.
> > >
> > > Fourth, there are a large number of places in the stack today that
> > > need to communicate that their I/O is more urgent, and we don't have
> > > any good way to communicate even that simple concept down the stack.
> >
> > That's unfortunate, but my proposal has quite little to do with I/O
> > scheduling, priorities, etc.
>
> Except it does.  It dictates error recovery policy, which is I/O
> scheduling.
>
> > > Finally, the only places that ZFS uses the TRYHARDER flag are for
> > > things like the super block, if I'm reading the code right.  It
> > > doesn't do it for normal I/O.
> >
> > Right.  But for normal I/O there is ZIO_FLAG_IO_RETRY, which is honored
> > in the same way as ZIO_FLAG_TRYHARD.
> >
> > > There's no code to cope with what would happen if all the copies of a
> > > block couldn't be read with the NORETRY flag.  One of them might
> > > contain the data.
> >
> > ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above.
>
> Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue
> when an I/O fails'.  It doesn't control retries at all in Solaris.  It's
> a different concept entirely, and one badly thought out.

I think that it does control retries.  And it does even more.  My
understanding is that bio-s with B_FAILFAST can be failed immediately in a
situation roughly equivalent to a CAM devq (or simq) being frozen.

-- 
Andriy Gapon