Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Warner Losh
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org>
From: Andriy Gapon
Message-ID: <9f23f97d-3614-e4d2-62fe-99723c5e3879@FreeBSD.org>
Date: Sat, 25 Nov 2017 19:40:59 +0200

Before anything else, I would like to say that I got the impression that we
speak from such different angles that we either don't understand each other's
words or, even worse, misinterpret them.
On 25/11/2017 18:36, Warner Losh wrote:
> On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon wrote:
> > On 24/11/2017 18:33, Warner Losh wrote:
> > > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon wrote:
> > > > On 24/11/2017 15:08, Warner Losh wrote:
> > > > > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon wrote:
> > > > > >
> > > > > > https://reviews.freebsd.org/D13224
> > > > > >
> > > > > > Anyone interested is welcome to join the review.
> > > > >
> > > > > I think it's a really bad idea.  It introduces a 'one-size-fits-all'
> > > > > notion of QoS that seems misguided.  It conflates a shorter timeout
> > > > > with don't retry.  And why is retrying bad?  It seems more a notion
> > > > > of 'fail fast' or some other concept.  There are so many other ways
> > > > > you'd want to use it.  And it uses the same return code (EIO) to
> > > > > mean something new.  It has generally meant 'the lower layers have
> > > > > retried this, and it failed, do not submit it again as it will not
> > > > > succeed', not 'I gave it a half-assed attempt, and that failed, but
> > > > > resubmission might work'.  This breaks a number of assumptions in
> > > > > the BUF/BIO layer as well as parts of CAM even more than they are
> > > > > broken now.
> > > > >
> > > > > So let's step back a bit: what problem is it trying to solve?
> > > >
> > > > A simple example.  I have a mirror, I issue a read to one of its
> > > > members.  Let's assume there is some trouble with that particular
> > > > block on that particular disk.  The disk may spend a lot of time
> > > > trying to read it and would still fail.  With the current defaults I
> > > > would wait 5x that time to finally get the error back.  Then I go to
> > > > another mirror member and get my data from there.
> > > > IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first
> > > > read, get the error back sooner and try the other disk sooner.  Only
> > > > if I know that there are no other copies to try would I use the
> > > > normal read with all the retrying.
> > >
> > > It sounds like you are optimizing the wrong thing and taking an overly
> > > simplistic view of quality of service.
> > > First, failing blocks on a disk are fairly rare.  Do you really want
> > > to optimize for that case?
> >
> > If it can be done without any harm to the sunny day scenario, then why
> > not?
> > I think that 'robustness' is the word here, not 'optimization'.
>
> I fail to see how it is a robustness issue.  You've not made that case.
> You want the I/O to fail fast so you can give another disk a shot sooner.
> That's optimization.

Then you can call protection against denial-of-service an optimization too.
You want to do things faster, right?

> > > Second, you're really saying 'if you can't read it fast, fail', since
> > > we only control the software side of read retry.
> >
> > Am I?
> > That's not what I wanted to say, really.  I just wanted to say: if this
> > I/O fails, don't retry it, leave it to me.
> > This is very simple, simplistic as you say, but I like simple.
>
> Right.  Simple doesn't make it right.  In fact, simple often makes it
> wrong.

I agree.  The same applies to complex as well.  Let's stop at this.

> We have big issues with the nvd device today because it mindlessly queues
> all the trim requests to the NVMe device w/o collapsing them, resulting
> in horrible performance.
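Coming back to the mirror example above, here is a rough sketch of the
control flow I have in mind, just to make the intent concrete.  It is only
an illustration: member_t, read_member() and RF_FAILFAST are made-up
stand-ins for the real GEOM consumer, struct bio and the proposed
BIO_NORETRY bit, and this is a userland-style sketch rather than anything
from the review.

/*
 * Sketch of "fail fast on every copy except the last one".  All names
 * here (member_t, read_member(), RF_FAILFAST) are hypothetical; only
 * the shape of the logic matters.
 */
#include <sys/types.h>
#include <errno.h>
#include <stddef.h>

typedef struct member member_t;         /* one mirror member */

#define RF_FAILFAST     0x01            /* "do not retry, fail quickly" */

/* Hypothetical helper: a single read from a single member. */
int     read_member(member_t *m, off_t off, void *buf, size_t len, int flags);

int
mirror_read(member_t **members, int nmembers, off_t off, void *buf,
    size_t len)
{
        int error = EIO;

        for (int i = 0; i < nmembers; i++) {
                /*
                 * Every copy except the last one is asked to fail fast:
                 * if the drive is heading into lengthy error recovery,
                 * we would rather try the next copy.  The last copy gets
                 * the normal read with full retrying, because there is
                 * nothing left to fall back to.
                 */
                int flags = (i == nmembers - 1) ? 0 : RF_FAILFAST;

                error = read_member(members[i], off, buf, len, flags);
                if (error == 0)
                        return (0);
        }
        return (error);
}

The point is simply that only the issuer knows whether another copy exists
at all.  That knowledge cannot live in CAM or in the disk driver, so all
the lower layer is asked to do is skip its own retries when told so.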
> > > There are new op codes being proposed that say 'read or fail within
> > > X ms', which is really what you want: if it's taking too long on disk
> > > A you want to move to disk B.  The notion here was we'd return EAGAIN
> > > (or some other error) if it failed after X ms, and maybe do some
> > > emulation in software for drives that don't support this.  You'd
> > > tweak this number to control performance.  You're likely to get a
> > > much bigger performance win all the time by scheduling I/O to drives
> > > that have the best recent latency.
> >
> > ZFS already does some latency-based decisions.
> > The things that you describe are very interesting, but they are for
> > the future.
> >
> > > Third, do you have numbers that show this is actually a win?
> >
> > I do not have any numbers right now.
> > What kind of numbers would you like?  What kind of scenarios?
>
> The usual kind.  How is latency for I/O improved when you have a disk
> with a few failing sectors that take a long time to read (which isn't a
> given: some sectors fail fast)?

Today I gave an example of how four retries added about 9 seconds of
additional delay.  I think that that is significant.

> What happens when you have a failed disk, etc.?  How does this compare
> with the current system?

I haven't done such an experiment.  I guess it depends on how exactly the
disk fails.  There is a big difference between a disk dropping a link and
a disk turning into a black hole.

> Basically, how do you know this will really make things better and isn't
> some kind of 'feel good' thing about 'doing something clever' about the
> problem that may actually make things worse?
>
> > > This is a terrible thing from an architectural view.
> >
> > You have said this several times, but unfortunately you haven't
> > explained it yet.
>
> I have explained it.  You weren't listening.

This is the first time I see the below list or anything like it.

> 1. It breaks the EIO contract that's currently in place.

This needs further explanation.

> 2. It presumes to know what kind of retries should be done at the upper
> layers, where today we have a system that's more black and white.

I don't understand this argument.  If your upper-level code does not know
how to do retries, then it should not concern itself with that and should
not use the flag.

> You don't know the same info the low layers have to know whether to try
> another drive, or just retry this one.

Eh?  Either we have different definitions of upper and lower layers, or I
don't understand how lower layers (e.g. CAM) can know about another drive.

> 3. It assumes that retries are the source of latency in the system.
> They aren't necessarily.

I am not assuming that at all for the general case.

> 4. It assumes retries are necessarily slow: they may be, they might not
> be.  It all depends on the drive (repeated I/Os on SSDs are often faster
> than the actual I/O).

Of course.  But X plus epsilon is always greater than X.  And we know that
in many practical cases epsilon can be rather large.

> 5. It's just one bit when you really need more complex nuances to get
> good QoE out of the I/O system.  Retries are an incidental detail that's
> not that important, while latency is what you care most about
> minimizing.  You wouldn't care if I tried to read the data 20 times if
> it got the result faster than going to a different drive.

That's a good point.  But then again, it's the upper layers that have a
better chance of predicting this kind of thing.  That is, if I know that
my backup storage is extremely slow, then I will allow the fast primary
storage to do all the retries it wants to do.  It's not CAM, nor scsi_da,
nor a specific SIM that can make those decisions.  It's the issuer of the
I/O request [or an intermediate geom that encapsulates that knowledge and
effectively acts as an issuer of I/Os to the lower geoms].
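To illustrate what I mean by the issuer encapsulating that knowledge, here
is another rough sketch, reusing the made-up member_t / read_member() /
RF_FAILFAST names from the earlier one; is_much_slower() stands in for
whatever latency knowledge the issuer happens to have about its fallback
copy.  Again, none of this is code from the review, just the shape of the
decision.

/*
 * Hypothetical: is falling back to 'a' much more costly than letting
 * 'b' run through its own retries?
 */
int     is_much_slower(member_t *a, member_t *b);

static int
policy_read(member_t *fast, member_t *backup, off_t off, void *buf,
    size_t len)
{
        int error, flags;

        /*
         * Ask the fast copy to fail quickly only if falling back to the
         * other copy is actually cheaper than letting the fast copy do
         * its full retry dance.  If the backup is extremely slow, or
         * there is no backup at all, let the primary retry all it wants.
         */
        if (backup != NULL && !is_much_slower(backup, fast))
                flags = RF_FAILFAST;
        else
                flags = 0;

        error = read_member(fast, off, buf, len, flags);
        if (error != 0 && backup != NULL)
                error = read_member(backup, off, buf, len, 0);
        return (error);
}

The comparison can be as crude or as clever as the issuer likes; the only
thing asked of the lower layers is to honor one bit and not retry when it
is set.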
> 6. It's putting the wrong kind of specific hints into the mix.

This needs further explanation.

> > > Absent numbers that show it's a big win, I'm very hesitant to say OK.
> > >
> > > Fourth, there are a large number of places in the stack today that
> > > need to communicate that their I/O is more urgent, and we don't have
> > > any good way to communicate even that simple concept down the stack.
> >
> > That's unfortunate, but my proposal has quite little to do with I/O
> > scheduling, priorities, etc.
>
> Except it does.  It dictates error recovery policy, which is I/O
> scheduling.
>
> > > Finally, the only places that ZFS uses the TRYHARDER flag are for
> > > things like the super block, if I'm reading the code right.  It
> > > doesn't do it for normal I/O.
> >
> > Right.  But for normal I/O there is ZIO_FLAG_IO_RETRY, which is honored
> > in the same way as ZIO_FLAG_TRYHARD.
> >
> > > There's no code to cope with what would happen if all the copies of a
> > > block couldn't be read with the NORETRY flag.  One of them might
> > > contain the data.
> >
> > ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above.
>
> Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue
> when an I/O fails'.  It doesn't control retries at all in Solaris.  It's
> a different concept entirely, and one badly thought out.

I think that it does control retries.  And it does even more.  My
understanding is that bio-s with B_FAILFAST can be failed immediately in a
situation roughly equivalent to a CAM devq (or simq) being frozen.

-- 
Andriy Gapon