From: Warner Losh <wlosh@bsdimp.com>
Date: Sat, 25 Nov 2017 09:36:29 -0700
To: Andriy Gapon
Cc: FreeBSD FS <freebsd-fs@freebsd.org>, freebsd-geom@freebsd.org, Scott Long
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon wrote:
> On 24/11/2017 18:33, Warner Losh wrote:
> > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon wrote:
> > > On 24/11/2017 15:08, Warner Losh wrote:
> > > > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon wrote:
> > > > > https://reviews.freebsd.org/D13224
> > > > >
> > > > > Anyone interested is welcome to join the review.
> > > >
> > > > I think it's a really bad idea. It introduces a 'one-size-fits-all' notion of QoS that seems misguided. It conflates a shorter timeout with 'don't retry'. And why is retrying bad? It seems more a notion of 'fail fast' or some other concept. There are so many other ways you'd want to use it. And it uses the same return code (EIO) to mean something new: it has generally meant 'the lower layers have retried this, and it failed; do not submit it again as it will not succeed', and this conflates that with 'I gave it a half-assed attempt, and that failed, but resubmission might work'. This breaks a number of assumptions in the BUF/BIO layer as well as parts of CAM, even more than they are broken now.
> > > >
> > > > So let's step back a bit: what problem is it trying to solve?
> > >
> > > A simple example. I have a mirror; I issue a read to one of its members. Let's assume there is some trouble with that particular block on that particular disk. The disk may spend a lot of time trying to read it and would still fail. With the current defaults I would wait 5x that time to finally get the error back. Then I go to another mirror member and get my data from there. IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read, get the error back sooner, and try the other disk sooner. Only if I know that there are no other copies to try would I use the normal read with all the retrying.
> >
> > It sounds like you are optimizing the wrong thing and taking an overly simplistic view of quality of service.
> >
> > First, failing blocks on a disk are fairly rare. Do you really want to optimize for that case?
>
> If it can be done without any harm to the sunny-day scenario, then why not? I think that 'robustness' is the word here, not 'optimization'.

I fail to see how it is a robustness issue. You've not made that case. You want the I/O to fail fast so you can give another disk a shot sooner. That's optimization.

> > Second, you're really saying 'if you can't read it fast, fail', since we only control the software side of read retry.
>
> Am I? That's not what I wanted to say, really. I just wanted to say: if this I/O fails, don't retry it, leave it to me. This is very simple, simplistic as you say, but I like simple.

Right. Simple doesn't make it right. In fact, simple often makes it wrong. We have big issues with the nvd device today because it mindlessly queues all the TRIM requests to the NVMe device without collapsing them, resulting in horrible performance.

> > There are new op codes being proposed that say 'read or fail within X ms', which is really what you want: if it's taking too long on disk A, you want to move to disk B. The notion here was that we'd return EAGAIN (or some other error) if it failed after X ms, and maybe do some emulation in software for drives that don't support this. You'd tweak this number to control performance. You're likely to get a much bigger performance win all the time by scheduling I/O to drives that have the best recent latency.
>
> ZFS already makes some latency-based decisions. The things that you describe are very interesting, but they are for the future.
>
> > Third, do you have numbers that show this is actually a win?
>
> I do not have any numbers right now.
> What kind of numbers would you like? What kind of scenarios?

The usual kind. How is latency for I/O improved when you have a disk with a few failing sectors that take a long time to read (which isn't a given: some sectors fail fast)? What happens when you have a failed disk? Etc. How does this compare with the current system? Basically, how do you know this will really make things better and isn't some kind of 'feel good' thing about 'doing something clever' about the problem that may actually make things worse?

> > This is a terrible thing from an architectural view.
>
> You have said this several times, but unfortunately you haven't explained it yet.

I have explained it. You weren't listening.

1. It breaks the EIO contract that's currently in place.
2. It presumes to know what kind of retries should be done at the upper layers, where today we have a system that's more black and white. You don't know the same info the lower layers have, so you can't tell whether to try another drive or just retry this one.
3. It assumes that retries are the source of latency in the system. They aren't necessarily.
4. It assumes retries are necessarily slow. They may be, they might not be; it all depends on the drive (on SSDs, repeated I/O is often faster than the original I/O).
5. It's just one bit when you really need more complex nuance to get good QoE out of the I/O system. Retries are an incidental detail that's not that important, while latency is what you care most about minimizing. You wouldn't care if I tried to read the data 20 times if that got the result faster than going to a different drive.
6. It's putting the wrong kind of specific hints into the mix.

> > Absent numbers that show it's a big win, I'm very hesitant to say OK.
> >
> > Fourth, there are a large number of places in the stack today that need to communicate that their I/O is more urgent, and we don't have any good way to communicate even that simple concept down the stack.
>
> That's unfortunate, but my proposal has quite little to do with I/O scheduling, priorities, etc.

Except it does. It dictates error recovery policy, which is I/O scheduling.

> > Finally, the only places that ZFS uses the TRYHARDER flag are for things like the super block, if I'm reading the code right. It doesn't do it for normal I/O.
>
> Right. But for normal I/O there is ZIO_FLAG_IO_RETRY, which is honored in the same way as ZIO_FLAG_TRYHARD.
>
> > There's no code to cope with what would happen if all the copies of a block couldn't be read with the NORETRY flag. One of them might contain the data.
>
> ZFS is not that fragile :) See ZIO_FLAG_IO_RETRY above.

Except TRYHARD in ZFS means 'don't fail *OTHER* I/O in the queue when an I/O fails'. It doesn't control retries at all in Solaris. It's a different concept entirely, and one badly thought out.

Warner
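
To make the two approaches being argued over concrete, here is a minimal user-space sketch of the read pattern Andriy describes: give every mirror member a fail-fast attempt first, and fall back to fully retried reads only if no copy can be produced quickly. The helpers read_fail_fast() and read_with_retries() are hypothetical stand-ins for a read issued with the proposed BIO_NORETRY flag and a normal read; none of this is actual FreeBSD, GEOM, or ZFS code, it only simulates the policy.

#include <errno.h>
#include <stdio.h>
#include <string.h>

#define NMEMBERS 2
#define BLKSZ    64

/*
 * Hypothetical per-member read primitives.  In the proposal under review
 * the fail-fast variant would be a read issued with BIO_NORETRY set and
 * the other one an ordinary read; here they are only simulated.
 */
static int
read_fail_fast(int member, char *buf)
{
    /* Pretend member 0 has a bad block that fails on a single attempt. */
    if (member == 0)
        return (EIO);
    snprintf(buf, BLKSZ, "data from member %d (fast path)", member);
    return (0);
}

static int
read_with_retries(int member, char *buf)
{
    /* Pretend the drive's own retries eventually recover the data. */
    snprintf(buf, BLKSZ, "data from member %d (slow path)", member);
    return (0);
}

/*
 * Read one logical block from the mirror: fail-fast attempts against
 * every member first, then the slow, fully retried reads as a last resort.
 */
static int
mirror_read(char *buf)
{
    int error, m;

    for (m = 0; m < NMEMBERS; m++) {
        error = read_fail_fast(m, buf);
        if (error == 0)
            return (0);
        printf("member %d: fast read failed (%s), trying next copy\n",
            m, strerror(error));
    }
    for (m = 0; m < NMEMBERS; m++) {
        if (read_with_retries(m, buf) == 0)
            return (0);
    }
    return (EIO);
}

int
main(void)
{
    char buf[BLKSZ];

    if (mirror_read(buf) == 0)
        printf("read ok: %s\n", buf);
    else
        printf("read failed on all copies\n");
    return (0);
}

Whether the fast-path failure may reuse EIO, or needs a distinct code along the lines of the EAGAIN-style 'read or fail within X ms' behaviour mentioned above, is exactly the contract question the thread is arguing about.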
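
The alternative of steering reads by recent latency can be sketched just as briefly: keep a moving average of observed read latency per mirror member and route each read to the member that has been fastest lately. The member count, the seeded latencies, and the smoothing factor below are arbitrary illustration values, not anything taken from ZFS or GEOM.

#include <stdio.h>

#define NMEMBERS 2

/* Exponentially weighted moving average of recent read latency, in microseconds. */
static double ewma_us[NMEMBERS] = {
    900.0,  /* member 0: recently slow, e.g. long internal retries on a weak sector */
    150.0,  /* member 1: recently fast */
};

/* Fold a newly observed latency into a member's moving average. */
static void
record_latency(int member, double observed_us)
{
    const double alpha = 0.2;   /* arbitrary smoothing factor */

    ewma_us[member] = alpha * observed_us + (1.0 - alpha) * ewma_us[member];
}

/* Route the next read to whichever member has the lowest recent latency. */
static int
pick_member(void)
{
    int best = 0, m;

    for (m = 1; m < NMEMBERS; m++)
        if (ewma_us[m] < ewma_us[best])
            best = m;
    return (best);
}

int
main(void)
{
    int m = pick_member();

    printf("issuing read to member %d (ewma %.0f us)\n", m, ewma_us[m]);
    /* When the read completes, feed the measured latency back in. */
    record_latency(m, 180.0);
    printf("member %d ewma is now %.0f us\n", m, ewma_us[m]);
    return (0);
}

A scheme like this quietly penalizes a member that is spending a long time on internal retries without needing any new flag in the I/O path, which is the thrust of the objection to the one-bit BIO_NORETRY hint.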