From: Warner Losh <wlosh@bsdimp.com>
Date: Fri, 24 Nov 2017 09:33:56 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: FreeBSD FS <freebsd-fs@freebsd.org>, freebsd-geom@freebsd.org, Scott Long

On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon wrote:

> On 24/11/2017 15:08, Warner Losh wrote:
> > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon wrote:
> > >
> > > https://reviews.freebsd.org/D13224
> > >
> > > Anyone interested is welcome to join the review.
> >
> > I think it's a really bad idea. It introduces a 'one-size-fits-all'
> > notion of QoS that seems misguided. It conflates a shorter timeout with
> > "don't retry." And why is retrying bad? It seems more a notion of 'fail
> > fast' or some other concept. There are so many other ways you'd want to
> > use it.
> > And it uses the same return code (EIO) to mean something new. EIO has
> > generally meant 'the lower layers have retried this, and it failed; do
> > not submit it again, as it will not succeed', not 'I gave it a
> > half-assed attempt, and that failed, but resubmission might work'. This
> > breaks a number of assumptions in the BUF/BIO layer, as well as parts of
> > CAM, even more than they are broken now.
> >
> > So let's step back a bit: what problem is it trying to solve?
>
> A simple example. I have a mirror, and I issue a read to one of its
> members. Let's assume there is some trouble with that particular block on
> that particular disk. The disk may spend a lot of time trying to read it
> and would still fail. With the current defaults I would wait 5x that time
> to finally get the error back. Then I go to another mirror member and get
> my data from there. IMO, this is not optimal. I'd rather pass BIO_NORETRY
> to the first read, get the error back sooner, and try the other disk
> sooner. Only if I knew that there were no other copies to try would I use
> the normal read with all the retrying.

It sounds like you are optimizing the wrong thing and taking an overly
simplistic view of quality of service.

First, failing blocks on a disk are fairly rare. Do you really want to
optimize for that case?

Second, you're really saying "if you can't read it fast, fail", since we
only control the software side of read retry. There are new op codes being
proposed that say "read or fail within X ms", which is really what you
want: if it's taking too long on disk A, you want to move to disk B. The
notion here was that we'd return EAGAIN (or some other error) if the read
failed after X ms, and maybe do some emulation in software for drives that
don't support this. You'd tweak this number to control performance. You're
likely to get a much bigger performance win all the time by scheduling I/O
to the drives that have the best recent latency.
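That latency-steering idea could be sketched roughly like this: each mirror
member keeps an exponentially weighted moving average (EWMA) of its recent
read latencies, and new reads go to the member with the lowest average. All
of the names below are invented for illustration; this is a userspace
sketch, not FreeBSD or ZFS code.

```c
/*
 * Sketch: steer reads to the mirror member with the best recent latency.
 * Hypothetical types and functions, for illustration only.
 */
struct member_stats {
	unsigned long ewma_latency_us;	/* smoothed recent read latency */
};

/* Fold one completed read's latency into the running average (1/8 weight). */
static void
member_record_latency(struct member_stats *m, unsigned long sample_us)
{
	if (m->ewma_latency_us == 0)
		m->ewma_latency_us = sample_us;	/* first sample seeds the average */
	else
		m->ewma_latency_us =
		    (m->ewma_latency_us * 7 + sample_us) / 8;
}

/* Pick the member with the lowest smoothed latency for the next read. */
static int
member_pick(const struct member_stats *members, int n)
{
	int i, best = 0;

	for (i = 1; i < n; i++)
		if (members[i].ewma_latency_us <
		    members[best].ewma_latency_us)
			best = i;
	return (best);
}
```

A slow or struggling drive then naturally sheds read traffic as its average
climbs, without any per-request flag in the I/O path.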
Third, do you have numbers that show this is actually a win? This is a
terrible thing from an architectural view. Absent numbers that show it's a
big win, I'm very hesitant to say OK.

Fourth, there are a large number of places in the stack today that need to
communicate that their I/O is more urgent, and we don't have any good way
to communicate even that simple concept down the stack.

Finally, if I'm reading the code right, the only places ZFS uses the
TRYHARDER flag are for things like the superblock. It doesn't do it for
normal I/O. There's no code to cope with what would happen if all the
copies of a block couldn't be read with the NORETRY flag, even though one
of them might contain the data.

Warner
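The two-phase read Andriy describes could be sketched like this: a "fail
fast" pass over the mirror members first, and only if every fast attempt
fails, a second pass with full driver-level retries. The read_fast/read_retry
callbacks and the simulated drives are invented stand-ins for a driver
honoring (or ignoring) a BIO_NORETRY-style hint; this is not the actual
GEOM/CAM interface.

```c
#include <string.h>

/* Hypothetical callback: 0 on success, -1 on failure. */
typedef int (*read_fn)(int member, char *buf);

/*
 * Two-phase mirror read: try each copy without retries first, then
 * fall back to full retry/recovery reads only as a last resort.
 */
static int
mirror_read(read_fn read_fast, read_fn read_retry, int nmembers, char *buf)
{
	int m;

	/* Phase 1: fast attempts, no driver-level retries. */
	for (m = 0; m < nmembers; m++)
		if (read_fast(m, buf) == 0)
			return (0);
	/* Phase 2: every copy failed fast; now retry hard on each. */
	for (m = 0; m < nmembers; m++)
		if (read_retry(m, buf) == 0)
			return (0);
	return (-1);	/* all copies failed even with retries */
}

/* Simulated drives: member 0 has a bad block, member 1 is healthy. */
static int
fast_read(int member, char *buf)
{
	if (member == 0)
		return (-1);	/* fails quickly, no retries attempted */
	strcpy(buf, "data");
	return (0);
}

static int
retry_read(int member, char *buf)
{
	/* Pretend even long retries can't recover member 0. */
	return (fast_read(member, buf));
}
```

Note this sketch sidesteps Warner's objection only partially: phase 2 exists
precisely because a copy that fails the fast pass might still be readable
with full retries.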