From owner-freebsd-geom@freebsd.org Sat Nov 25 22:17:51 2017
From: Warner Losh
Date: Sat, 25 Nov 2017 15:17:49 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long

On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon wrote:

> Before anything else, I would like to say that I got an impression that
> we speak from so different angles that we either don't understand each
> other's words or, even worse, misinterpret them.

I understand what you are suggesting. Don't take my disagreement with your proposal as willful misinterpretation. You are proposing something that's a quick hack. Maybe a useful one, but it's still problematic because it has the upper layers telling the lower layers what to do (don't do your retry) rather than what service to provide (I prefer a fast error exit over every effort to recover the data). It also does this by overloading the meaning of EIO, which has real problems that you've not been open to hearing about, I assume because your narrow use case is blinding you to the bigger-picture issues with that route.

However, there's a way forward that I think will address these objections.

First, designate that an I/O which fails because the normal recovery process was short-circuited returns ETIMEDOUT. The I/O stack currently doesn't use this errno at all (it was introduced for the network side of things). It's a general catch-all for an I/O that we complete, at the user's request, before the lower layers have made the maximum effort to recover the data.

Next, don't use a flag. Instead, add a 32-bit field called bio_qos for quality-of-service hints and another 32-bit field, bio_qos_param, for a per-hint parameter. This lets us pass specific quality-of-service desires from the filesystem down to the lower layers. The parameter would be unused in your proposal. BIO_QOS_FAIL_EARLY may be a good name for the value to set it to (for the moment, just use 1). We'd assign other QOS values later for other things, which would let us implement the other sorts of QoS features I've talked about as well.
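To make the shape of this concrete, here's a rough sketch. Only the two fields and BIO_QOS_FAIL_EARLY are what I'm actually proposing; struct bio_sketch, BIO_QOS_NONE and the helper functions are placeholders for illustration, not a patch against struct bio.

/*
 * Sketch only: bio_qos, bio_qos_param and BIO_QOS_FAIL_EARLY do not exist
 * today; names, widths and values here are placeholders.
 */
#include <stdint.h>

#define	BIO_QOS_NONE		0
#define	BIO_QOS_FAIL_EARLY	1	/* prefer a fast error over full recovery */

struct bio_sketch {			/* stand-in for struct bio */
	/* ... existing fields: bio_cmd, bio_offset, bio_length, ... */
	uint32_t	bio_qos;	/* QoS hint from the upper layer */
	uint32_t	bio_qos_param;	/* per-hint parameter; unused for FAIL_EARLY */
};

/* Upper layer (e.g. vdev probe code) asks for a fast error exit. */
static void
mark_fail_early(struct bio_sketch *bp)
{
	bp->bio_qos = BIO_QOS_FAIL_EARLY;
	bp->bio_qos_param = 0;
}

/* Lower layer: on error, decide whether to skip the long recovery path
 * and complete the I/O with ETIMEDOUT instead. */
static int
want_full_recovery(const struct bio_sketch *bp)
{
	return (bp->bio_qos != BIO_QOS_FAIL_EARLY);
}

The point is that the hint expresses what service the upper layer wants; the driver still decides how to honor it for its device.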
As for B_FAILFAST, it's quite unlike what you're proposing, except in one incidental detail. It's a complicated state machine that the sd driver in Solaris implemented; it's an entire protocol. When the device gets errors, it goes into this failfast state machine, which decides that the errors indicate the device is GONE, at least for the moment, and it fails I/Os in various ways from there. Any new I/Os that are submitted will be failed (there's conditional behavior here: depending on a global setting, it's either all I/O or just B_FAILFAST I/O). ZFS appears to set this bit only in its discovery code, where a device not being there would significantly delay things. When the device returns (basically, an I/O gets through, or maybe some other event happens), the driver exits this mode and returns to normal operation. It appears to be designed not for the use case you described, but for a drive that's failing all over the place, so that any pending I/Os get out of the way quickly. Your use case is only superficially similar, so the Solaris / Illumos experience is mildly interesting but, due to the differences, not a strong argument for doing this. This facility in Illumos is interesting, but it would require significantly more retooling of the lower I/O layers in FreeBSD to implement fully. Plus, Illumos (or maybe just Solaris) has a daemon that looks at failures and manages them at a higher level, which might make for a better user experience for FreeBSD, so that's something that needs to be weighed as well.

We've known for some time that HDD retry algorithms take a long time. The same is true of some SSD and NVMe algorithms, but not all. The other objection I have to the 'noretry' naming is that it bakes the currently observed HDD behavior and recovery into the API. This is undesirable because other storage technologies have retry mechanisms that happen quite quickly (and sometimes in the drive itself). The cutoff between fast and slow recovery is device specific, as are the methods used. For example, there are new proposals out in NVMe (and maybe T10/T13 land) for new types of READ commands that specify the quality of service you expect, including some sort of deadline hint that clips how much effort is expended in trying to recover the data. It would be nice to design a mechanism that lets us start using these commands when drives that support them become available, and possibly to use timeouts to allow for a faster abort. Most of your HDD I/O will complete within maybe ~150ms, with a long tail out to maybe as long as ~400ms. It might be desirable to set a policy that says 'don't let any I/O remain in the device longer than a second' and use this mechanism to enforce it. Or: don't let any I/O last more than 20x the most recent median I/O time. A single bit is insufficiently expressive to allow these sorts of things, which is another reason for my objection to your proposal. With the QOS fields being independent, the clone routines just copy them and make no judgment about them.
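To sketch why the parameter field matters (BIO_QOS_DEADLINE, the 20x-median policy and the helpers below are invented purely for illustration, using the same stand-in struct as the earlier sketch):

/* Sketch only: a QoS value whose parameter carries a per-I/O time budget. */
#include <stdint.h>

#define	BIO_QOS_DEADLINE	2	/* bio_qos_param = time budget in ms */

struct bio_sketch {			/* stand-in for struct bio, as above */
	uint32_t	bio_qos;
	uint32_t	bio_qos_param;
	/* ... the rest of struct bio ... */
};

/* Upper layer: allow roughly 20x the recently observed median service time. */
static void
mark_deadline(struct bio_sketch *bp, uint32_t median_ms)
{
	bp->bio_qos = BIO_QOS_DEADLINE;
	bp->bio_qos_param = 20 * median_ms;
}

/* Clone path: no policy is applied here; the hint is copied verbatim. */
static void
clone_qos(struct bio_sketch *dst, const struct bio_sketch *src)
{
	dst->bio_qos = src->bio_qos;
	dst->bio_qos_param = src->bio_qos_param;
}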
So, those are my problems with your proposal, and also some hopefully useful ways to move forward. I've chatted with others for years about introducing QoS concepts into the I/O stack, so I know most of the above won't be too contentious (though I haven't socialized the ETIMEDOUT idea, so that may be an area of concern for some people).

Warner