Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
From: Andriy Gapon
To: Warner Losh
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Date: Tue, 12 Dec 2017 18:36:49 +0200

On 26/11/2017 00:17, Warner Losh wrote:
> On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon wrote:
>
>> Before anything else, I would like to say that I got an impression that we
>> speak from so different angles that we either don't understand each other's
>> words or, even worse, misinterpret them.
>
> I understand what you are suggesting.  Don't take my disagreement with your
> proposal as willful misinterpretation.  You are proposing something that's a
> quick hack.

Very true.
> Maybe a useful one, but it's still problematic because it has the upper
> layers telling the lower layers what to do (don't do your retry), rather
> than what service to provide (I prefer a fast error exit over every effort
> to recover the data).

Also true.

> And it also does it by overloading the meaning of EIO, which has real
> problems that you've not been open to listening to, I assume due to your
> narrow use case apparently blinding you to the bigger-picture issues with
> that route.

Quite likely.

> However, there's a way forward which I think will solve these objections.
> First, designate that I/O that fails because the normal recovery process was
> short-circuited returns ETIMEDOUT.  The I/O stack currently doesn't use this
> at all (it was introduced for the network side of things).  This is a
> general catch-all for an I/O that we complete before the lower layers have
> given it the maximum amount of effort to recover the data, at the user's
> request.  Next, don't use a flag.  Instead add a 32-bit field called bio_qos
> for quality-of-service hints and another 32-bit field for bio_qos_param.
> This allows us to pass down specific quality-of-service desires from the
> filesystem to the lower layers.  The parameter will be unused in your
> proposal.  BIO_QOS_FAIL_EARLY may be a good name for a value to set it to
> (at the moment, just use 1).  We'll assign the other QOS values later for
> other things.  It would allow us to implement the other sorts of QoS things
> I talked about as well.

That's a very interesting and workable suggestion.  I will try to work on it.

> As for B_FAILFAST, it's quite unlike what you're proposing, except in one
> incidental detail.  It's a complicated state machine that the sd driver in
> Solaris implemented.  It's an entire protocol.  When the device gets errors,
> it goes into this failfast state machine.  The state machine makes a
> determination that the errors are indicators the device is GONE, at least
> for the moment, and it will fail I/Os in various ways from there.  Any new
> I/Os that are submitted will be failed (there's conditional behavior here:
> depending on a global setting it's either all I/O or just B_FAILFAST I/O).

Yeah, I realized that B_FAILFAST was quite different from the first impression
that I got from its name.  Thank you for doing and sharing your analysis of
how it actually works.

> ZFS appears to set this bit for its discovery code only, when a device not
> being there would significantly delay things.

I think that ZFS sets the bit for all 'first-attempt' I/O.  It's the various
retries / recovery where this bit is not set.

> Anyway, when the device returns (basically an I/O gets through or maybe some
> other event happens), the driver exits this mode and returns to normal
> operation.  It appears to be designed not for the use case that you
> described, but rather for a drive that's failing all over the place, so that
> any pending I/Os get out of the way quickly.  Your use case is only
> superficially similar to that use case, so the Solaris / Illumos experiences
> are mildly interesting, but due to the differences not a strong argument for
> doing this.  This facility in Illumos is interesting, but would require
> significantly more retooling of the lower I/O layers in FreeBSD to implement
> fully.  Plus Illumos (or maybe just Solaris) has a daemon that looks at
> failures to manage them at a higher level, which might make for a better
> user experience for FreeBSD, so that's something that needs to be weighed as
> well.

Okay.
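To make the bio_qos idea above concrete, here is a minimal sketch of how it
could look.  Only bio_qos, bio_qos_param and BIO_QOS_FAIL_EARLY come from the
paragraph above; the surrounding struct layout, the consumer snippet and the
use of biofinish() are illustrative only, not part of the proposal.

    /* sys/sys/bio.h (sketch): two new hint fields in struct bio. */
    struct bio {
            /* ... existing fields elided ... */
            uint32_t bio_qos;        /* quality-of-service hint */
            uint32_t bio_qos_param;  /* per-hint parameter, unused for now */
    };

    #define BIO_QOS_FAIL_EARLY  1    /* prefer a fast error exit to full recovery */

    /* A consumer (e.g. a first-attempt read in vdev_geom) could ask for the
     * fast error exit before scheduling the I/O: */
    bp->bio_qos = BIO_QOS_FAIL_EARLY;
    g_io_request(bp, cp);

    /* A driver that honors the hint and cuts recovery short would then
     * complete the bio with ETIMEDOUT rather than EIO: */
    if (bp->bio_qos == BIO_QOS_FAIL_EARLY)
            biofinish(bp, NULL, ETIMEDOUT);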
> We've known for some time that HDD retry algorithms take a long time.  The
> same is true of some SSD or NVMe algorithms, but not all.  The other
> objection I have to the 'noretry' naming is that it bakes the currently
> observed HDD behavior and recovery into the API.  This is undesirable, as
> other storage technologies have retry mechanisms that happen quite quickly
> (and sometimes in the drive itself).  The cutoff between fast and slow
> recovery is device specific, as are the methods used.  For example, there
> are new proposals out in NVMe (and maybe T10/T13 land) to have new types of
> READ commands that specify the quality of service you expect, including
> providing some sort of deadline hint to clip how much effort is expended in
> trying to recover the data.  It would be nice to design a mechanism that
> allows us to start using these commands when drives are available with them,
> and possibly using timeouts to allow for a faster abort.  Most of your HDD
> I/O will complete within maybe ~150ms, with a long tail out to maybe as long
> as ~400ms.  It might be desirable to set a policy that says 'don't let any
> I/Os remain in the device longer than a second' and use this mechanism to
> enforce that.  Or don't let any I/Os last more than 20x the most recent
> median I/O time.  A single bit is insufficiently expressive to allow these
> sorts of things, which is another reason for my objection to your proposal.
> With the QOS fields being independent, the clone routines just copy them and
> make no value judgement on them.

I now agree with this.  Thank you for the detailed explanation.

> So, those are my problems with your proposal, and also some hopefully useful
> ways to move forward.  I've chatted with others for years about introducing
> QoS things into the I/O stack, so I know most of the above won't be too
> contentious (though ETIMEDOUT I haven't socialized, so that may be an area
> of concern for people).

Thank you!

-- 
Andriy Gapon