From owner-freebsd-fs@freebsd.org  Sat Nov 25 10:54:06 2017
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 11.1 \(3445.4.7\))
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS
 vdev_geom
From: Scott Long <scottl@samsco.org>
In-Reply-To: <c9a96004-9998-c96d-efd7-d7e510c3c460@FreeBSD.org>
Date: Sat, 25 Nov 2017 03:54:01 -0700
Cc: Warner Losh <imp@bsdimp.com>, FreeBSD FS <freebsd-fs@freebsd.org>,
 freebsd-geom@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org>
References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org>
 <CANCZdfoE5UWMC6v4bbov6zizvcEMCbrSdGeJ019axCUfS_T_6w@mail.gmail.com>
 <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org>
 <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org>
 <c9a96004-9998-c96d-efd7-d7e510c3c460@FreeBSD.org>
To: Andriy Gapon <avg@FreeBSD.org>


> On Nov 24, 2017, at 10:17 AM, Andriy Gapon <avg@FreeBSD.org> wrote:
>
>
>>> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first read, get
>>> the error back sooner and try the other disk sooner.  Only if I know that there
>>> are no other copies to try, then I would use the normal read with all the retrying.
>>>
>>
>> I agree with Warner that what you are proposing is not correct.  It weakens the
>> contract between the disk layer and the upper layers, making it less clear who is
>> responsible for retries and less clear what “EIO” means.  That contract is already
>> weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I
>> are working on a plan to fix that.
>
> Well...  I do realize now that there is some problem in this area, both you and
> Warner mentioned it.  But knowing that it exists is not the same as knowing what
> it is :-)
> I understand that it could be rather complex and not easy to describe in a short
> email…
>

There are many questions to ask, so I will do my best to keep the conversation
logical.  First, how do you propose to distinguish between EIO due to a lengthy
set of timeouts, vs EIO due to an immediate error returned by the disk hardware?
CAM has an extensive table-driven error recovery protocol whose purpose is to
decide whether or not to do retries based on hardware state information that is
not made available to the upper layers.  Do you have a test case that demonstrates
the problem that you’re trying to solve?  Maybe the error recovery table is wrong
and you’re encountering a case that should not be retried.  If that’s what’s going on,
we should fix CAM instead of inventing a new work-around.
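
To make the first question concrete, here is a minimal sketch of what a
table-driven recovery decision looks like in general.  It is not CAM's actual
table or API: the status codes, the table entries, the retry limits, and the
names decide_recovery() and final_errno() are all made up for illustration.
What it shows is that an immediate hardware error and an exhausted series of
timeouts both reach the upper layer as the same EIO.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical device outcomes; real drivers see SCSI sense data or ATA status. */
enum dev_status {
    DEV_OK,
    DEV_BUSY,            /* transient: device asked us to come back later */
    DEV_TIMEOUT,         /* command timed out */
    DEV_MEDIUM_ERROR,    /* drive reports the data cannot be read */
    DEV_ILLEGAL_REQUEST  /* the command itself is invalid */
};

enum recovery_action { ACT_DONE, ACT_RETRY, ACT_FAIL };

struct recovery_entry {
    enum dev_status      status;
    enum recovery_action action;
    int                  max_retries;  /* only meaningful for ACT_RETRY */
};

/* Illustrative policy table; the limits are invented for this sketch. */
static const struct recovery_entry recovery_table[] = {
    { DEV_OK,              ACT_DONE,  0 },
    { DEV_BUSY,            ACT_RETRY, 4 },
    { DEV_TIMEOUT,         ACT_RETRY, 2 },
    { DEV_MEDIUM_ERROR,    ACT_FAIL,  0 },
    { DEV_ILLEGAL_REQUEST, ACT_FAIL,  0 },
};

/* Look up the status and decide what to do, given the retries already spent. */
enum recovery_action
decide_recovery(enum dev_status status, int retries_done)
{
    size_t i, n = sizeof(recovery_table) / sizeof(recovery_table[0]);

    for (i = 0; i < n; i++) {
        const struct recovery_entry *e = &recovery_table[i];

        if (e->status != status)
            continue;
        if (e->action == ACT_RETRY && retries_done >= e->max_retries)
            return ACT_FAIL;    /* retry budget exhausted */
        return e->action;
    }
    return ACT_FAIL;            /* unknown status: fail, do not retry blindly */
}

#define STILL_RETRYING (-1)     /* sentinel: no final answer yet */

/* What the upper layer eventually sees: success or a bare EIO. */
int
final_errno(enum dev_status status, int retries_done)
{
    switch (decide_recovery(status, retries_done)) {
    case ACT_DONE:
        return 0;
    case ACT_RETRY:
        return STILL_RETRYING;
    default:
        return EIO;
    }
}
```

In this sketch final_errno(DEV_MEDIUM_ERROR, 0) and final_errno(DEV_TIMEOUT, 2)
return the same EIO, although one failed at once and the other only after its
retries were used up.  The state needed to tell them apart stays in the lower
layer.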

Second, what about disk subsystems that do retries internally, out of the control
of the FreeBSD driver?  This would include most hardware RAID controllers.
Should what you are proposing only work for a subset of the kinds of storage
systems that are available and in common use?

Third, let’s say that you run out of alternate copies to try, and as you stated
originally, that will force you to retry the copies that had returned EIO.  How
will you know when you can retry?  How will you know how many times you
will retry?  How will you know that a retry is even possible?  Should the retries
be able to be canceled?
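
As I read the proposal, the control flow would look roughly like the sketch
below.  This is only my paraphrase, not code from the patch: mirror_read(),
the read_fn callback, and the demo_read() stand-in are hypothetical, and the
fail_fast argument merely stands in for BIO_NORETRY.  The comment in the second
pass marks where the questions above have no answer yet.

```c
#include <errno.h>

/* Hypothetical per-copy read; fail_fast != 0 stands in for BIO_NORETRY. */
typedef int (*read_fn)(int copy, int fail_fast);

/*
 * Try every copy in fail-fast mode first, then fall back to reads with
 * full retries.  Returns 0 and sets *copy_used on success, EIO otherwise.
 */
int
mirror_read(read_fn rd, int ncopies, int *copy_used)
{
    int i;

    /* Pass 1: ask each copy to give up quickly. */
    for (i = 0; i < ncopies; i++) {
        if (rd(i, 1) == 0) {
            *copy_used = i;
            return 0;
        }
    }

    /*
     * Pass 2: out of alternates, so go back to copies that already
     * returned EIO.  Nothing here says whether a retry is possible,
     * how many attempts are allowed, or how to cancel them.
     */
    for (i = 0; i < ncopies; i++) {
        if (rd(i, 0) == 0) {
            *copy_used = i;
            return 0;
        }
    }
    return EIO;
}

/* Demo stand-in: copy 0 succeeds only with full retries, others always fail. */
static int demo_calls;

static int
demo_read(int copy, int fail_fast)
{
    demo_calls++;
    return (copy == 0 && !fail_fast) ? 0 : EIO;
}
```

Calling mirror_read(demo_read, 2, &used) with this stand-in succeeds on copy 0,
but only on the third read, after both copies have already reported EIO once.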

Why is overloading EIO so bad?  brelse() will call bdirty() when a BIO_WRITE
command has failed with EIO.  Calling bdirty() has the effect of retrying the I/O.
This disregards the fact that disk drivers only return EIO when they’ve decided
that the I/O cannot be retried.  It has no termination condition for the retries, and
will endlessly retry I/O in vain; I’ve seen this quite frequently.  It also disregards
the fact that I/O marked as B_PAGING can’t be retried in this fashion, and will
trigger a panic.  Because we pretend that EIO can be retried, we are left with
a system that is very fragile when I/O actually does fail.  Instead of adding
more special cases and blurred lines, I want to go back to enforcing strict
contracts between the layers and force the core parts of the system to respect
those contracts and handle errors properly, instead of just retrying and
hoping for the best.
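
A toy model of the difference, for illustration only.  This is not the
brelse()/bdirty() code; driver_write() and both flush functions are
hypothetical, and the flush_passes bound exists only so that the example
terminates.

```c
#include <errno.h>

/* Hypothetical driver: this block has failed permanently. */
static int driver_calls;

static int
driver_write(void)
{
    driver_calls++;
    return EIO;     /* by contract, the driver has already given up */
}

/*
 * The pattern criticized above: a failed write leaves the buffer dirty,
 * so every later flush pass submits it again.  The loop has no
 * termination condition of its own; flush_passes only bounds the demo.
 * Returns 1 if the buffer is still dirty at the end.
 */
int
flush_redirty_on_error(int flush_passes)
{
    int dirty = 1, pass;

    for (pass = 0; pass < flush_passes && dirty; pass++) {
        if (driver_write() == 0)
            dirty = 0;
        /* on EIO the buffer stays dirty and is tried again */
    }
    return dirty;
}

/*
 * Respecting the contract: EIO is final, so the error is reported to the
 * caller once and the buffer is not queued for another attempt.
 */
int
flush_respecting_contract(void)
{
    return driver_write();
}
```

With a permanently failing block, flush_redirty_on_error(1000) issues 1000
writes and still ends with a dirty buffer, while flush_respecting_contract()
issues one write and hands EIO to a caller that can act on it.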


> But then, this flag is optional, it's off by default and no one is forced to
> used it.  If it's used only by ZFS, then it would not be horrible.
> Unless it makes things very hard for the infrastructure.
> But I am circling back to not knowing what problem(s) you and Warner are
> planning to fix.
>

Saying that a feature is optional means nothing; while consumers of the API
might be able to ignore it, the producers of the API cannot ignore it.  It is
these producers who are sick right now and should be fixed, instead of
creating new ways to get even more sick.

Scott