Date:      Sat, 25 Nov 2017 03:54:01 -0700
From:      Scott Long <scottl@samsco.org>
To:        Andriy Gapon <avg@FreeBSD.org>
Cc:        Warner Losh <imp@bsdimp.com>, FreeBSD FS <freebsd-fs@freebsd.org>, freebsd-geom@freebsd.org
Subject:   Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
Message-ID:  <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org>
In-Reply-To: <c9a96004-9998-c96d-efd7-d7e510c3c460@FreeBSD.org>
References:  <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <CANCZdfoE5UWMC6v4bbov6zizvcEMCbrSdGeJ019axCUfS_T_6w@mail.gmail.com> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <c9a96004-9998-c96d-efd7-d7e510c3c460@FreeBSD.org>



> On Nov 24, 2017, at 10:17 AM, Andriy Gapon <avg@FreeBSD.org> wrote:
> 
> 
>>> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first read, get
>>> the error back sooner and try the other disk sooner.  Only if I know that there
>>> are no other copies to try, then I would use the normal read with all the retrying.
>>> 
>> 
>> I agree with Warner that what you are proposing is not correct.  It weakens the
>> contract between the disk layer and the upper layers, making it less clear who is
>> responsible for retries and less clear what "EIO" means.  That contract is already
>> weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I
>> are working on a plan to fix that.
> 
> Well...  I do realize now that there is some problem in this area, both you and
> Warner mentioned it.  But knowing that it exists is not the same as knowing what
> it is :-)
> I understand that it could be rather complex and not easy to describe in a short
> email…
> 

There are many questions to ask here; I will do my best to keep the conversation
logical.  First, how do you propose to distinguish between EIO due to a lengthy
set of timeouts, versus EIO due to an immediate error returned by the disk hardware?
CAM has an extensive table-driven error recovery protocol whose purpose is to
decide whether or not to do retries based on hardware state information that is
not made available to the upper layers.  Do you have a test case that demonstrates
the problem that you're trying to solve?  Maybe the error recovery table is wrong
and you're encountering a case that should not be retried.  If that's what's going on,
we should fix CAM instead of inventing a new work-around.
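
As a rough illustration of what a table-driven retry-decision policy looks like, here is a toy userland sketch.  The error kinds, table entries, and the `should_retry()` helper are all invented for illustration; they are not CAM's actual definitions or interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of a table-driven retry policy, loosely in the spirit of
 * CAM's error recovery tables.  Error kinds and actions are made up.
 */
enum err_kind { ERR_TIMEOUT, ERR_MEDIUM, ERR_NOT_READY, ERR_ILLEGAL_REQ };

struct recovery_entry {
	enum err_kind	kind;
	bool		retry;		/* is a retry worthwhile at all? */
	int		max_retries;	/* explicit termination condition */
};

static const struct recovery_entry recovery_table[] = {
	{ ERR_TIMEOUT,     true,  3 },	/* transient: retry a few times */
	{ ERR_MEDIUM,      true,  1 },	/* possibly recoverable once */
	{ ERR_NOT_READY,   true,  5 },	/* e.g. waiting for spin-up */
	{ ERR_ILLEGAL_REQ, false, 0 },	/* hard error: never retry */
};

/* Decide, from the table, whether another retry is allowed. */
static bool
should_retry(enum err_kind kind, int retries_so_far)
{
	for (size_t i = 0;
	    i < sizeof(recovery_table) / sizeof(recovery_table[0]); i++) {
		if (recovery_table[i].kind == kind)
			return (recovery_table[i].retry &&
			    retries_so_far < recovery_table[i].max_retries);
	}
	return (false);	/* unknown errors are not retried */
}
```

The point of the table is that the retry/no-retry decision is made where the hardware state is visible, which is exactly the information the upper layers never see.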

Second, what about disk subsystems that do retries internally, out of the control
of the FreeBSD driver?  This would include most hardware RAID controllers.
Should what you are proposing only work for a subset of the kinds of storage
systems that are available and in common use?

Third, let's say that you run out of alternate copies to try, and as you stated
originally, that will force you to retry the copies that had returned EIO.  How
will you know when you can retry?  How will you know how many times you
will retry?  How will you know that a retry is even possible?  Should the retries
be cancelable?
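
For reference, the strategy Andriy proposed upthread (fast-fail reads against each copy first, a fully retried read only as a last resort) might be sketched like this.  `read_copy` and `read_with_fallback` are hypothetical stand-ins, not real GEOM or ZFS vdev interfaces:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the two-pass read strategy: a fast-fail
 * (BIO_NORETRY-style) attempt against every copy first, and only then
 * a fall back to full driver-level retries on each copy in turn.
 */
typedef int (*read_fn)(int copy, bool fast_fail);

static int
read_with_fallback(read_fn read_copy, int ncopies)
{
	/* Pass 1: fail fast, so a healthy mirror can answer quickly. */
	for (int c = 0; c < ncopies; c++) {
		if (read_copy(c, true) == 0)
			return (0);
	}
	/* Pass 2: no copy answered; retry each with full recovery. */
	for (int c = 0; c < ncopies; c++) {
		if (read_copy(c, false) == 0)
			return (0);
	}
	return (-1);	/* all copies exhausted */
}
```

Scott's questions above are precisely about the second pass of such a scheme: what guarantees, if any, the lower layers give that the full-recovery reads are still possible after the fast-fail pass returned EIO.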

Why is overloading EIO so bad?  brelse() will call bdirty() when a BIO_WRITE
command has failed with EIO.  Calling bdirty() has the effect of retrying the I/O.
This disregards the fact that disk drivers only return EIO when they've decided
that the I/O cannot be retried.  It has no termination condition for the retries, and
will endlessly retry I/O in vain; I've seen this quite frequently.  It also disregards
the fact that I/O marked as B_PAGING can't be retried in this fashion, and will
trigger a panic.  Because we pretend that EIO can be retried, we are left with
a system that is very fragile when I/O actually does fail.  Instead of adding
more special cases and blurred lines, I want to go back to enforcing strict
contracts between the layers and force the core parts of the system to respect
those contracts and handle errors properly, instead of just retrying and
hoping for the best.
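
For contrast with the unbounded bdirty() loop described above, a retry policy with an explicit termination condition looks something like the toy sketch below; `write_buf`, the retry cap, and `bounded_write` are illustrative inventions, not actual buffer-cache code:

```c
#include <assert.h>

/*
 * Toy illustration of a bounded write-retry policy.  Unlike the
 * brelse()/bdirty() pattern, which re-marks the buffer dirty on every
 * EIO forever, this loop has an explicit termination condition and
 * propagates the error upward when the budget is exhausted.
 */
#define MAX_WRITE_RETRIES 5

static int
bounded_write(int (*write_buf)(void), int *attempts)
{
	for (*attempts = 0; *attempts < MAX_WRITE_RETRIES; (*attempts)++) {
		if (write_buf() == 0)
			return (0);
	}
	/* Budget exhausted: give up and report the error upward. */
	return (-1);
}
```

The cap itself is not the interesting part; the interesting part is that someone must own the decision to stop, which is the contract question Scott raises.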


> But then, this flag is optional, it's off by default and no one is forced to
> use it.  If it's used only by ZFS, then it would not be horrible.
> Unless it makes things very hard for the infrastructure.
> But I am circling back to not knowing what problem(s) you and Warner are
> planning to fix.
> 

Saying that a feature is optional means nothing; while consumers of the API
might be able to ignore it, the producers of the API cannot ignore it.  It is
these producers who are sick right now and should be fixed, instead of
creating new ways to get even more sick.

Scott



