Date:      Mon, 27 Nov 2017 08:29:11 -0700
From:      Scott Long <scottl@samsco.org>
To:        Andriy Gapon <avg@FreeBSD.org>
Cc:        Warner Losh <imp@bsdimp.com>, FreeBSD FS <freebsd-fs@freebsd.org>, freebsd-geom@freebsd.org
Subject:   Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
Message-ID:  <783CA790-C0D3-442D-A5A2-4CB424FCBD62@samsco.org>
In-Reply-To: <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org>
References:  <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <CANCZdfoE5UWMC6v4bbov6zizvcEMCbrSdGeJ019axCUfS_T_6w@mail.gmail.com> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <c9a96004-9998-c96d-efd7-d7e510c3c460@FreeBSD.org> <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org> <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org>



> On Nov 25, 2017, at 10:36 AM, Andriy Gapon <avg@FreeBSD.org> wrote:
>
> On 25/11/2017 12:54, Scott Long wrote:
>>
>>
>>> On Nov 24, 2017, at 10:17 AM, Andriy Gapon <avg@FreeBSD.org> wrote:
>>>
>>>
>>>>> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first read, get
>>>>> the error back sooner, and try the other disk sooner.  Only if I know that there
>>>>> are no other copies to try would I use the normal read with all the retrying.
>>>>>
>>>>
>>>> I agree with Warner that what you are proposing is not correct.  It weakens the
>>>> contract between the disk layer and the upper layers, making it less clear who is
>>>> responsible for retries and less clear what “EIO” means.  That contract is already
>>>> weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I
>>>> are working on a plan to fix that.
>>>
>>> Well...  I do realize now that there is some problem in this area; both you and
>>> Warner mentioned it.  But knowing that it exists is not the same as knowing what
>>> it is :-)
>>> I understand that it could be rather complex and not easy to describe in a short
>>> email…
>>>
>>
>> There are too many questions to ask; I will do my best to keep the conversation
>> logical.  First, how do you propose to distinguish between EIO due to a lengthy
>> set of timeouts and EIO due to an immediate error returned by the disk hardware?
>
> At what layer / component?
> If I am the issuer of the request, then I know how I issued that request and what
> kind of request it was.  If I am an intermediate layer, then what do I care?
>

I don’t understand your response; I think we are talking about two different things.

>> CAM has an extensive table-driven error recovery protocol whose purpose is to
>> decide whether or not to do retries based on hardware state information that is
>> not made available to the upper layers.  Do you have a test case that demonstrates
>> the problem that you’re trying to solve?  Maybe the error recovery table is wrong
>> and you’re encountering a case that should not be retried.  If that’s what’s going on,
>> we should fix CAM instead of inventing a new work-around.
>
> Let's assume that I am talking about the case of not being able to read an HDD
> sector that has gone bad.
> Here is a real-world example:
>
> Jun 16 10:40:18 trant kernel: ahcich0: NCQ error, slot = 20, port = -1
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60
> 00 00 58 62 40 2c 00 00 08 00 00
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR),
> error: 40 (UNC )
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00
> 00 00 00
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
> Jun 16 10:40:20 trant kernel: ahcich0: NCQ error, slot = 22, port = -1
...
> I do not see anything wrong in what CAM / ahci / ata_da did here.
> They did what I would expect them to do.  They tried very hard to get the data that
> I told them I need.

Two things I see here.  The first is that the drive is trying for 2 seconds to get good
data off of the media.  The second is that it’s failing and reporting the error as
uncorrectable.  I think that retries at the OS/driver layer are therefore useless; the
drive is already doing a bunch of its own retries and is failing, and is telling you that
it’s failing.  In the past, the conventional wisdom has been to do retries, because 30
years ago drives had minimal firmware and didn’t do a good job at managing data
integrity.  Now they do an extensive amount of analysis-driven error correction and
retries, so I think it’s time to change the conventional wisdom.  I’d propose that for
direct-attach SSDs and HDDs we treat this error as non-retriable.  Normally this would
be a one-line change, but I think it needs to be nuanced to distinguish between
optical drives, simple flash media drives, and regular HDDs and SSDs.

An interim solution would be to just set kern.cam.ada.retry_count to 0.
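For concreteness, that interim workaround is a one-line sysctl change. kern.cam.ada.retry_count is the existing ada(4) retry knob, and /etc/sysctl.conf is the standard place to make it persistent; note that, unlike the per-request flag proposed in this thread, this changes behavior globally for all ada devices:

```shell
# Make ada(4) fail I/O after the first error instead of retrying (runtime change):
sysctl kern.cam.ada.retry_count=0

# Persist the setting across reboots:
echo 'kern.cam.ada.retry_count=0' >> /etc/sysctl.conf
```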

>
> The timestamp of the first error is Jun 16 10:40:18.
> The timestamp of the last error is Jun 16 10:40:27.
> So, it took an additional 9 seconds to finally produce EIO.
> That disk is part of a ZFS mirror.  If the request had been failed after the first
> attempt, then ZFS would have been able to get the data from a good disk much sooner.

It’s the fault of ZFS for not reading all members in parallel.  I ask people to
consider and possibly accept that ZFS is not perfect.

>
> And don’t get me wrong, I do NOT want CAM or GEOM to make that decision by
> itself.  I want ZFS to be able to tell the lower layers when they should try as
> hard as they normally do, and when they should report an I/O error as soon as it
> happens, without any retries.
>

Again, I’m not sure I understand your answer.  It is the job of the disk drivers to
reliably complete storage requests, not to just try once and then assume that an
upper layer will figure things out.  As I said in my previous email, the upper layers,
and in this case I mean everything above the drivers and CAM, do not have the
error information to know why something failed, and thus have only limited
knowledge of what to do to remediate a failure.  The error recovery knowledge is
in CAM and the drivers.

It sounds like you’re proposing to make these simple, dumb layers and move the
error recovery up to ZFS.  I understand what you’re saying about wanting this to be
opt-in and not affect other filesystems, but I think that you are missing the point of
what CAM does and provides.  By definition, the ‘da’ and ‘ada’ drivers provide
reliable I/O transactions to the drive.  That means that they handle error recovery,
retries, etc.  If you do not want reliable I/O transactions, then you should not use
the da/ada drivers; instead you should either write your own periph driver (let’s
call it “zfsda”) or have ZFS communicate with the disks via the pass driver.  That is
what userland tools like cdparanoia do when they want to bypass the traditional
operational characteristics and constraints of the standard periphs.

Think of it this way: would you propose a flag for a TCP socket that tells TCP not
to do retransmits?  No!  If you wanted that semantic, then you’d use UDP and
invent your own retry/reliability layer.  I think it’s a big job to re-invent error
recovery in ZFS, while it would be relatively easy to adapt the CAM retry policy to
the behavior of modern disks, i.e. believe them when they say that there’s an
uncorrectable read error, and don’t blindly attempt a retry.

>> Second, what about disk subsystems that do retries internally, out of the control
>> of the FreeBSD driver?  This would include most hardware RAID controllers.
>> Should what you are proposing only work for a subset of the kinds of storage
>> systems that are available and in common use?
>
> Yes.  I do not worry about things that are beyond my control.
> Those subsystems would behave as they do now.  So, nothing would get worse.

It gets worse because it builds inconsistency into the system and makes
expectations for behavior unclear.  That turns into more hacky and incorrect code
over time.

>
>> Third, let’s say that you run out of alternate copies to try, and, as you stated
>> originally, that will force you to retry the copies that had returned EIO.  How
>> will you know when you can retry?  How will you know how many times you
>> will retry?  How will you know that a retry is even possible?
>
> I am continuing to use ZFS as an example.  It already has all the logic built
> in.  If all vdev zio-s (requests to mirror members, for example) fail, then
> their parent zio (a logical read from the mirror) will be retried (by ZFS), and
> when ZFS retries, it sets a special flag (ZIO_FLAG_IO_RETRY) on that zio and on
> its future child zio-s.
>
> Essentially, my answer is: you have to program it correctly; there is no magic.

Again, I think that you’re missing the point of what I’m asking.  All that ZFS
can see is “EIO”.  Maybe the EIO was generated because of an uncorrectable
media error, and all attempts to retry will be futile.  Maybe it was generated
because of a correctable error, and a retry might be useful.  There are many,
many, many error cases that are tested for and handled in CAM.  You are collapsing
them down to one case, and I don’t agree with that.

>
>> Should the retries
>> be able to be canceled?
>
> I think that this is an orthogonal question.
> I do not have any answer, and I am not ready to discuss this at all.
>

Please be ready to discuss this.

>> Why is overloading EIO so bad?  brelse() will call bdirty() when a BIO_WRITE
>> command has failed with EIO.  Calling bdirty() has the effect of retrying the I/O.
>> This disregards the fact that disk drivers only return EIO when they’ve decided
>> that the I/O cannot be retried.  It has no termination condition for the retries, and
>> will endlessly retry the I/O in vain; I’ve seen this quite frequently.  It also disregards
>> the fact that I/O marked as B_PAGING can’t be retried in this fashion, and will
>> trigger a panic.  Because we pretend that EIO can be retried, we are left with
>> a system that is very fragile when I/O actually does fail.  Instead of adding
>> more special cases and blurred lines, I want to go back to enforcing strict
>> contracts between the layers and force the core parts of the system to respect
>> those contracts and handle errors properly, instead of just retrying and
>> hoping for the best.
>
> So, I suggest that the buffer layer (all the b* functions) not use the
> proposed flag.  Any problems that exist in it should be resolved first.
> ZFS does not use that layer.
>

I think it’s foolish to propose a change in the disk API and then say “only
ZFS should use this.”

>>> But then, this flag is optional; it’s off by default, and no one is forced to
>>> use it.  If it’s used only by ZFS, then it would not be horrible.
>>> Unless it makes things very hard for the infrastructure.
>>> But I am circling back to not knowing what problem(s) you and Warner are
>>> planning to fix.
>>>
>>=20
>> Saying that a feature is optional means nothing; while consumers of the API
>> might be able to ignore it, the producers of the API cannot ignore it.  It is
>> these producers who are sick right now and should be fixed, instead of
>> creating new ways to get even more sick.
>
> I completely agree.
> But which producers of the API do you mean specifically?
> So far, you have mentioned only the consumer-level problems with the b-layer.
>
> Having said all of the above, I must admit one thing.
> When I proposed BIO_NORETRY I had only the simplest GEOM topology in mind:
> ZFS -> [partition] -> disk.
> Now that I start to think about more complex topologies, I am not really sure
> whether the flag can be handled reasonably and clearly by geom-s with complex
> internal behavior, or whether it will create a big mess.
> E.g., things like gmirrors on top of gmirrors and so on.
> Maybe the flag, if it is ever accepted, should never be propagated automatically.
> Only geom-s that are aware of it should propagate or request it.  That would be
> safer.
>

Yes, it gets chaotic and unreliable once you have multiple layers, each thinking
that it knows better than the previous layer.  That is the situation that FreeBSD
is currently in with CAM, GEOM, and VFS-BIO.  I do not propose to make it even
more chaotic.  From just basic rules of abstraction, basic computer science
principles, it follows that the layers of software closest to the operation (in this
case, the hardware) have the least abstraction and the most information on how
to act.  Saying that ZFS knows best how the hardware works, when it has no
access to the actual control-channel information of the hardware, is confusing
and illogical to me.

I think that there’s a legitimate problem with CAM error recovery trying too hard
to do retries when the disk is clearly telling it that it’s futile.  I don’t have a quick
patch for that because I would like to think about how to make it dependent on
the media type.  However, it’s something that I’m willing to discuss further and
work on.

Scott



