From owner-freebsd-fs@freebsd.org  Sat Nov 25 17:36:31 2017
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9894ADEBB29;
 Sat, 25 Nov 2017 17:36:31 +0000 (UTC)
 (envelope-from agapon@gmail.com)
Received: from mail-lf0-f45.google.com (mail-lf0-f45.google.com
 [209.85.215.45])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 173D96767F;
 Sat, 25 Nov 2017 17:36:30 +0000 (UTC)
 (envelope-from agapon@gmail.com)
Received: by mail-lf0-f45.google.com with SMTP id k66so28461669lfg.3;
 Sat, 25 Nov 2017 09:36:30 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:subject:to:cc:references:from:message-id:date
 :user-agent:mime-version:in-reply-to:content-language
 :content-transfer-encoding;
 bh=iOl1D78pVkm1p2WO2iSswIdicfhFfzoRB2Bc9x8J0XA=;
 b=FwAhgIqykPRVE9Q778D+OPfmyn95E5zd8SIsartEGdWg1Ssj9VpFs/Gvcwq5N5rI9H
 8jCfpIwC2dfgsmCoT7YdNA79XbZLsWf5M6TxrkJ5wTl7EiCNWYVFQHmxM/tN758mxqGU
 CXD6NL6lpvPFgTIjKy8/3ruVgDirqO4wOwg+LFRlaXftY4mRdtYkpTaTZGIuE3sk1fRS
 tMSOrAqrkxaNCqFBIDnxdKwu19Mqw2DTfK9+fcU3dNtKwo/tyRcsZVseXfphZ/t0lpFo
 uTO7GWHARKCOSkjmUCmBoJwIxNjPyzyNYRU+oq2OoJexdlVuqRKjQrMZexOnAD8HI3ZU
 Aplg==
X-Gm-Message-State: AJaThX4VvLGCfaJtIBGAuRmop0Jb1Xk7CKTLNuZkpUJ67LBcdxotKHfj
 520NpS6rN+tiPce5ICUf/OhRAutEV/c=
X-Google-Smtp-Source: AGs4zMbBpiuwlbgmOFur1itv4JJ+4S3jA/+BP4yojPXM0feB1c+MAnucw6G9lWk0Z6UcVuGQhWCJQQ==
X-Received: by 10.46.84.86 with SMTP id y22mr4011842ljd.89.1511631382512;
 Sat, 25 Nov 2017 09:36:22 -0800 (PST)
Received: from [192.168.0.88] (east.meadow.volia.net. [93.72.151.96])
 by smtp.googlemail.com with ESMTPSA id f66sm1845239lfl.72.2017.11.25.09.36.20
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Sat, 25 Nov 2017 09:36:20 -0800 (PST)
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS
 vdev_geom
To: Scott Long <scottl@samsco.org>
Cc: Warner Losh <imp@bsdimp.com>, FreeBSD FS <freebsd-fs@freebsd.org>,
 freebsd-geom@freebsd.org
References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org>
 <CANCZdfoE5UWMC6v4bbov6zizvcEMCbrSdGeJ019axCUfS_T_6w@mail.gmail.com>
 <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org>
 <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org>
 <c9a96004-9998-c96d-efd7-d7e510c3c460@FreeBSD.org>
 <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org>
From: Andriy Gapon <avg@FreeBSD.org>
Message-ID: <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org>
Date: Sat, 25 Nov 2017 19:36:19 +0200
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:52.0) Gecko/20100101
 Thunderbird/52.5.0
MIME-Version: 1.0
In-Reply-To: <DC23D104-F5F3-4844-8638-4644DC9DD411@samsco.org>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.25
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 25 Nov 2017 17:36:31 -0000

On 25/11/2017 12:54, Scott Long wrote:
> 
> 
>> On Nov 24, 2017, at 10:17 AM, Andriy Gapon <avg@FreeBSD.org> wrote:
>>
>>
>>>> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first read, get
>>>> the error back sooner and try the other disk sooner.  Only if I know that there
>>>> are no other copies to try, then I would use the normal read with all the retrying.
>>>>
>>>
>>> I agree with Warner that what you are proposing is not correct.  It weakens the
>>> contract between the disk layer and the upper layers, making it less clear who is
>>> responsible for retries and less clear what “EIO” means.  That contract is already
>>> weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I
>>> are working on a plan to fix that.
>>
>> Well...  I do realize now that there is some problem in this area, both you and
>> Warner mentioned it.  But knowing that it exists is not the same as knowing what
>> it is :-)
>> I understand that it could be rather complex and not easy to describe in a short
>> email…
>>
> 
> There are too many questions to ask, I will do my best to keep the conversation
> logical.  First, how do you propose to distinguish between EIO due to a lengthy
> set of timeouts, vs EIO due to an immediate error returned by the disk hardware?

At what layer / component?
If I am the issuer of the request then I know how I issued that request and what
kind of request it was.  If I am an intermediate layer, then what do I care.

> CAM has an extensive table-driven error recovery protocol who’s purpose is to
> decide whether or not to do retries based on hardware state information that is
> not made available to the upper layers.  Do you have a test case that demonstrates
> the problem that you’re trying to solve?  Maybe the error recovery table is wrong
> and you’re encountering a case that should not be retried.  If that’s what’s going on,
> we should fix CAM instead of inventing a new work-around.

Let's assume that I am talking about the case of not being able to read an HDD
sector that is gone bad.
Here is a real world example:

Jun 16 10:40:18 trant kernel: ahcich0: NCQ error, slot = 20, port = -1
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60
00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR),
error: 40 (UNC )
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00
00 00 00
Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun 16 10:40:20 trant kernel: ahcich0: NCQ error, slot = 22, port = -1
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60
00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR),
error: 40 (UNC )
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00
00 00 00
Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun 16 10:40:22 trant kernel: ahcich0: NCQ error, slot = 24, port = -1
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60
00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR),
error: 40 (UNC )
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00
00 00 00
Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun 16 10:40:25 trant kernel: ahcich0: NCQ error, slot = 26, port = -1
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60
00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR),
error: 40 (UNC )
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00
00 00 00
Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
Jun 16 10:40:27 trant kernel: ahcich0: NCQ error, slot = 28, port = -1
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60
00 00 58 62 40 2c 00 00 08 00 00
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR),
error: 40 (UNC )
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00
00 00 00
Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): Error 5, Retries exhausted

I do not see anything wrong in what CAM / ahci /ata_da did here.
They did what I would expect them to do.  They tried very hard to get data that
I told them I need.

Timestamp of the first error is Jun 16 10:40:18.
Timestamp of the last error is Jun 16 10:40:27.
So, it took additional 9 seconds to finally produce EIO.
That disk is a part of a ZFS mirror.  If the request was failed after the first
attempt, then ZFS would be able to get the data from a good disk much sooner.

And don't take me wrong, I do NOT want CAM or GEOM to make that decision by
itself.  I want ZFS to be able to tell the lower layers when they should try as
hard as they normally do and when they should report an I/O error as soon as it
happens without any retries.

> Second, what about disk subsystems that do retries internally, out of the control
> of the FreeBSD driver?  This would include most hardware RAID controllers.
> Should what you are proposing only work for a subset of the kinds of storage
> systems that are available and in common use?

Yes.  I do not worry about things that are beyond my control.
Those subsystems would behave as they do now.  So, nothing would get worse.

> Third, let’s say that you run out of alternate copies to try, and as you stated
> originally, that will force you to retry the copies that had returned EIO.  How
> will you know when you can retry?  How will you know how many times you
> will retry?  How will you know that a retry is even possible?

I am continuing to use ZFS as an example.  It already has all the logic built
in.  If all vdev zio-s (requests to mirror members as an example) fail, then
their parent zio (a logical read from the mirror) will be retried (by ZFS) and
when ZFS retries it sets a special flag (ZIO_FLAG_IO_RETRY) on that zio and on
its future child zio-s.

Essentially, my answer is you have to program it correctly, there is no magic.

> Should the retries
> be able to be canceled?

I think that this is an orthogonal question.
I do not have any answer and I am not ready to discuss this at all.

> Why is overloading EIO so bad?  brelse() will call bdirty() when a BIO_WRITE
> command has failed with EIO.  Calling bdirty() has the effect of retrying the I/O.
> This disregards the fact that disk drivers only return EIO when they’ve decided
> that the I/O cannot be retried.  It has no termination condition for the retries, and
> will endlessly retry I/O in vain; I’ve seen this quite frequently.  It also disregards
> the fact that I/O marked as B_PAGING can’t be retried in this fashion, and will
> trigger a panic.  Because we pretend that EIO can be retried, we are left with
> a system that is very fragile when I/O actually does fail.  Instead of adding
> more special cases and blurred lines, I want to go back to enforcing strict
> contracts between the layers and force the core parts of the system to respect
> those contracts and handle errors properly, instead of just retrying and
> hoping for the best.

So, I suggest that the buffer layer (all the b* functions) does not use the
proposed flag.  Any problems that exist in it should be resolved first.
ZFS does not use that layer.

>> But then, this flag is optional, it's off by default and no one is forced to
>> used it.  If it's used only by ZFS, then it would not be horrible.
>> Unless it makes things very hard for the infrastructure.
>> But I am circling back to not knowing what problem(s) you and Warner are
>> planning to fix.
>>
> 
> Saying that a feature is optional means nothing; while consumers of the API
> might be able to ignore it, the producers of the API cannot ignore it.  It is
> these producers who are sick right now and should be fixed, instead of
> creating new ways to get even more sick.

I completely agree.
But which producers of the API do you mean specifically?
So far, you mentioned only the consumer level problems with the b-layer.

Having said all of the above, I must admit one thing.
When I proposed BIO_NORETRY I had only the simplest GEOM topology in mind:
ZFS -> [partition] -> disk.
Now that I start to think about more complex topologies I am not really sure how
the flag should be handled by geom-s with complex internal behavior.  If that
can be implemented reasonably and clearly, if the flag will create a big mess.
E.g., things like gmirrors on top of gmirrors and so on.
Maybe the flag, if it ever accepted, should never be propagated automatically.
Only geom-s that are aware of it should propagate or request it.  That should be
safer.

-- 
Andriy Gapon