From: Warner Losh <wlosh@bsdimp.com>
Date: Sat, 25 Nov 2017 10:57:38 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: Scott Long, FreeBSD FS, freebsd-geom@freebsd.org

On Sat, Nov 25, 2017 at 10:36 AM, Andriy Gapon wrote:
>
> Timestamp of the first error is Jun 16 10:40:18.
> Timestamp of the last error is Jun 16 10:40:27.
> So, it took additional 9 seconds to finally produce EIO.
> That disk is a part of a ZFS mirror. If the request was failed after the first
> attempt, then ZFS would be able to get the data from a good disk much sooner.
>
> And don't take me wrong, I do NOT want CAM or GEOM to make that decision by
> itself. I want ZFS to be able to tell the lower layers when they should try as
> hard as they normally do and when they should report an I/O error as soon as it
> happens without any retries.

Let's walk through this. You see that it takes a long time to fail an I/O. Perfectly reasonable observation. There are two reasons for this. One is that the disks take a while to make an attempt to get the data. The second is that the system has a global policy that's biased towards 'recover the data' over 'fail fast'. These can be fixed by reducing the timeouts, or by lowering the read-retry count for a given drive or globally, as a policy decision made by the system administrator.

It may be perfectly reasonable to ask the lower layers to 'fail fast' and have either a hard or a soft deadline on the I/O for a subset of I/Os. A hard deadline would return ETIMEDOUT or something when it's passed and cancel the I/O. This gives better determinism in the system, but some systems can't cancel just 1 I/O (like SATA drives), so we have to flush the whole queue. If we get a lot of these, performance suffers. However, for some classes of drives, you know that if an I/O doesn't succeed in 1s after you submit it to the drive, it's unlikely to complete successfully, and it's worth the performance hit on a drive that's already acting up.

You could instead have a soft timeout, which says 'don't take any additional action after X time has elapsed and you get word about this I/O'. This is similar to the hard timeout, but just stops retrying after the deadline has passed. This scenario is better for the other users of the drive, assuming that the read-recovery operations aren't starving them. It's also easier to implement, but has worse worst-case performance characteristics.

You aren't asking to limit retries. You're really asking the I/O subsystem to limit, where it can, the amount of time spent on an I/O so you can try another one. Your means of doing this is to tell it not to retry. That's the wrong means.
It shouldn't be listed in the API as a 'NO RETRY' request. It should be a QoS request flag: fail fast. Part of why I'm being so difficult is that you don't understand this and are proposing a horrible API. It should have a different name. The other reason is that I absolutely do not want to overload EIO. You must return a different error back up the stack. You've shown no interest in this in the past, which is also a needless argument. We've given good reasons, and you've pooh-poohed them with bad arguments.

Also, this isn't the data I asked for. I know things can fail slowly. I was asking how it would improve systems running like this. As in "I implemented it, and was able to fail over to this other drive faster" or something like that. Actual drive failure scenarios vary widely, and optimizing for this one failure is unwise. It may be the right optimization, but it may not be. There are lots of tricky edges in this space.

Warner