From owner-freebsd-geom@freebsd.org Fri Nov 24 09:20:07 2017
From: bugzilla-noreply@freebsd.org
Date: Fri, 24 Nov 2017 09:20:07 +0000
To: freebsd-geom@FreeBSD.org
Subject: [Bug 223838] g_bio_clone vs g_bio_duplicate

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223838

Andriy Gapon changed:

           What            |Removed                  |Added
----------------------------------------------------------------------------
                 CC        |                         |pjd@FreeBSD.org
           Assignee        |freebsd-bugs@FreeBSD.org |freebsd-geom@FreeBSD.org

-- 
You are receiving this mail because:
You are the assignee for the bug.

From owner-freebsd-geom@freebsd.org Fri Nov 24 11:01:26 2017
From: Andriy Gapon
Date: Fri, 24 Nov 2017 12:30:30 +0200
To: freebsd-fs@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

https://reviews.freebsd.org/D13224

Anyone interested is welcome to join the review.
Thanks.

-- 
Andriy Gapon
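The patch itself is not reproduced in this thread. Purely as an illustration
of the semantics being discussed, and not the actual D13224 change, a
lowest-layer error path honoring such a hint might behave like the following
user-space model; struct io_req, IOF_NORETRY, submit_io() and the retry
budget are all invented for the example:

    /*
     * Model only: a per-request "no retry" flag that the bottom layer
     * honors by failing fast instead of exhausting its retry budget.
     */
    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define IOF_NORETRY    0x01  /* caller will handle recovery itself */
    #define DRIVER_RETRIES 4     /* stand-in for a driver's retry budget */

    struct io_req {
        unsigned flags;
        int      error;          /* 0 on success, an errno otherwise */
    };

    static int attempts;         /* counts trips to the "hardware" */

    /* One attempt at the medium; models a block the drive cannot read. */
    static bool attempt_once(void)
    {
        attempts++;
        return false;
    }

    static void submit_io(struct io_req *req)
    {
        int tries = (req->flags & IOF_NORETRY) ? 1 : 1 + DRIVER_RETRIES;

        req->error = 0;
        for (int i = 0; i < tries; i++)
            if (attempt_once())
                return;
        req->error = EIO;        /* with IOF_NORETRY this comes back after one attempt */
    }

    int main(void)
    {
        struct io_req fast = { .flags = IOF_NORETRY };
        struct io_req slow = { .flags = 0 };

        submit_io(&fast);
        printf("fail-fast: error=%d after %d attempt(s)\n", fast.error, attempts);
        attempts = 0;
        submit_io(&slow);
        printf("default:   error=%d after %d attempt(s)\n", slow.error, attempts);
        return 0;
    }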
From owner-freebsd-geom@freebsd.org Fri Nov 24 13:08:10 2017
From: Warner Losh
Date: Fri, 24 Nov 2017 06:08:08 -0700
To: Andriy Gapon
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon wrote:
>
> https://reviews.freebsd.org/D13224
>
> Anyone interested is welcome to join the review.

I think it's a really bad idea. It introduces a 'one-size-fits-all' notion
of QoS that seems misguided. It conflates a shorter timeout with don't
retry. And why is retrying bad? It seems more a notion of 'fail fast' or
some other concept. There's so many other ways you'd want to use it. And
it uses the same return code (EIO) to mean something new. It has generally
meant 'the lower layers have retried this, and it failed, do not submit it
again as it will not succeed'; this conflates it with 'I gave it a
half-assed attempt, and that failed, but resubmission might work'. This
breaks a number of assumptions in the BUF/BIO layer as well as parts of
CAM even more than they are broken now.

So let's step back a bit: what problem is it trying to solve?
Warner

From owner-freebsd-geom@freebsd.org Fri Nov 24 13:41:27 2017
From: Andriy Gapon
Date: Fri, 24 Nov 2017 15:34:36 +0200
To: Warner Losh
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On 24/11/2017 15:08, Warner Losh wrote:
> I think it's a really bad idea. It introduces a 'one-size-fits-all'
> notion of QoS that seems misguided. It conflates a shorter timeout with
> don't retry. [...]
>
> So let's step back a bit: what problem is it trying to solve?

A simple example. I have a mirror, I issue a read to one of its members.
Let's assume there is some trouble with that particular block on that
particular disk. The disk may spend a lot of time trying to read it and
would still fail. With the current defaults I would wait 5x that time to
finally get the error back. Then I go to another mirror member and get my
data from there.

IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read,
get the error back sooner and try the other disk sooner. Only if I know
that there are no other copies to try, then I would use the normal read
with all the retrying.

-- 
Andriy Gapon
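The read policy described here can be sketched in a few lines of user-space
C. This is only a model of the idea, with hypothetical names (member_read,
IOF_NORETRY), not vdev_geom code: try every copy with the fail-fast hint
first, and only fall back to fully retried reads when no copy remains.

    #include <errno.h>
    #include <stdio.h>

    #define IOF_NORETRY 0x01
    #define NMEMBERS    2

    /* Stand-in for a member read: member 0 has a bad block, member 1 is fine. */
    static int member_read(int member, unsigned flags, char *buf)
    {
        (void)flags;
        if (member == 0)
            return EIO;      /* with IOF_NORETRY this comes back quickly */
        buf[0] = 'X';        /* pretend we read the data */
        return 0;
    }

    static int mirror_read(char *buf)
    {
        /* First pass: fail fast, so a healthy member gets a chance sooner. */
        for (int m = 0; m < NMEMBERS; m++)
            if (member_read(m, IOF_NORETRY, buf) == 0)
                return 0;

        /* Last resort: no other copies left, retry each member the slow way. */
        for (int m = 0; m < NMEMBERS; m++)
            if (member_read(m, 0, buf) == 0)
                return 0;
        return EIO;
    }

    int main(void)
    {
        char buf[1] = { 0 };
        printf("mirror_read: %s\n", mirror_read(buf) == 0 ? "ok" : "EIO");
        return 0;
    }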
From owner-freebsd-geom@freebsd.org Fri Nov 24 14:58:05 2017
From: Scott Long
Date: Fri, 24 Nov 2017 07:57:55 -0700
To: Andriy Gapon
Cc: Warner Losh, FreeBSD FS, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

> On Nov 24, 2017, at 6:34 AM, Andriy Gapon wrote:
>
> A simple example. I have a mirror, I issue a read to one of its members.
> Let's assume there is some trouble with that particular block on that
> particular disk. The disk may spend a lot of time trying to read it and
> would still fail. With the current defaults I would wait 5x that time to
> finally get the error back. Then I go to another mirror member and get
> my data from there.

There are many RAID stacks that already solve this problem by having a
policy of always reading all disk members for every transaction, and
throwing away the sub-transactions that arrive late. It's not a policy
that is always desired, but it serves a useful purpose for low-latency
needs.

> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read,
> get the error back sooner and try the other disk sooner. Only if I know
> that there are no other copies to try, then I would use the normal read
> with all the retrying.

I agree with Warner that what you are proposing is not correct. It
weakens the contract between the disk layer and the upper layers, making
it less clear who is responsible for retries and less clear what "EIO"
means. That contract is already weak due to poor design decisions in
VFS-BIO and GEOM, and Warner and I are working on a plan to fix that.

Scott
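The "read every member, keep the first answer" policy Scott mentions can be
modeled with one private buffer per member, so that late completions are
simply discarded. This is a hypothetical user-space sketch, not code from
any existing RAID stack:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NMEMBERS 3
    #define BUFSZ    16

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
    static int  winner = -1;
    static char result[BUFSZ];

    static void *member_read(void *arg)
    {
        int  m = (int)(long)arg;
        char priv[BUFSZ];                           /* private buffer per member */

        usleep((useconds_t)(NMEMBERS - m) * 50000); /* member 2 answers first */
        snprintf(priv, sizeof(priv), "data-from-%d", m);

        pthread_mutex_lock(&lock);
        if (winner < 0) {                           /* first completion wins */
            winner = m;
            memcpy(result, priv, sizeof(result));
            pthread_cond_signal(&done);
        }                                           /* later arrivals are discarded */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NMEMBERS];

        for (long m = 0; m < NMEMBERS; m++)
            pthread_create(&tid[m], NULL, member_read, (void *)m);

        pthread_mutex_lock(&lock);
        while (winner < 0)
            pthread_cond_wait(&done, &lock);
        pthread_mutex_unlock(&lock);
        printf("first answer from member %d: %s\n", winner, result);

        for (int m = 0; m < NMEMBERS; m++)          /* reap the stragglers */
            pthread_join(tid[m], NULL);
        return 0;
    }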
From owner-freebsd-geom@freebsd.org Fri Nov 24 16:33:58 2017
From: Warner Losh
Date: Fri, 24 Nov 2017 09:33:56 -0700
To: Andriy Gapon
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon wrote:
> [...]
> A simple example. I have a mirror, I issue a read to one of its members.
> [...] IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first
> read, get the error back sooner and try the other disk sooner. Only if I
> know that there are no other copies to try, then I would use the normal
> read with all the retrying.

It sounds like you are optimizing the wrong thing and taking an overly
simplistic view of quality of service.

First, failing blocks on a disk is fairly rare. Do you really want to
optimize for that case?

Second, you're really saying 'if you can't read it fast, fail', since we
only control the software side of read retry. There's new op codes being
proposed that say 'read or fail within X ms', which is really what you
want: if it's taking too long on disk A you want to move to disk B. The
notion here was we'd return EAGAIN (or some other error) if it failed
after X ms, and maybe do some emulation in software for drives that don't
support this. You'd tweak this number to control performance. You're
likely to get a much bigger performance win all the time by scheduling
I/O to drives that have the best recent latency.

Third, do you have numbers that show this is actually a win? This is a
terrible thing from an architectural view. Absent numbers that show it's
a big win, I'm very hesitant to say OK.

Fourth, there's a large number of places in the stack today that need to
communicate that their I/O is more urgent, and we don't have any good way
to communicate even that simple concept down the stack.

Finally, the only places that ZFS uses the TRYHARDER flag are for things
like the super block, if I'm reading the code right. It doesn't do it for
normal I/O. There's no code to cope with what would happen if all the
copies of a block couldn't be read with the NORETRY flag. One of them
might contain the data.

Warner
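Warner's point about scheduling reads to the drive with the best recent
latency can be modeled with a decaying average per member. The following toy
sketch is illustrative only and does not correspond to actual ZFS or CAM
code; the names and constants are made up:

    #include <stdio.h>

    #define NMEMBERS 2
    #define ALPHA    0.3     /* weight of the newest latency sample */

    static double ewma_ms[NMEMBERS] = { 1.0, 1.0 };

    /* Fold a measured completion time into the member's running estimate. */
    static void record_latency(int m, double sample_ms)
    {
        ewma_ms[m] = ALPHA * sample_ms + (1.0 - ALPHA) * ewma_ms[m];
    }

    /* Pick the member that has been answering fastest lately. */
    static int pick_member(void)
    {
        int best = 0;
        for (int m = 1; m < NMEMBERS; m++)
            if (ewma_ms[m] < ewma_ms[best])
                best = m;
        return best;
    }

    int main(void)
    {
        /* Member 0 starts misbehaving: a few slow completions shift traffic. */
        double samples0[] = { 1.2, 80.0, 120.0, 150.0 };
        for (int i = 0; i < 4; i++) {
            record_latency(0, samples0[i]);
            record_latency(1, 1.1);
            printf("after sample %d, reads go to member %d\n", i, pick_member());
        }
        return 0;
    }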
From owner-freebsd-geom@freebsd.org Fri Nov 24 17:18:05 2017
From: Andriy Gapon
Date: Fri, 24 Nov 2017 19:17:58 +0200
To: Scott Long
Cc: Warner Losh, FreeBSD FS, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On 24/11/2017 16:57, Scott Long wrote:
>> A simple example. I have a mirror, I issue a read to one of its members.
>> [...] Then I go to another mirror member and get my data from there.
>
> There are many RAID stacks that already solve this problem by having a
> policy of always reading all disk members for every transaction, and
> throwing away the sub-transactions that arrive late. It's not a policy
> that is always desired, but it serves a useful purpose for low-latency
> needs.

That's another possible and useful strategy.

>> IMO, this is not optimal. [...]
>
> I agree with Warner that what you are proposing is not correct. It
> weakens the contract between the disk layer and the upper layers, making
> it less clear who is responsible for retries and less clear what "EIO"
> means. That contract is already weak due to poor design decisions in
> VFS-BIO and GEOM, and Warner and I are working on a plan to fix that.

Well... I do realize now that there is some problem in this area, both
you and Warner mentioned it. But knowing that it exists is not the same
as knowing what it is :-)
I understand that it could be rather complex and not easy to describe in
a short email...

But then, this flag is optional, it's off by default and no one is forced
to use it. If it's used only by ZFS, then it would not be horrible.
Unless it makes things very hard for the infrastructure.
But I am circling back to not knowing what problem(s) you and Warner are
planning to fix.

-- 
Andriy Gapon
From owner-freebsd-geom@freebsd.org Fri Nov 24 17:21:01 2017
From: Andriy Gapon
Date: Fri, 24 Nov 2017 19:20:51 +0200
To: Warner Losh
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On 24/11/2017 18:33, Warner Losh wrote:
> [...]
> It sounds like you are optimizing the wrong thing and taking an overly
> simplistic view of quality of service.
> First, failing blocks on a disk is fairly rare. Do you really want to
> optimize for that case?

If it can be done without any harm to the sunny day scenario, then why
not?
I think that 'robustness' is the word here, not 'optimization'.

> Second, you're really saying 'if you can't read it fast, fail', since we
> only control the software side of read retry.

Am I?
That's not what I wanted to say, really. I just wanted to say, if this
I/O fails, don't retry it, leave it to me.
This is very simple, simplistic as you say, but I like simple.

> There's new op codes being proposed that say 'read or fail within X ms',
> which is really what you want: if it's taking too long on disk A you
> want to move to disk B. The notion here was we'd return EAGAIN (or some
> other error) if it failed after X ms, and maybe do some emulation in
> software for drives that don't support this. You'd tweak this number to
> control performance. You're likely to get a much bigger performance win
> all the time by scheduling I/O to drives that have the best recent
> latency.

ZFS already does some latency based decisions.
The things that you describe are very interesting, but they are for the
future.

> Third, do you have numbers that show this is actually a win?

I do not have any numbers right now.
What kind of numbers would you like? What kind of scenarios?

> This is a terrible thing from an architectural view.

You have said this several times, but unfortunately you haven't explained
it yet.

> Absent numbers that show it's a big win, I'm very hesitant to say OK.
>
> Fourth, there's a large number of places in the stack today that need to
> communicate that their I/O is more urgent, and we don't have any good
> way to communicate even that simple concept down the stack.

That's unfortunate, but my proposal has quite little to do with I/O
scheduling, priorities, etc.

> Finally, the only places that ZFS uses the TRYHARDER flag are for things
> like the super block, if I'm reading the code right. It doesn't do it
> for normal I/O.

Right. But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in
the same way as ZIO_FLAG_TRYHARD.

> There's no code to cope with what would happen if all the copies of a
> block couldn't be read with the NORETRY flag. One of them might contain
> the data.

ZFS is not that fragile :)  see ZIO_FLAG_IO_RETRY above.

-- 
Andriy Gapon
From owner-freebsd-geom@freebsd.org Sat Nov 25 10:54:06 2017
From: Scott Long
Date: Sat, 25 Nov 2017 03:54:01 -0700
To: Andriy Gapon
Cc: Warner Losh, FreeBSD FS, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

> On Nov 24, 2017, at 10:17 AM, Andriy Gapon wrote:
>
>>> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first
>>> read, get the error back sooner and try the other disk sooner. Only
>>> if I know that there are no other copies to try, then I would use the
>>> normal read with all the retrying.
>>
>> I agree with Warner that what you are proposing is not correct. It
>> weakens the contract between the disk layer and the upper layers,
>> making it less clear who is responsible for retries and less clear
>> what "EIO" means. That contract is already weak due to poor design
>> decisions in VFS-BIO and GEOM, and Warner and I are working on a plan
>> to fix that.
>
> Well... I do realize now that there is some problem in this area, both
> you and Warner mentioned it. But knowing that it exists is not the same
> as knowing what it is :-)
> I understand that it could be rather complex and not easy to describe in
> a short email...

There are too many questions to ask, I will do my best to keep the
conversation logical. First, how do you propose to distinguish between
EIO due to a lengthy set of timeouts, vs EIO due to an immediate error
returned by the disk hardware? CAM has an extensive table-driven error
recovery protocol whose purpose is to decide whether or not to do retries
based on hardware state information that is not made available to the
upper layers. Do you have a test case that demonstrates the problem that
you're trying to solve? Maybe the error recovery table is wrong and
you're encountering a case that should not be retried. If that's what's
going on, we should fix CAM instead of inventing a new work-around.

Second, what about disk subsystems that do retries internally, out of the
control of the FreeBSD driver? This would include most hardware RAID
controllers. Should what you are proposing only work for a subset of the
kinds of storage systems that are available and in common use?

Third, let's say that you run out of alternate copies to try, and as you
stated originally, that will force you to retry the copies that had
returned EIO. How will you know when you can retry? How will you know
how many times you will retry? How will you know that a retry is even
possible? Should the retries be able to be canceled?

Why is overloading EIO so bad? brelse() will call bdirty() when a
BIO_WRITE command has failed with EIO. Calling bdirty() has the effect of
retrying the I/O. This disregards the fact that disk drivers only return
EIO when they've decided that the I/O cannot be retried. It has no
termination condition for the retries, and will endlessly retry I/O in
vain; I've seen this quite frequently. It also disregards the fact that
I/O marked as B_PAGING can't be retried in this fashion, and will trigger
a panic. Because we pretend that EIO can be retried, we are left with a
system that is very fragile when I/O actually does fail. Instead of
adding more special cases and blurred lines, I want to go back to
enforcing strict contracts between the layers and force the core parts of
the system to respect those contracts and handle errors properly, instead
of just retrying and hoping for the best.

> But then, this flag is optional, it's off by default and no one is
> forced to use it. If it's used only by ZFS, then it would not be
> horrible.
> Unless it makes things very hard for the infrastructure.
> But I am circling back to not knowing what problem(s) you and Warner are
> planning to fix.
Saying that a feature is optional means nothing; while consumers of the
API might be able to ignore it, the producers of the API cannot ignore
it. It is these producers who are sick right now and should be fixed,
instead of creating new ways to get even more sick.

Scott
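The failure mode Scott describes, where a write that the driver has already
declared unrecoverable is redirtied and retried forever, can be contrasted
with a bounded policy in a small model. The names below are hypothetical and
this is not FreeBSD's buffer-cache code:

    #include <errno.h>
    #include <stdio.h>

    #define MAX_WRITE_RETRIES 3

    /* Stand-in for the driver: it already did its own retries and gave up. */
    static int driver_write(const char *buf)
    {
        (void)buf;
        return EIO;              /* permanent failure, resubmission won't help */
    }

    /* Flush one dirty buffer; redirtying forever would never terminate here. */
    static int flush_buffer(const char *buf)
    {
        for (int attempt = 0; attempt <= MAX_WRITE_RETRIES; attempt++) {
            if (driver_write(buf) == 0)
                return 0;
            fprintf(stderr, "write attempt %d failed\n", attempt + 1);
        }
        return EIO;              /* surface the error so the FS can react */
    }

    int main(void)
    {
        if (flush_buffer("payload") != 0)
            fprintf(stderr, "giving up: data could not be written\n");
        return 0;
    }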
From owner-freebsd-geom@freebsd.org Sat Nov 25 11:37:35 2017
From: "Poul-Henning Kamp"
Date: Sat, 25 Nov 2017 11:37:10 +0000
To: Scott Long
Cc: Andriy Gapon, FreeBSD FS, Warner Losh, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

--------
Scott Long writes:

> Why is overloading EIO so bad? brelse() will call bdirty() when a
> BIO_WRITE command has failed with EIO. Calling bdirty() has the effect
> of retrying the I/O. This disregards the fact that disk drivers only
> return EIO when they've decided that the I/O cannot be retried. It has
> no termination condition for the retries, and will endlessly retry I/O
> in vain; I've seen this quite frequently.

The really annoying thing about this particular class of errors is that
if we propagated them up to the filesystems, very often things could be
relocated to different blocks and we would avoid the unnecessary
filesystem corruption.

The real fundamental deficiency is that we do not have a way to say "give
up if this bio cannot be completed in X time", which is what people
actually want.

That is surprisingly hard to provide, there are far too many corner-cases
for me to enumerate them all, but let me just give one example:

Imagine you issue a deadlined write to a RAID5 thing. Three component
writes happen smoothly, but the last two fail the deadline, with no way
to predict how long it will take before they complete or fail.

* Does the bio write transaction fail ?
* Does the bio write transaction time out ?
* Do you attempt to complete the write to the RAID5 ?
* Where do you store a copy of the data if you do ?
* What happens next time a read happens on this bio's extent ?

Then for an encore, imagine it was a read bio: Three DMAs go smoothly,
two are outstanding and you don't know if/when they will complete/fail.

* If you fail or time out the bio, how do you "taint" the space being
  read into while the two remaining DMAs are outstanding ?
* What if that space is mapped into userland ?
* What if that space is being executed ?
* What if one of the two outstanding DMAs later returns garbage ?

My conclusion back when I did GEOM, was that the only way to do something
like this sanely, is to have a special GEOM do it for you, which always
allocates a temp-space:

	allocate temp buffer
	if (write)
		copy write data to temp buffer
	issue bio downwards on temp buffer
	if timeout
		park temp buffer until biodone
		return(timeout)
	if (read)
		copy temp buffer to read space
	return (ok/error)

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
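The temp-buffer scheme above can be made a little more concrete with a
user-space model: the "device" only ever writes into a privately owned
buffer, so a request that misses its deadline can be abandoned (parked) and
reclaimed by the completion path later, and the caller's memory is never
exposed to a late DMA. This is an illustrative sketch under those
assumptions, using pthreads, not a GEOM class:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define BUFSZ 512

    struct temp_io {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        int             done;        /* completion path finished */
        int             abandoned;   /* caller gave up; completion frees us */
        char            data[BUFSZ]; /* private temp buffer, never the caller's */
    };

    /* Models the device: completes the read into the temp buffer after a delay. */
    static void *device_read(void *arg)
    {
        struct temp_io *io = arg;

        usleep(200 * 1000);                  /* "slow" medium: 200 ms */
        memset(io->data, 'A', BUFSZ);        /* late DMA lands in temp space only */

        pthread_mutex_lock(&io->mtx);
        io->done = 1;
        int freeit = io->abandoned;          /* did the caller already time out? */
        pthread_cond_signal(&io->cv);
        pthread_mutex_unlock(&io->mtx);

        if (freeit)
            free(io);                        /* parked buffer reclaimed here */
        return NULL;
    }

    /* Read with a deadline; on timeout the caller's buffer is left untouched. */
    static int deadline_read(char *caller_buf, long deadline_ms)
    {
        struct temp_io *io = calloc(1, sizeof(*io));
        pthread_mutex_init(&io->mtx, NULL);
        pthread_cond_init(&io->cv, NULL);

        pthread_t t;
        pthread_create(&t, NULL, device_read, io);
        pthread_detach(t);

        struct timespec until;
        clock_gettime(CLOCK_REALTIME, &until);
        until.tv_sec  += deadline_ms / 1000;
        until.tv_nsec += (deadline_ms % 1000) * 1000000L;
        if (until.tv_nsec >= 1000000000L) {
            until.tv_sec++;
            until.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&io->mtx);
        while (!io->done) {
            if (pthread_cond_timedwait(&io->cv, &io->mtx, &until) != 0)
                break;                       /* deadline expired */
        }
        if (!io->done) {
            io->abandoned = 1;               /* park: completion path will free */
            pthread_mutex_unlock(&io->mtx);
            return -1;                       /* "timeout" back to the consumer */
        }
        pthread_mutex_unlock(&io->mtx);
        memcpy(caller_buf, io->data, BUFSZ); /* only now touch the caller buffer */
        free(io);
        return 0;
    }

    int main(void)
    {
        char buf[BUFSZ] = { 0 };
        printf("50 ms deadline: %s\n", deadline_read(buf, 50) ? "timeout" : "ok");
        sleep(1);                            /* let the parked I/O drain */
        return 0;
    }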
From owner-freebsd-geom@freebsd.org Sat Nov 25 16:25:06 2017
From: Warner Losh
Date: Sat, 25 Nov 2017 09:25:04 -0700
To: Andriy Gapon
Cc: Scott Long, FreeBSD FS, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon wrote:
> [...]
> But then, this flag is optional, it's off by default and no one is
> forced to use it. If it's used only by ZFS, then it would not be
> horrible.

Except that it isn't the same flag as what Solaris has (its B_FAILFAST
does something different: it isn't about limiting retries but about
failing ALL the queued I/O for a unit, not just trying one retry), and
the problems that it solves are quite rare. And if you return a different
errno, then the EIO contract is still fulfilled.

> Unless it makes things very hard for the infrastructure.
> But I am circling back to not knowing what problem(s) you and Warner are
> planning to fix.

The middle layers of the I/O system are a bit fragile in the face of I/O
errors. We're fixing that. Of course, you still haven't articulated why
this approach would be better, nor shown any numbers as to how it makes
things better.

Warner
AGs4zMaOq0NE2XAnPb2aSMAenvQYIra9r9I9OqDGjqHPMAzZj/hOPvKqt8rkqyfWTC0aR1H3wfmt/f6FieY5jmzAZpU= X-Received: by 10.107.30.81 with SMTP id e78mr22143577ioe.130.1511627790118; Sat, 25 Nov 2017 08:36:30 -0800 (PST) MIME-Version: 1.0 Sender: wlosh@bsdimp.com Received: by 10.79.108.204 with HTTP; Sat, 25 Nov 2017 08:36:29 -0800 (PST) X-Originating-IP: [2603:300b:6:5100:9579:bb73:7b7f:aadd] In-Reply-To: References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> From: Warner Losh Date: Sat, 25 Nov 2017 09:36:29 -0700 X-Google-Sender-Auth: cPV8WY30lXU33pTYiZKqTUQRXtA Message-ID: Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Andriy Gapon Cc: FreeBSD FS , freebsd-geom@freebsd.org, Scott Long Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.25 X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 16:36:31 -0000 On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon wrote: > On 24/11/2017 18:33, Warner Losh wrote: > > > > > > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon > > wrote: > > > > On 24/11/2017 15:08, Warner Losh wrote: > > > > > > > > > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > > > >> wrote: > > > > > > > > > https://reviews.freebsd.org/D13224 > > D13224 > > > > > > > > > Anyone interested is welcome to join the review. > > > > > > > > > I think it's a really bad idea. It introduces a > 'one-size-fits-all' notion of > > > QoS that seems misguided. It conflates a shorter timeout with > don't retry. And > > > why is retrying bad? It seems more a notion of 'fail fast' or so > other concept. > > > There's so many other ways you'd want to use it. And it uses the > same return > > > code (EIO) to mean something new. It's generally meant 'The lower > layers have > > > retried this, and it failed, do not submit it again as it will not > succeed' with > > > 'I gave it a half-assed attempt, and that failed, but resubmission > might work'. > > > This breaks a number of assumptions in the BUF/BIO layer as well > as parts of CAM > > > even more than they are broken now. > > > > > > So let's step back a bit: what problem is it trying to solve? > > > > A simple example. I have a mirror, I issue a read to one of its > members. Let's > > assume there is some trouble with that particular block on that > particular disk. > > The disk may spend a lot of time trying to read it and would still > fail. With > > the current defaults I would wait 5x that time to finally get the > error back. > > Then I go to another mirror member and get my data from there. > > IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first > read, get > > the error back sooner and try the other disk sooner. Only if I know > that there > > are no other copies to try, then I would use the normal read with > all the > > retrying. > > > > > > It sounds like you are optimizing the wrong thing and taking an overly > > simplistic view of quality of service. > > First, failing blocks on a disk is fairly rare. Do you really want to > optimize > > for that case? > > If it can be done without any harm to the sunny day scenario, then why not? > I think that 'robustness' is the word here, not 'optimization'. I fail to see how it is a robustness issue. You've not made that case. 
You want the I/O to fail fast so you can give another disk a shot sooner. That's optimization. > Second, you're really saying 'If you can't read it fast, fail" since we > only > > control the software side of read retry. > > Am I? > That's not what I wanted to say, really. I just wanted to say, if this I/O > fails, don't retry it, leave it to me. > This is very simple, simplistic as you say, but I like simple. Right. Simple doesn't make it right. In fact, simple often makes it wrong. We have big issues with the nvd device today because it's mindlessly queues all the trim requests to the NVMe device w/o collapsing them, resulting in horrible performance. > There's new op codes being proposed > > that say 'read or fail within Xms' which is really what you want: if > it's taking > > too long on disk A you want to move to disk B. The notion here was we'd > return > > EAGAIN (or some other error) if it failed after Xms, and maybe do some > emulation > > in software for drives that don't support this. You'd tweak this number > to > > control performance. You're likely to get a much bigger performance win > all the > > time by scheduling I/O to drives that have the best recent latency. > > ZFS already does some latency based decisions. > The things that you describe are very interesting, but they are for the > future. > > > Third, do you have numbers that show this is actually a win? > > I do not have any numbers right now. > What kind of numbers would you like? What kind of scenarios? The usual kind. How is latency for I/O improved when you have a disk with a few failing sectors that take a long time to read (which isn't a given: some sectors fail fast). What happens when you have a failed disk? etc. How does this compare with the current system. Basically, how do you know this will really make things better and isn't some kind of 'feel good' thing about 'doing something clever' about the problem that may actually make things worse. > This is a terrible > > thing from an architectural view. > > You have said this several times, but unfortunately you haven't explained > it yet. I have explained it. You weren't listening. 1. It breaks the EIO contract that's currently in place. 2. It presumes to know what kind of retries should be done at the upper layers where today we have a system that's more black and white. You don't know the same info the low layers have to know whether to try another drive, or just retry this one. 3. It assumes that retries are the source of latency in the system. they aren't necessarily. 4. It assumes retries are necessarily slow: they may be, they might not be. All depends on the drive (SSDs repeated I/O are often faster than actual I/O). 5. It's just one bit when you really need more complex nuances to get good QoE out of the I/O system. Retries is an incidental detail that's not that important, while latency is what you care most about minimizing. You wouldn't care if I tried to read the data 20 times if it got the result faster than going to a different drive. 6. It's putting the wrong kind of specific hints into the mix. > Absent numbers that show it's a big win, I'm > > very hesitant to say OK. > > > > Forth, there's a large number of places in the stack today that need to > > communicate their I/O is more urgent, and we don't have any good way to > > communicate even that simple concept down the stack. > > That's unfortunately, but my proposal has quite little to do with I/O > scheduling, priorities, etc. Except it does. 
It dictates error recovery policy which is I/O scheduling. > Finally, the only places that ZFS uses the TRYHARDER flag are for things > like > > the super block if I'm reading the code right. It doesn't do it for > normal I/O. > > Right. But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in > the > same way as ZIO_FLAG_TRYHARD. > > > There's no code to cope with what would happen if all the copies of a > block > > couldn't be read with the NORETRY flag. One of them might contain the > data. > > ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above. > Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue when an I/O fails' It doesn't control retries at all in Solaris. It's a different concept entirely, and one badly thought out. Warner From owner-freebsd-geom@freebsd.org Sat Nov 25 16:58:38 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 49CC2DEADA2; Sat, 25 Nov 2017 16:58:38 +0000 (UTC) (envelope-from agapon@gmail.com) Received: from mail-lf0-f43.google.com (mail-lf0-f43.google.com [209.85.215.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id EF135662E9; Sat, 25 Nov 2017 16:58:37 +0000 (UTC) (envelope-from agapon@gmail.com) Received: by mail-lf0-f43.google.com with SMTP id g35so28372631lfi.13; Sat, 25 Nov 2017 08:58:37 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=sNmBbqSvUdx+P0oBQz+HZ0eKlahQ3p0XcaiUMbppq94=; b=ZbYvg6jV7ktmRiS8wYCaHCHERogGTOY8Npky+qYT3+jp5cvvlWMDTohvMf13T44Fdv k856mLl06msqEevycDCZqvaFy0z5P3D6LMuQ2Rn7DVPWeGsye+F4EWW6zh3Y7gxBsger JFKzBWv/XiRArbdOYW5TR8jZK1Lx1D5qXkXR76XQ58heVIThtOfG7cUcC5zx/5TQeIwQ cp4kxE1gfRKxSKmH/RSi+9hnLBUtUa/xjZEWYEwIFnLTo7pv+BDKliFtJ1YR/B+RCb9j dzz236S/DUb0Eq3E4iKj0A33npuswtiw1G1Gs4F+6Nd3eCdYW2xuzSt1y8Q3X52P2V5P gf6w== X-Gm-Message-State: AJaThX4KWIf7Tpc0YteQA1eyDsSBQ5luze6tAgx0c7ZK5PjfjH7VPL1F Ad3nYuZRatRAfec5FodPk03zwUJoS3g= X-Google-Smtp-Source: AGs4zMaPazeEKorA20LgfmuxaRUD4LZnRTIi/eMwruMrxb9Wv325OBtJQl3mgpdxVZyMEiz1uxJGig== X-Received: by 10.25.20.77 with SMTP id k74mr9606078lfi.80.1511629110179; Sat, 25 Nov 2017 08:58:30 -0800 (PST) Received: from [192.168.0.88] (east.meadow.volia.net. 
[93.72.151.96]) by smtp.googlemail.com with ESMTPSA id v12sm5027560ljd.15.2017.11.25.08.58.28 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 25 Nov 2017 08:58:29 -0800 (PST) Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Warner Losh Cc: Scott Long , FreeBSD FS , freebsd-geom@freebsd.org References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> From: Andriy Gapon Message-ID: <27c9395f-5b3c-a062-3aee-de591770af0b@FreeBSD.org> Date: Sat, 25 Nov 2017 18:58:27 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 16:58:38 -0000 On 25/11/2017 18:25, Warner Losh wrote: > > > On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon > wrote: > > On 24/11/2017 16:57, Scott Long wrote: > > > > > >> On Nov 24, 2017, at 6:34 AM, Andriy Gapon wrote: > >> > >> On 24/11/2017 15:08, Warner Losh wrote: > >>> > >>> > >>> On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > >>> >> wrote: > >>> > >>> > >>>    https://reviews.freebsd.org/D13224 > > > >>> > >>>    Anyone interested is welcome to join the review. > >>> > >>> > >>> I think it's a really bad idea. It introduces a 'one-size-fits-all' > notion of > >>> QoS that seems misguided. It conflates a shorter timeout with don't > retry. And > >>> why is retrying bad? It seems more a notion of 'fail fast' or so other > concept. > >>> There's so many other ways you'd want to use it. And it uses the same return > >>> code (EIO) to mean something new. It's generally meant 'The lower layers > have > >>> retried this, and it failed, do not submit it again as it will not > succeed' with > >>> 'I gave it a half-assed attempt, and that failed, but resubmission might > work'. > >>> This breaks a number of assumptions in the BUF/BIO layer as well as > parts of CAM > >>> even more than they are broken now. > >>> > >>> So let's step back a bit: what problem is it trying to solve? > >> > >> A simple example.  I have a mirror, I issue a read to one of its > members.  Let's > >> assume there is some trouble with that particular block on that > particular disk. > >> The disk may spend a lot of time trying to read it and would still fail.  > With > >> the current defaults I would wait 5x that time to finally get the error back. > >> Then I go to another mirror member and get my data from there. > > > > There are many RAID stacks that already solve this problem by having a policy > > of always reading all disk members for every transaction, and throwing > away the > > sub-transactions that arrive late.  It’s not a policy that is always > desired, but it > > serves a useful purpose for low-latency needs. > > That's another possible and useful strategy. > > >> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first read, get > >> the error back sooner and try the other disk sooner.  Only if I know that there > >> are no other copies to try, then I would use the normal read with all the retrying. > >> > > > > I agree with Warner that what you are proposing is not correct.  
It weakens the > > contract between the disk layer and the upper layers, making it less clear who is > > responsible for retries and less clear what “EIO” means.  That contract is already > > weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I > > are working on a plan to fix that. > > Well...  I do realize now that there is some problem in this area, both you and > Warner mentioned it.  But knowing that it exists is not the same as knowing what > it is :-) > I understand that it could be rather complex and not easy to describe in a short > email... > > But then, this flag is optional, it's off by default and no one is forced to > used it.  If it's used only by ZFS, then it would not be horrible. > > > Except that it isn't the same flag as what Solaris has (its B_FAILFAST does > something different: it isn't about limiting retries but about failing ALL the > queued I/O for a unit, not just trying one retry), and the problems that it > solves are quite rare. And if you return a different errno, then the EIO > contract is still fulfilled.  Yes, it isn't the same. I think that illumos flag does even more. > Unless it makes things very hard for the infrastructure. > But I am circling back to not knowing what problem(s) you and Warner are > planning to fix. > > > The middle layers of the I/O system are a bit fragile in the face of I/O errors. > We're fixing that. What are the middle layers? > Of course, you still haven't articulated why this approach would be better Better than what? > nor > show any numbers as to how it makes things better. By now, I have. See my reply to Scott's email. -- Andriy Gapon From owner-freebsd-geom@freebsd.org Sat Nov 25 17:36:31 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9894ADEBB29; Sat, 25 Nov 2017 17:36:31 +0000 (UTC) (envelope-from agapon@gmail.com) Received: from mail-lf0-f45.google.com (mail-lf0-f45.google.com [209.85.215.45]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 173D96767F; Sat, 25 Nov 2017 17:36:30 +0000 (UTC) (envelope-from agapon@gmail.com) Received: by mail-lf0-f45.google.com with SMTP id k66so28461669lfg.3; Sat, 25 Nov 2017 09:36:30 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=iOl1D78pVkm1p2WO2iSswIdicfhFfzoRB2Bc9x8J0XA=; b=FwAhgIqykPRVE9Q778D+OPfmyn95E5zd8SIsartEGdWg1Ssj9VpFs/Gvcwq5N5rI9H 8jCfpIwC2dfgsmCoT7YdNA79XbZLsWf5M6TxrkJ5wTl7EiCNWYVFQHmxM/tN758mxqGU CXD6NL6lpvPFgTIjKy8/3ruVgDirqO4wOwg+LFRlaXftY4mRdtYkpTaTZGIuE3sk1fRS tMSOrAqrkxaNCqFBIDnxdKwu19Mqw2DTfK9+fcU3dNtKwo/tyRcsZVseXfphZ/t0lpFo uTO7GWHARKCOSkjmUCmBoJwIxNjPyzyNYRU+oq2OoJexdlVuqRKjQrMZexOnAD8HI3ZU Aplg== X-Gm-Message-State: AJaThX4VvLGCfaJtIBGAuRmop0Jb1Xk7CKTLNuZkpUJ67LBcdxotKHfj 520NpS6rN+tiPce5ICUf/OhRAutEV/c= X-Google-Smtp-Source: AGs4zMbBpiuwlbgmOFur1itv4JJ+4S3jA/+BP4yojPXM0feB1c+MAnucw6G9lWk0Z6UcVuGQhWCJQQ== X-Received: by 10.46.84.86 with SMTP id y22mr4011842ljd.89.1511631382512; Sat, 25 Nov 2017 09:36:22 -0800 (PST) Received: from [192.168.0.88] (east.meadow.volia.net. 
[93.72.151.96]) by smtp.googlemail.com with ESMTPSA id f66sm1845239lfl.72.2017.11.25.09.36.20 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 25 Nov 2017 09:36:20 -0800 (PST) Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Scott Long Cc: Warner Losh , FreeBSD FS , freebsd-geom@freebsd.org References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> From: Andriy Gapon Message-ID: <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org> Date: Sat, 25 Nov 2017 19:36:19 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:36:31 -0000 On 25/11/2017 12:54, Scott Long wrote: > > >> On Nov 24, 2017, at 10:17 AM, Andriy Gapon wrote: >> >> >>>> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read, get >>>> the error back sooner and try the other disk sooner. Only if I know that there >>>> are no other copies to try, then I would use the normal read with all the retrying. >>>> >>> >>> I agree with Warner that what you are proposing is not correct. It weakens the >>> contract between the disk layer and the upper layers, making it less clear who is >>> responsible for retries and less clear what “EIO” means. That contract is already >>> weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I >>> are working on a plan to fix that. >> >> Well... I do realize now that there is some problem in this area, both you and >> Warner mentioned it. But knowing that it exists is not the same as knowing what >> it is :-) >> I understand that it could be rather complex and not easy to describe in a short >> email… >> > > There are too many questions to ask, I will do my best to keep the conversation > logical. First, how do you propose to distinguish between EIO due to a lengthy > set of timeouts, vs EIO due to an immediate error returned by the disk hardware? At what layer / component? If I am the issuer of the request then I know how I issued that request and what kind of request it was. If I am an intermediate layer, then what do I care. > CAM has an extensive table-driven error recovery protocol who’s purpose is to > decide whether or not to do retries based on hardware state information that is > not made available to the upper layers. Do you have a test case that demonstrates > the problem that you’re trying to solve? Maybe the error recovery table is wrong > and you’re encountering a case that should not be retried. If that’s what’s going on, > we should fix CAM instead of inventing a new work-around. Let's assume that I am talking about the case of not being able to read an HDD sector that is gone bad. Here is a real world example: Jun 16 10:40:18 trant kernel: ahcich0: NCQ error, slot = 20, port = -1 Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. 
ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): Retrying command Jun 16 10:40:20 trant kernel: ahcich0: NCQ error, slot = 22, port = -1 Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): Retrying command Jun 16 10:40:22 trant kernel: ahcich0: NCQ error, slot = 24, port = -1 Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): Retrying command Jun 16 10:40:25 trant kernel: ahcich0: NCQ error, slot = 26, port = -1 Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): Retrying command Jun 16 10:40:27 trant kernel: ahcich0: NCQ error, slot = 28, port = -1 Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): Error 5, Retries exhausted I do not see anything wrong in what CAM / ahci /ata_da did here. They did what I would expect them to do. They tried very hard to get data that I told them I need. Timestamp of the first error is Jun 16 10:40:18. Timestamp of the last error is Jun 16 10:40:27. So, it took additional 9 seconds to finally produce EIO. That disk is a part of a ZFS mirror. If the request was failed after the first attempt, then ZFS would be able to get the data from a good disk much sooner. And don't take me wrong, I do NOT want CAM or GEOM to make that decision by itself. I want ZFS to be able to tell the lower layers when they should try as hard as they normally do and when they should report an I/O error as soon as it happens without any retries. > Second, what about disk subsystems that do retries internally, out of the control > of the FreeBSD driver? This would include most hardware RAID controllers. > Should what you are proposing only work for a subset of the kinds of storage > systems that are available and in common use? Yes. I do not worry about things that are beyond my control. 
Those subsystems would behave as they do now. So, nothing would get worse. > Third, let’s say that you run out of alternate copies to try, and as you stated > originally, that will force you to retry the copies that had returned EIO. How > will you know when you can retry? How will you know how many times you > will retry? How will you know that a retry is even possible? I am continuing to use ZFS as an example. It already has all the logic built in. If all vdev zio-s (requests to mirror members as an example) fail, then their parent zio (a logical read from the mirror) will be retried (by ZFS) and when ZFS retries it sets a special flag (ZIO_FLAG_IO_RETRY) on that zio and on its future child zio-s. Essentially, my answer is you have to program it correctly, there is no magic. > Should the retries > be able to be canceled? I think that this is an orthogonal question. I do not have any answer and I am not ready to discuss this at all. > Why is overloading EIO so bad? brelse() will call bdirty() when a BIO_WRITE > command has failed with EIO. Calling bdirty() has the effect of retrying the I/O. > This disregards the fact that disk drivers only return EIO when they’ve decided > that the I/O cannot be retried. It has no termination condition for the retries, and > will endlessly retry I/O in vain; I’ve seen this quite frequently. It also disregards > the fact that I/O marked as B_PAGING can’t be retried in this fashion, and will > trigger a panic. Because we pretend that EIO can be retried, we are left with > a system that is very fragile when I/O actually does fail. Instead of adding > more special cases and blurred lines, I want to go back to enforcing strict > contracts between the layers and force the core parts of the system to respect > those contracts and handle errors properly, instead of just retrying and > hoping for the best. So, I suggest that the buffer layer (all the b* functions) does not use the proposed flag. Any problems that exist in it should be resolved first. ZFS does not use that layer. >> But then, this flag is optional, it's off by default and no one is forced to >> used it. If it's used only by ZFS, then it would not be horrible. >> Unless it makes things very hard for the infrastructure. >> But I am circling back to not knowing what problem(s) you and Warner are >> planning to fix. >> > > Saying that a feature is optional means nothing; while consumers of the API > might be able to ignore it, the producers of the API cannot ignore it. It is > these producers who are sick right now and should be fixed, instead of > creating new ways to get even more sick. I completely agree. But which producers of the API do you mean specifically? So far, you mentioned only the consumer level problems with the b-layer. Having said all of the above, I must admit one thing. When I proposed BIO_NORETRY I had only the simplest GEOM topology in mind: ZFS -> [partition] -> disk. Now that I start to think about more complex topologies I am not really sure how the flag should be handled by geom-s with complex internal behavior. If that can be implemented reasonably and clearly, if the flag will create a big mess. E.g., things like gmirrors on top of gmirrors and so on. Maybe the flag, if it ever accepted, should never be propagated automatically. Only geom-s that are aware of it should propagate or request it. That should be safer. 
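As an illustration only of that last point (this is a sketch of the idea just described, not code from the D13224 review): a transformation geom could forward the hint strictly opt-in, so that an unaware class never passes it down by accident. The softc field and the BIO_NORETRY bit itself are assumptions here.

#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

struct g_example_softc {
    int sc_noretry_aware;    /* hypothetical opt-in knob */
};

/*
 * Sketch: clone the bio as usual, but copy the proposed BIO_NORETRY
 * hint only if this class has declared that it understands it.
 */
static void
g_example_start(struct bio *bp)
{
    struct g_example_softc *sc = bp->bio_to->geom->softc;
    struct bio *cbp;

    cbp = g_clone_bio(bp);
    if (cbp == NULL) {
        g_io_deliver(bp, ENOMEM);
        return;
    }
    if (sc->sc_noretry_aware)
        cbp->bio_flags |= bp->bio_flags & BIO_NORETRY;
    cbp->bio_done = g_std_done;
    g_io_request(cbp, LIST_FIRST(&bp->bio_to->geom->consumer));
}

With that convention a gmirror-on-gmirror stack degrades gracefully: any layer that does not know about the hint simply issues ordinary, fully retried I/O downstream.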
-- Andriy Gapon From owner-freebsd-geom@freebsd.org Sat Nov 25 17:38:22 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id F0910DEBBEB for ; Sat, 25 Nov 2017 17:38:22 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: from mail-io0-x230.google.com (mail-io0-x230.google.com [IPv6:2607:f8b0:4001:c06::230]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id BB2B6678DE for ; Sat, 25 Nov 2017 17:38:22 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: by mail-io0-x230.google.com with SMTP id d21so10099575ioe.7 for ; Sat, 25 Nov 2017 09:38:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bsdimp-com.20150623.gappssmtp.com; s=20150623; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=3iyg3cV4DFZHCSkAutindZTnEutdchf5thqOLOgiRDE=; b=sBaoG7R6xwKCU2ZqOf+FmtFLSmDy+IGCKl1rYjgYcQVPD8SX74GnlHXJjXAihqhzxB qYO1Nmusy0hdZBTlqRmtawdcgBucaQkU0Oi4tbIfNvRJu/E2quYUNAGNTvPTvGyMqzqF 0o22Xp9QXupfqN2fY2yqRsZsYLnvRU0yDtOe3kzqC5BNYKWHxtKVL84kJ6Cdtxc0IUaw b2zgXXFmsZ6mo28JPH6uzln2ufZRyYjV2DKL4MjiLRhTsiM1opHMTqoHN+68wIuDXMnW Wb6C83vs2nNGaHHQlk6LuX+AFdQqubx+QyjY6BScJsQNCu/TLG5x/xZaaZgF4FG3KN/4 Wzdg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=3iyg3cV4DFZHCSkAutindZTnEutdchf5thqOLOgiRDE=; b=ryfpG30In2aVWFgFUkX7qEAQVgVRczrsLeTiuU5P7HtgVDJRbocgA56to7OUhLJv1K eW/JnwkOyM/mOKwcDaYBw0ErtaLtpAscCZX4KZkrch9SUpwn9dG20KvFaZ6FOS+B30BO ng6z5stY7nK6/lII0DuUia4xb9vfhi6G8dyt6TN+3uiXBPY80P5ka0IP0Txv4oixuPMs w0khl9tFWgLQSwiQEOWkx1bLUGpO3lM0zKRS7GECb06fyz9JRn0iMEiAX95Cfau8gw/Y A66Fl5eO7oysEUAuKxORtJ2z1T5e5bRm5x9RBWTGTgNEkQRhZTnpIJ1/ERGtD1a91cvn pqlA== X-Gm-Message-State: AJaThX40n+fq+h+BUKF/IwQuHui65gbflv3E2orB4hcRFA7ngp22RYqu h/HbtFHDs2eAavYzodywlgciNqJehq0yoMYFO6j7rg== X-Google-Smtp-Source: AGs4zMZpYS3c6FiEoS1YalzMszvrxFrWG+Ow/PBeOV+/wjBX9x9Kdxe/x6Qd2CetfGxNr8hrU8xe+Vh7WeKktp0Fnq4= X-Received: by 10.107.104.18 with SMTP id d18mr33320248ioc.136.1511631501988; Sat, 25 Nov 2017 09:38:21 -0800 (PST) MIME-Version: 1.0 Sender: wlosh@bsdimp.com Received: by 10.79.108.204 with HTTP; Sat, 25 Nov 2017 09:38:21 -0800 (PST) X-Originating-IP: [2603:300b:6:5100:9579:bb73:7b7f:aadd] In-Reply-To: <27c9395f-5b3c-a062-3aee-de591770af0b@FreeBSD.org> References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <27c9395f-5b3c-a062-3aee-de591770af0b@FreeBSD.org> From: Warner Losh Date: Sat, 25 Nov 2017 10:38:21 -0700 X-Google-Sender-Auth: hRNqrbEqFn4VgCb2ByZQrwSf-oE Message-ID: Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Andriy Gapon Cc: Scott Long , FreeBSD FS , freebsd-geom@freebsd.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.25 X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:38:23 -0000 On Sat, Nov 25, 2017 at 9:58 AM, Andriy 
Gapon wrote: > On 25/11/2017 18:25, Warner Losh wrote: > > > > > > On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon > > wrote: > > > > On 24/11/2017 16:57, Scott Long wrote: > > > > > > > > >> On Nov 24, 2017, at 6:34 AM, Andriy Gapon > wrote: > > >> > > >> On 24/11/2017 15:08, Warner Losh wrote: > > >>> > > >>> > > >>> On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > > > >>> >> wrote: > > >>> > > >>> > > >>> https://reviews.freebsd.org/D13224 > > D13224 > > > > > >>> > > >>> Anyone interested is welcome to join the review. > > >>> > > >>> > > >>> I think it's a really bad idea. It introduces a > 'one-size-fits-all' > > notion of > > >>> QoS that seems misguided. It conflates a shorter timeout with > don't > > retry. And > > >>> why is retrying bad? It seems more a notion of 'fail fast' or so > other > > concept. > > >>> There's so many other ways you'd want to use it. And it uses the > same return > > >>> code (EIO) to mean something new. It's generally meant 'The > lower layers > > have > > >>> retried this, and it failed, do not submit it again as it will > not > > succeed' with > > >>> 'I gave it a half-assed attempt, and that failed, but > resubmission might > > work'. > > >>> This breaks a number of assumptions in the BUF/BIO layer as well > as > > parts of CAM > > >>> even more than they are broken now. > > >>> > > >>> So let's step back a bit: what problem is it trying to solve? > > >> > > >> A simple example. I have a mirror, I issue a read to one of its > > members. Let's > > >> assume there is some trouble with that particular block on that > > particular disk. > > >> The disk may spend a lot of time trying to read it and would > still fail. > > With > > >> the current defaults I would wait 5x that time to finally get the > error back. > > >> Then I go to another mirror member and get my data from there. > > > > > > There are many RAID stacks that already solve this problem by > having a policy > > > of always reading all disk members for every transaction, and > throwing > > away the > > > sub-transactions that arrive late. It’s not a policy that is > always > > desired, but it > > > serves a useful purpose for low-latency needs. > > > > That's another possible and useful strategy. > > > > >> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the > first read, get > > >> the error back sooner and try the other disk sooner. Only if I > know that there > > >> are no other copies to try, then I would use the normal read with > all the retrying. > > >> > > > > > > I agree with Warner that what you are proposing is not correct. > It weakens the > > > contract between the disk layer and the upper layers, making it > less clear who is > > > responsible for retries and less clear what “EIO” means. That > contract is already > > > weak due to poor design decisions in VFS-BIO and GEOM, and Warner > and I > > > are working on a plan to fix that. > > > > Well... I do realize now that there is some problem in this area, > both you and > > Warner mentioned it. But knowing that it exists is not the same as > knowing what > > it is :-) > > I understand that it could be rather complex and not easy to > describe in a short > > email... > > > > But then, this flag is optional, it's off by default and no one is > forced to > > used it. If it's used only by ZFS, then it would not be horrible.
> > > > > > Except that it isn't the same flag as what Solaris has (its B_FAILFAST > does > > something different: it isn't about limiting retries but about failing > ALL the > > queued I/O for a unit, not just trying one retry), and the problems that > it > > solves are quite rare. And if you return a different errno, then the EIO > > contract is still fulfilled. > > Yes, it isn't the same. > I think that illumos flag does even more. Since it isn't the same, and there's not other systems that do a similar thing, that ups the burden of proof that this is a good idea. > Unless it makes things very hard for the infrastructure. > > But I am circling back to not knowing what problem(s) you and Warner > are > > planning to fix. > > > > > > The middle layers of the I/O system are a bit fragile in the face of I/O > errors. > > We're fixing that. > > What are the middle layers? The buffer cache and lower layers of the UFS code is where the problems chiefly lie. > Of course, you still haven't articulated why this approach would be better > > Better than what? > > Well, anything? > > nor > > show any numbers as to how it makes things better. > > By now, I have. See my reply to Scott's email. I just checked my email, I've seen no such reply. I checked it before I replied. Maybe it's just delayed. Warner From owner-freebsd-geom@freebsd.org Sat Nov 25 17:41:03 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id EB5D7DEBD5E; Sat, 25 Nov 2017 17:41:03 +0000 (UTC) (envelope-from agapon@gmail.com) Received: from mail-lf0-f42.google.com (mail-lf0-f42.google.com [209.85.215.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 6A68067AB8; Sat, 25 Nov 2017 17:41:03 +0000 (UTC) (envelope-from agapon@gmail.com) Received: by mail-lf0-f42.google.com with SMTP id o41so28431705lfi.2; Sat, 25 Nov 2017 09:41:03 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=XPeLSjy+VQezLXaWyLqVGxRiCz8i9BQr86mV3Lj2YAY=; b=P5aLhmuYvRBS2YzstZB/VysjO8tDdJJbkA4wc5dEAqR37aBMeuaP2jmTHwBSSXeZoZ 0/2t++942PSBvM4SrNJidCkQSPuCaHdvGBF5xAkDIaPTmKg6msEv9PINiWbmNcYkLE3i vmXz5BwDSRT/pGxmmHsim4Zlm6/ifAOlBZAJRArWDncGHwtgtTrM/BjaFwrPg+9suF2q gUxNmXczOtc90V3bTXcDD6x8RCsqpVPSX2NIw8eolApVVfBsl/wF+oKJF8hpowlPprof E341dmmpia2l/OkYqRHuWxAPA/S3nnZCG8TfSC2fGc6RsaWZ8/3BeIyYuYb9VLIXjrs3 J0ug== X-Gm-Message-State: AJaThX5VQgIz99QNQimDK5dBVUgDFPyYvPs1xAHfwmF285C9KdPkF+Db 8WvrI088SaihgKke3mGbprI= X-Google-Smtp-Source: AGs4zMaKaiPdOiBetpyS/MEYxVNA+8qROI/0uCcumggNbeOONXIBzJLKs7oDi2qoq0qJD4SHl4FwIg== X-Received: by 10.46.99.211 with SMTP id s80mr11250056lje.7.1511631660987; Sat, 25 Nov 2017 09:41:00 -0800 (PST) Received: from [192.168.0.88] (east.meadow.volia.net.
[93.72.151.96]) by smtp.googlemail.com with ESMTPSA id m26sm5121410ljb.61.2017.11.25.09.40.59 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 25 Nov 2017 09:41:00 -0800 (PST) Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Warner Losh Cc: FreeBSD FS , freebsd-geom@freebsd.org, Scott Long References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> From: Andriy Gapon Message-ID: <9f23f97d-3614-e4d2-62fe-99723c5e3879@FreeBSD.org> Date: Sat, 25 Nov 2017 19:40:59 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:41:04 -0000 Before anything else, I would like to say that I got an impression that we speak from so different angles that we either don't understand each other's words or, even worse, misinterpret them. On 25/11/2017 18:36, Warner Losh wrote: > > > On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon > wrote: > > On 24/11/2017 18:33, Warner Losh wrote: > > > > > > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon > > >> wrote: > > > >     On 24/11/2017 15:08, Warner Losh wrote: > >     > > >     > > >     > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > > >     > >>> wrote: > >     > > >     > > >     >     https://reviews.freebsd.org/D13224 > >      > > >     >> > >     > > >     >     Anyone interested is welcome to join the review. > >     > > >     > > >     > I think it's a really bad idea. It introduces a 'one-size-fits-all' > notion of > >     > QoS that seems misguided. It conflates a shorter timeout with don't > retry. And > >     > why is retrying bad? It seems more a notion of 'fail fast' or so > other concept. > >     > There's so many other ways you'd want to use it. And it uses the > same return > >     > code (EIO) to mean something new. It's generally meant 'The lower > layers have > >     > retried this, and it failed, do not submit it again as it will not > succeed' with > >     > 'I gave it a half-assed attempt, and that failed, but resubmission > might work'. > >     > This breaks a number of assumptions in the BUF/BIO layer as well as > parts of CAM > >     > even more than they are broken now. > >     > > >     > So let's step back a bit: what problem is it trying to solve? > > > >     A simple example.  I have a mirror, I issue a read to one of its > members.  Let's > >     assume there is some trouble with that particular block on that > particular disk. > >      The disk may spend a lot of time trying to read it and would still > fail.  With > >     the current defaults I would wait 5x that time to finally get the > error back. > >     Then I go to another mirror member and get my data from there. > >     IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first > read, get > >     the error back sooner and try the other disk sooner.  Only if I know > that there > >     are no other copies to try, then I would use the normal read with all the > >     retrying. > > > > > > It sounds like you are optimizing the wrong thing and taking an overly > > simplistic view of quality of service. 
> > First, failing blocks on a disk is fairly rare. Do you really want to optimize > > for that case? > > If it can be done without any harm to the sunny day scenario, then why not? > I think that 'robustness' is the word here, not 'optimization'. > > > I fail to see how it is a robustness issue. You've not made that case. You want > the I/O to fail fast so you can give another disk a shot sooner. That's > optimization. Then you can call a protection against denial-of-service an optimization too. You want to do things faster, right? > > Second, you're really saying 'If you can't read it fast, fail" since we only > > control the software side of read retry. > > Am I? > That's not what I wanted to say, really.  I just wanted to say, if this I/O > fails, don't retry it, leave it to me. > This is very simple, simplistic as you say, but I like simple. > > > Right. Simple doesn't make it right. In fact, simple often makes it wrong. I agree. The same applies to complex well. Let's stop at this. > We > have big issues with the nvd device today because it's mindlessly queues all the > trim requests to the NVMe device w/o collapsing them, resulting in horrible > performance. > > > There's new op codes being proposed > > that say 'read or fail within Xms' which is really what you want: if it's taking > > too long on disk A you want to move to disk B. The notion here was we'd return > > EAGAIN (or some other error) if it failed after Xms, and maybe do some emulation > > in software for drives that don't support this. You'd tweak this number to > > control performance. You're likely to get a much bigger performance win all the > > time by scheduling I/O to drives that have the best recent latency. > > ZFS already does some latency based decisions. > The things that you describe are very interesting, but they are for the future. > > > Third, do you have numbers that show this is actually a win? > > I do not have any numbers right now. > What kind of numbers would you like?  What kind of scenarios? > > > The usual kind. How is latency for I/O improved when you have a disk with a few > failing sectors that take a long time to read (which isn't a given: some sectors > fail fast). Today I gave an example of how four retries added about 9 seconds of additional delay. I think that that is significant. > What happens when you have a failed disk? etc. How does this compare > with the current system. I haven't done such an experiment. I guess it depends on how exactly the disk fails. There is a big difference between a disk dropping a link and a disk turning into a black hole. > Basically, how do you know this will really make things better and isn't some > kind of 'feel good' thing about 'doing something clever' about the problem that > may actually make things worse. > > > This is a terrible > > thing from an architectural view. > > You have said this several times, but unfortunately you haven't explained it > yet. > > > I have explained it. You weren't listening. This is the first time I see the below list or anything like it. > 1. It breaks the EIO contract that's currently in place. This needs further explanation. > 2. It presumes to know what kind of retries should be done at the upper layers > where today we have a system that's more black and white. I don't understand this argument. If your upper level code does not know how to do retries, then it should not concern itself with that and should not use the flag. 
> You don't know the > same info the low layers have to know whether to try another drive, or just > retry this one. Eh? Either we have different definitions of upper and lower layers or I don't understand how lower layers (e.g. CAM) can know about another drive. > 3. It assumes that retries are the source of latency in the system. they aren't > necessarily. I am not assuming that at all for the general case. > 4. It assumes retries are necessarily slow: they may be, they might not be. All > depends on the drive (SSDs repeated I/O are often faster than actual I/O). Of course. But X plus epsilon is always greater than X. And we know than in many practical cases epsilon can be rather large. > 5. It's just one bit when you really need more complex nuances to get good QoE > out of the I/O system. Retries is an incidental detail that's not that > important, while latency is what you care most about minimizing. You wouldn't > care if I tried to read the data 20 times if it got the result faster than going > to a different drive. That's a good point. But then again, it's the upper layers that have a better chance of predicting this kind of thing. That is, if I know that my backup storage is extremely slow, then I will allow the fast primary storage do all retries it wants to do. It's not CAM nor scsi_da nor a specific SIM that can make those decisions. It's an issuer of the I/O request [or an intermediate geom that encapsulates that knowledge and effectively acts as an issuer of I/O-s to the lower geoms]. > 6. It's putting the wrong kind of specific hints into the mix. This needs further explanation. > > Absent numbers that show it's a big win, I'm > > very hesitant to say OK. > > > > Forth, there's a large number of places in the stack today that need to > > communicate their I/O is more urgent, and we don't have any good way to > > communicate even that simple concept down the stack. > > That's unfortunately, but my proposal has quite little to do with I/O > scheduling, priorities, etc. > > > Except it does. It dictates error recovery policy which is I/O scheduling. > > > Finally, the only places that ZFS uses the TRYHARDER flag are for things like > > the super block if I'm reading the code right. It doesn't do it for normal I/O. > > Right.  But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in the > same way as ZIO_FLAG_TRYHARD. > > > There's no code to cope with what would happen if all the copies of a block > > couldn't be read with the NORETRY flag. One of them might contain the data. > > ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above. > > > Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue when an > I/O fails' It doesn't control retries at all in Solaris. It's a different > concept entirely, and one badly thought out. I think that it does control retries. And it does even more. My understanding is that bio-s with B_FAILFAST can be failed immediately in the situation roughly equivalent to a CAM devq (or simq) being frozen. 
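To make the intended division of labour concrete, here is a rough sketch of the issuer-side policy discussed above. Every name in it is made up for illustration; it is neither the ZFS vdev_mirror code nor anything from D13224.

#include <sys/types.h>
#include <errno.h>

#define RD_FLAG_FAILFAST 0x01    /* hypothetical "don't try hard" hint */

struct mirror;                   /* opaque for this sketch */

/* Hypothetical helpers: read one copy with the given hint, count copies. */
int mirror_read_child(struct mirror *m, int child, off_t off, void *buf,
    size_t len, int flags);
int mirror_child_count(struct mirror *m);

/*
 * The issuer owns the retry policy: ask each copy for a quick answer
 * while other copies remain, and fall back to the slow, fully retried
 * read only after every cheap option has been exhausted.
 */
int
mirror_read(struct mirror *m, off_t off, void *buf, size_t len)
{
    int c, n, error;

    n = mirror_child_count(m);
    error = EIO;
    for (c = 0; c < n; c++) {
        error = mirror_read_child(m, c, off, buf, len, RD_FLAG_FAILFAST);
        if (error == 0)
            return (0);     /* a healthy copy answered quickly */
    }
    for (c = 0; c < n; c++) {
        error = mirror_read_child(m, c, off, buf, len, 0);
        if (error == 0)
            return (0);     /* the full retry effort paid off */
    }
    return (error);         /* definitive failure from the last copy */
}

The decision to retry stays with the code that actually knows whether other copies exist; the layers below only learn how hard they are being asked to try.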
-- Andriy Gapon From owner-freebsd-geom@freebsd.org Sat Nov 25 17:50:08 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DE8B2DEBFAC; Sat, 25 Nov 2017 17:50:08 +0000 (UTC) (envelope-from agapon@gmail.com) Received: from mail-lf0-f43.google.com (mail-lf0-f43.google.com [209.85.215.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 6AC8C67DE5; Sat, 25 Nov 2017 17:50:08 +0000 (UTC) (envelope-from agapon@gmail.com) Received: by mail-lf0-f43.google.com with SMTP id x68so28511429lff.0; Sat, 25 Nov 2017 09:50:08 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=5ApCUC5DjZ1viDCnCrzu033PbzMQh+JO2M+WEA6YFh8=; b=CquXPToyf3XEhqCcFIHTuvpNPrIFqKCBFokh83IxQ+oYN3aSuSdvXrSOHqZFFQ84vz vhu8Qjz8JZoSka/xW6NTMm5ECJKoABu1aaHXhgjJqgcp3rByR/DgtG30q8O9RX7msDqq g4IxSrPcWYCMoQMA0gs8OANCF/PfAxxCjaYiAdRk+Plp0zYLU7NLFwgTEnzRddWj9gtO t0iSe/1/no3htvTsYhNsjvOqt2a2gXCfAX8aKDwO/Nq7RU4KOjWEv2GXSbtlfAt3Rc0P 5nxbcVEZzaf3uI04FEwySbDCnxaOce4fW/LRQ++dZCVhVjtLXMckjLh2bYT4lMQwlqGk 8Ggw== X-Gm-Message-State: AJaThX7sPSpEO7PZLcrL+gIQxz0jcCZMwX4PaulyjEzVMvq9IhRng6Bd 2RqYuI2wnS8U4nnsCkpfjxWlvMbVagw= X-Google-Smtp-Source: AGs4zMZZPtEaOC5dWapQhfIZTZBuOiIG6iLS8nMAw2SiAkHolrWh3Xnzl0XQnyoZVX4HQjvhBQqODQ== X-Received: by 10.46.77.148 with SMTP id c20mr13420749ljd.156.1511632205845; Sat, 25 Nov 2017 09:50:05 -0800 (PST) Received: from [192.168.0.88] (east.meadow.volia.net. 
[93.72.151.96]) by smtp.googlemail.com with ESMTPSA id a9sm4184731lfg.12.2017.11.25.09.50.04 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 25 Nov 2017 09:50:05 -0800 (PST) Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Warner Losh Cc: Scott Long , FreeBSD FS , freebsd-geom@freebsd.org References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <27c9395f-5b3c-a062-3aee-de591770af0b@FreeBSD.org> From: Andriy Gapon Message-ID: Date: Sat, 25 Nov 2017 19:50:03 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:50:09 -0000 On 25/11/2017 19:38, Warner Losh wrote: > > > On Sat, Nov 25, 2017 at 9:58 AM, Andriy Gapon > wrote: > > On 25/11/2017 18:25, Warner Losh wrote: > > > > > > On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon > > >> wrote: > > > >     On 24/11/2017 16:57, Scott Long wrote: > >     > > >     > > >     >> On Nov 24, 2017, at 6:34 AM, Andriy Gapon wrote: > >     >> > >     >> On 24/11/2017 15:08, Warner Losh wrote: > >     >>> > >     >>> > >     >>> On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > >     > > >     >>> >>> wrote: > >     >>> > >     >>> > >     >>>    https://reviews.freebsd.org/D13224 > >      > > >     >> > >     >>> > >     >>>    Anyone interested is welcome to join the review. > >     >>> > >     >>> > >     >>> I think it's a really bad idea. It introduces a 'one-size-fits-all' > >     notion of > >     >>> QoS that seems misguided. It conflates a shorter timeout with don't > >     retry. And > >     >>> why is retrying bad? It seems more a notion of 'fail fast' or so other > >     concept. > >     >>> There's so many other ways you'd want to use it. And it uses the > same return > >     >>> code (EIO) to mean something new. It's generally meant 'The lower > layers > >     have > >     >>> retried this, and it failed, do not submit it again as it will not > >     succeed' with > >     >>> 'I gave it a half-assed attempt, and that failed, but resubmission > might > >     work'. > >     >>> This breaks a number of assumptions in the BUF/BIO layer as well as > >     parts of CAM > >     >>> even more than they are broken now. > >     >>> > >     >>> So let's step back a bit: what problem is it trying to solve? > >     >> > >     >> A simple example.  I have a mirror, I issue a read to one of its > >     members.  Let's > >     >> assume there is some trouble with that particular block on that > >     particular disk. > >     >> The disk may spend a lot of time trying to read it and would still > fail.  > >     With > >     >> the current defaults I would wait 5x that time to finally get the > error back. > >     >> Then I go to another mirror member and get my data from there. > >     > > >     > There are many RAID stacks that already solve this problem by having > a policy > >     > of always reading all disk members for every transaction, and throwing > >     away the > >     > sub-transactions that arrive late.  
It’s not a policy that is always > >     desired, but it > >     > serves a useful purpose for low-latency needs. > > > >     That's another possible and useful strategy. > > > >     >> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first > read, get > >     >> the error back sooner and try the other disk sooner.  Only if I > know that there > >     >> are no other copies to try, then I would use the normal read with > all the retrying. > >     >> > >     > > >     > I agree with Warner that what you are proposing is not correct.  It > weakens the > >     > contract between the disk layer and the upper layers, making it less > clear who is > >     > responsible for retries and less clear what “EIO” means.  That > contract is already > >     > weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I > >     > are working on a plan to fix that. > > > >     Well...  I do realize now that there is some problem in this area, > both you and > >     Warner mentioned it.  But knowing that it exists is not the same as > knowing what > >     it is :-) > >     I understand that it could be rather complex and not easy to describe > in a short > >     email... > > > >     But then, this flag is optional, it's off by default and no one is > forced to > >     used it.  If it's used only by ZFS, then it would not be horrible. > > > > > > Except that it isn't the same flag as what Solaris has (its B_FAILFAST does > > something different: it isn't about limiting retries but about failing ALL the > > queued I/O for a unit, not just trying one retry), and the problems that it > > solves are quite rare. And if you return a different errno, then the EIO > > contract is still fulfilled.  > > Yes, it isn't the same. > I think that illumos flag does even more. > > > Since it isn't the same, and there's not other systems that do a similar thing, > that ups the burden of proof that this is a good idea. > > >     Unless it makes things very hard for the infrastructure. > >     But I am circling back to not knowing what problem(s) you and Warner are > >     planning to fix. > > > > > > The middle layers of the I/O system are a bit fragile in the face of I/O errors. > > We're fixing that. > > What are the middle layers? > > > The buffer cache and lower layers of the UFS code is where the problems chiefly lie. Those are the upper layers from the point of view of GEOM and things below it. If they don't set that flag on the bio, then it is not going to magically appear in their I/O path. > > Of course, you still haven't articulated why this approach would be better > > Better than what? > > > Well, anything? I think that I have described how it is better than what we have now, which I think is a part of 'anything'. > > > nor > > show any numbers as to how it makes things better. > > By now, I have.  See my reply to Scott's email. > > > I just checked my email, I've seen no such reply. I checked it before I > replied.  Maybe it's just delayed. Sorry, my mistake. I thought that I sent that email in the morning, but I didn't. I have just sent it. Apologies again. 
-- Andriy Gapon From owner-freebsd-geom@freebsd.org Sat Nov 25 17:57:40 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3B7CEDEC22E for ; Sat, 25 Nov 2017 17:57:40 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: from mail-it0-x22c.google.com (mail-it0-x22c.google.com [IPv6:2607:f8b0:4001:c0b::22c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id F16BF681C7 for ; Sat, 25 Nov 2017 17:57:39 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: by mail-it0-x22c.google.com with SMTP id x13so16485614iti.4 for ; Sat, 25 Nov 2017 09:57:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bsdimp-com.20150623.gappssmtp.com; s=20150623; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=sDDg+58CGPKx0+tTc1ED8m6HEe5BrIsIIQB/kA3ChwU=; b=qPFfmzjRNBbP3DmUvRWgC5jMjFTpdNySHNEs8H/xdoUucHPhTJiEXi40KmJeeH77Ey sY1c6XJ+8SS4Z0oqTJQ9XQn1y9K6v09yaqrPtqKmfWVtforMcMiq7TYqaT9DLUuNZA2A WiGcKFBDw7GXRV9Z3D6RsIJ4XMyMU7TZNOuYra1Q8c9dnWl5m3DqvlaeK1OeayjCg8jB wIXzdVHAxvnLU05sVIu+j+ECf9iI8/rotfDVUVQOtJmW77SwHCZBIYDUAwBkj7yosMCR 56TfIrZhYVrxHbWJQFjg/YMXIHP/MXKrfEAcRxZ78HvVBH5c6Vqc1OvzZKJVarY8JVdX zAsw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=sDDg+58CGPKx0+tTc1ED8m6HEe5BrIsIIQB/kA3ChwU=; b=Wkas7BByVOu9fOxwWH80k1VVi0R6Af2UvuneONPYVeP9GQJ5z2Ycy1blOMhUZP+Nsy 8YNHB2vnHS6LgkoE7ynSodZSLqQkeLwQ2gRJExfzJw8lGDIRB32GxUD879P7Q6XDKK5B oMUFNgr6DPFRQa0VuuQJY+BF1DWC8vgriOGLXLjxtJzC9ZgXeO7M9CLkFgJHF+FAjz38 wgMEl+38RE9jqCu4lyJP93YT5thDSigJNFQVsjzXaOdKp62OdyP/RKXZTYt2ZTW+6vDj RWMAMO3PzQK1rS1364V9WGmQ1wL4ORhQOQKLG2WQC9O3Y3qmJbScYZVzoK8IcKO2dPVU gzBQ== X-Gm-Message-State: AJaThX691pSDEbiNbam1lowG1waMomVGhTosBuZaQ9W8rAScs8utnECc aGTe+R6r5dTOo4hhyhANeDNW/HpuZv04Ad5RMYSUOA== X-Google-Smtp-Source: AGs4zMYazlz2WrPFSx5PeIGxX4EhK1qMgTHVpcFmS/KlPnGtbEi0uWZuo3DIO5vwYgUNvIHh2OtB3xdssoaEXn1dta8= X-Received: by 10.36.164.13 with SMTP id z13mr22202600ite.115.1511632659185; Sat, 25 Nov 2017 09:57:39 -0800 (PST) MIME-Version: 1.0 Sender: wlosh@bsdimp.com Received: by 10.79.108.204 with HTTP; Sat, 25 Nov 2017 09:57:38 -0800 (PST) X-Originating-IP: [2603:300b:6:5100:9579:bb73:7b7f:aadd] In-Reply-To: <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org> References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org> From: Warner Losh Date: Sat, 25 Nov 2017 10:57:38 -0700 X-Google-Sender-Auth: 6c28FYqH0hmmAfPKJFmQ73oXmro Message-ID: Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Andriy Gapon Cc: Scott Long , FreeBSD FS , freebsd-geom@freebsd.org Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.25 X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:57:40 -0000 On Sat, Nov 25, 2017 at 10:36 AM, Andriy Gapon wrote: > > Timestamp of the first 
> error is Jun 16 10:40:18.
> Timestamp of the last error is Jun 16 10:40:27.
> So, it took an additional 9 seconds to finally produce EIO.
> That disk is a part of a ZFS mirror.  If the request was failed after the first
> attempt, then ZFS would be able to get the data from a good disk much sooner.
>
> And don't take me wrong, I do NOT want CAM or GEOM to make that decision by
> itself.  I want ZFS to be able to tell the lower layers when they should try as
> hard as they normally do and when they should report an I/O error as soon as it
> happens without any retries.

Let's walk through this. You see that it takes a long time to fail an I/O.
Perfectly reasonable observation. There are two reasons for this. One is that
the disks take a while to make an attempt to get the data. The second is that
the system has a global policy that's biased towards 'recover the data' over
'fail fast'. These can be fixed by reducing the timeouts, or lowering the
read-retry count for a given drive, or globally as a policy decision made by the
system administrator.

It may be perfectly reasonable to ask the lower layers to 'fail fast' and have
either a hard or a soft deadline on the I/O for a subset of I/O. A hard deadline
would return ETIMEDOUT or something when it's passed and cancel the I/O. This
gives better determinism in the system, but some systems can't cancel just one
I/O (like SATA drives), so we have to flush the whole queue. If we get a lot of
these, performance suffers. However, for some classes of drives, you know that
if it doesn't succeed in 1s after you submit it to the drive, it's unlikely to
complete successfully and it's worth the performance hit on a drive that's
already acting up.

You could have a soft timeout, which says 'don't take any additional action
after X time has elapsed and you get word about this I/O'. This is similar to
the hard timeout, but just stops retrying after the deadline has passed. This
scenario is better for the other users of the drive, assuming that the
read-recovery operations aren't starving them. It's also easier to implement,
but has worse worst-case performance characteristics.

You aren't asking to limit retries. You're really asking the I/O subsystem to
limit, where it can, the amount of time spent on an I/O so you can try another
one. Your means of doing this is to tell it not to retry. That's the wrong
means. It shouldn't be listed in the API that it's a 'NO RETRY' request. It
should be a QoS request flag: fail fast.

Part of why I'm being so difficult is that you don't understand this and are
proposing a horrible API. It should have a different name. The other reason is
that I absolutely do not want to overload EIO. You must return a different error
back up the stack. You've shown no interest in this in the past, which is also a
needless argument. We've given good reasons, and you've pooh-poohed them with
bad arguments.

Also, this isn't the data I asked for. I know things can fail slowly. I was
asking for how it would improve systems running like this. As in "I implemented
it, and was able to fail over to this other drive faster" or something like
that. Actual drive failure scenarios vary widely, and optimizing for this one
failure is unwise. It may be the right optimization, but it may not. There are
lots of tricky edges in this space.
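To make the hard/soft deadline distinction concrete, here is a minimal sketch of
the soft variant, assuming a per-request context that a driver would keep.  The
io_ctx structure and its fields are hypothetical bookkeeping, not part of any
existing FreeBSD driver; only sbinuptime() and SBT_1S are existing kernel
interfaces.  A hard deadline would additionally have to abort the command
already in flight, which, as noted above, SATA cannot do for a single I/O
without flushing the whole queue.

/*
 * Sketch of the "soft deadline" idea: record when the request was
 * submitted and stop retrying once a time budget is spent.
 */
#include <sys/param.h>
#include <sys/time.h>

struct io_ctx {				/* hypothetical per-request bookkeeping */
	sbintime_t	submit_time;	/* taken at submission via sbinuptime() */
	sbintime_t	soft_deadline;	/* e.g. SBT_1S; 0 means "retry as usual" */
	int		retries_left;
};

/* Returns 1 if another retry should be attempted, 0 if the I/O should fail now. */
static int
io_should_retry(const struct io_ctx *ctx)
{
	if (ctx->retries_left <= 0)
		return (0);
	/* Soft deadline: no further recovery attempts once the budget is gone. */
	if (ctx->soft_deadline != 0 &&
	    sbinuptime() - ctx->submit_time > ctx->soft_deadline)
		return (0);
	return (1);
}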
Warner

From owner-freebsd-geom@freebsd.org Sat Nov 25 22:17:51 2017
From: Warner Losh
Date: Sat, 25 Nov 2017 15:17:49 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long

On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon wrote:
>
> Before anything else, I would like to say that I got an impression that we speak
> from so different angles
> that we either don't understand each other's words or, even worse,
> misinterpret them.

I understand what you are suggesting. Don't take my disagreement with your
proposal as willful misinterpretation. You are proposing something that's a
quick hack. Maybe a useful one, but it's still problematic because it has the
upper layers telling the lower layers what to do (don't do your retry), rather
than what service to provide (I prefer a fast error exit over every effort to
recover the data). And it also does it by overloading the meaning of EIO, which
has real problems that you've not been open to hearing about, I assume due to
your narrow use case apparently blinding you to the bigger-picture issues with
that route.

However, there's a way forward which I think will solve these objections.

First, designate that I/O that fails due to short-circuiting the normal recovery
process returns ETIMEDOUT. The I/O stack currently doesn't use this at all (it
was introduced for the network side of things). This is a general catch-all for
an I/O that we complete before the lower layers have given it the maximum amount
of effort to recover the data, at the user's request.

Next, don't use a flag. Instead add a 32-bit field that is called bio_qos for
quality-of-service hints and another 32-bit field for bio_qos_param. This allows
us to pass down specific quality-of-service desires from the filesystem to the
lower layers. The parameter will be unused in your proposal. BIO_QOS_FAIL_EARLY
may be a good name for a value to set it to (at the moment, just use 1). We'll
assign the other QOS values later for other things. It would allow us to
implement the other sorts of QoS things I talked about as well.

As for B_FAILFAST, it's quite unlike what you're proposing, except in one
incidental detail. It's a complicated state machine that the sd driver in
Solaris implemented. It's an entire protocol. When the device gets errors, it
goes into this failfast state machine. The state machine makes a determination
that the errors are indicators the device is GONE, at least for the moment, and
it will fail I/Os in various ways from there. Any new I/Os that are submitted
will be failed (there's conditional behavior here: depending on a global setting
it's either all I/O or just B_FAILFAST I/O). ZFS appears to set this bit for its
discovery code only, when a device not being there would significantly delay
things. Anyway, when the device returns (basically an I/O gets through or maybe
some other event happens), the driver exits this mode and returns to normal
operation. It appears to be designed not for the use case that you described,
but rather for a drive that's failing all over the place, so that any pending
I/Os get out of the way quickly. Your use case is only superficially similar to
that use case, so the Solaris / Illumos experiences are mildly interesting, but
due to the differences not a strong argument for doing this. This facility in
Illumos is interesting, but would require significantly more retooling of the
lower I/O layers in FreeBSD to implement fully. Plus Illumos (or maybe just
Solaris) has a daemon that looks at failures to manage them at a higher level,
which might make for a better user experience for FreeBSD, so that's something
that needs to be weighed as well.

We've known for some time that HDD retry algorithms take a long time. The same
is true of some SSD or NVMe algorithms, but not all.
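A rough sketch of what the bio_qos idea could look like, purely for
illustration: the field names and the BIO_QOS_FAIL_EARLY value are taken from
the paragraphs above, the qos_error_disposition() helper is invented here, and
none of this exists in the tree.

#include <sys/param.h>
#include <sys/errno.h>

#define	BIO_QOS_FAIL_EARLY	1	/* value suggested above */

struct bio_qos_sketch {			/* stand-in for new struct bio members */
	uint32_t	bio_qos;	/* QoS hint from the upper layer */
	uint32_t	bio_qos_param;	/* unused for FAIL_EARLY */
};

/*
 * Hypothetical driver error path, called after the first failed attempt:
 * 0 means "keep retrying", otherwise the errno to complete the bio with.
 */
static int
qos_error_disposition(const struct bio_qos_sketch *qos, int retries_left)
{
	if (qos->bio_qos == BIO_QOS_FAIL_EARLY)
		return (ETIMEDOUT);	/* gave up early: distinct from EIO */
	if (retries_left > 0)
		return (0);		/* normal policy: let the driver retry */
	return (EIO);			/* retries exhausted: a real I/O error */
}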
The other objection I have to the 'noretry' naming is that it bakes the
currently observed HDD behavior and recovery into the API. This is undesirable,
as other storage technologies have retry mechanisms that happen quite quickly
(and sometimes in the drive itself). The cutoff between fast and slow recovery
is device-specific, as are the methods used. For example, there are new
proposals out in NVMe (and maybe T10/T13 land) to have new types of READ
commands that specify the quality of service you expect, including providing
some sort of deadline hint to clip how much effort is expended in trying to
recover the data. It would be nice to design a mechanism that allows us to start
using these commands when drives are available with them, and possibly to use
timeouts to allow for a faster abort.

Most of your HDD I/O will complete within maybe ~150ms, with a long tail out to
maybe as long as ~400ms. It might be desirable to set a policy that says 'don't
let any I/Os remain in the device longer than a second' and use this mechanism
to enforce that. Or don't let any I/Os last more than 20x the most recent median
I/O time. A single bit is insufficiently expressive to allow these sorts of
things, which is another reason for my objection to your proposal. With the QOS
fields being independent, the clone routines just copy them and make no
judgement about them.

So, those are my problems with your proposal, and also some hopefully useful
ways to move forward. I've chatted with others for years about introducing QoS
things into the I/O stack, so I know most of the above won't be too contentious
(though ETIMEDOUT I haven't socialized, so that may be an area of concern for
people).

Warner
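For what it's worth, the pass-through point mentioned above would look roughly
like a standard GEOM start routine.  This is a sketch, not code from the thread:
g_clone_bio(), g_io_deliver(), g_std_done, and g_io_request() are real GEOM
routines, while the bio_qos/bio_qos_param members are the hypothetical fields
from this discussion and do not exist in struct bio today (hence the
"#ifdef notyet" guard).

/*
 * Sketch: a GEOM class cloning a bio would carry the QoS hint down
 * unchanged rather than interpret it.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/errno.h>
#include <geom/geom.h>

static void
example_geom_start(struct bio *bp, struct g_consumer *cp)
{
	struct bio *cbp;

	cbp = g_clone_bio(bp);
	if (cbp == NULL) {
		g_io_deliver(bp, ENOMEM);
		return;
	}
	cbp->bio_done = g_std_done;
#ifdef notyet
	/* With QoS fields in struct bio, cloning copies them verbatim. */
	cbp->bio_qos = bp->bio_qos;
	cbp->bio_qos_param = bp->bio_qos_param;
#endif
	g_io_request(cbp, cp);
}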