From owner-freebsd-geom@freebsd.org Fri Nov 24 09:20:07 2017
From: bugzilla-noreply@freebsd.org
Date: Fri, 24 Nov 2017 09:20:07 +0000
To: freebsd-geom@FreeBSD.org
Subject: [Bug 223838] g_bio_clone vs g_bio_duplicate

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223838

Andriy Gapon changed:

           What            |Removed                  |Added
----------------------------------------------------------------------------
                 CC        |                         |pjd@FreeBSD.org
           Assignee        |freebsd-bugs@FreeBSD.org |freebsd-geom@FreeBSD.org

-- 
You are receiving this mail because:
You are the assignee for the bug.

From owner-freebsd-geom@freebsd.org Fri Nov 24 11:01:26 2017
From: Andriy Gapon
Date: Fri, 24 Nov 2017 12:30:30 +0200
To: freebsd-fs@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

https://reviews.freebsd.org/D13224

Anyone interested is welcome to join the review.
Thanks.

-- 
Andriy Gapon
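The patch itself is not reproduced in this thread. Purely as an illustration
of the semantics being discussed, and not the actual D13224 change, a
lowest-layer error path honoring such a hint might behave like the following
user-space model; struct io_req, IOF_NORETRY, submit_io() and the retry
budget are all invented for the example:

    /*
     * Model only: a per-request "no retry" flag that the bottom layer
     * honors by failing fast instead of exhausting its retry budget.
     */
    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define IOF_NORETRY    0x01  /* caller will handle recovery itself */
    #define DRIVER_RETRIES 4     /* stand-in for a driver's retry budget */

    struct io_req {
        unsigned flags;
        int      error;          /* 0 on success, an errno otherwise */
    };

    static int attempts;         /* counts trips to the "hardware" */

    /* One attempt at the medium; models a block the drive cannot read. */
    static bool attempt_once(void)
    {
        attempts++;
        return false;
    }

    static void submit_io(struct io_req *req)
    {
        int tries = (req->flags & IOF_NORETRY) ? 1 : 1 + DRIVER_RETRIES;

        req->error = 0;
        for (int i = 0; i < tries; i++)
            if (attempt_once())
                return;
        req->error = EIO;        /* with IOF_NORETRY this comes back after one attempt */
    }

    int main(void)
    {
        struct io_req fast = { .flags = IOF_NORETRY };
        struct io_req slow = { .flags = 0 };

        submit_io(&fast);
        printf("fail-fast: error=%d after %d attempt(s)\n", fast.error, attempts);
        attempts = 0;
        submit_io(&slow);
        printf("default:   error=%d after %d attempt(s)\n", slow.error, attempts);
        return 0;
    }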
From owner-freebsd-geom@freebsd.org Fri Nov 24 13:08:10 2017
From: Warner Losh
Date: Fri, 24 Nov 2017 06:08:08 -0700
To: Andriy Gapon
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon wrote:
>
> https://reviews.freebsd.org/D13224
>
> Anyone interested is welcome to join the review.

I think it's a really bad idea. It introduces a 'one-size-fits-all' notion
of QoS that seems misguided. It conflates a shorter timeout with don't
retry. And why is retrying bad? It seems more a notion of 'fail fast' or
some other concept. There's so many other ways you'd want to use it. And
it uses the same return code (EIO) to mean something new. It has generally
meant 'the lower layers have retried this, and it failed, do not submit it
again as it will not succeed'; this conflates it with 'I gave it a
half-assed attempt, and that failed, but resubmission might work'. This
breaks a number of assumptions in the BUF/BIO layer as well as parts of
CAM even more than they are broken now.

So let's step back a bit: what problem is it trying to solve?
Warner

From owner-freebsd-geom@freebsd.org Fri Nov 24 13:41:27 2017
From: Andriy Gapon
Date: Fri, 24 Nov 2017 15:34:36 +0200
To: Warner Losh
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On 24/11/2017 15:08, Warner Losh wrote:
> I think it's a really bad idea. It introduces a 'one-size-fits-all'
> notion of QoS that seems misguided. It conflates a shorter timeout with
> don't retry. [...]
>
> So let's step back a bit: what problem is it trying to solve?

A simple example. I have a mirror, I issue a read to one of its members.
Let's assume there is some trouble with that particular block on that
particular disk. The disk may spend a lot of time trying to read it and
would still fail. With the current defaults I would wait 5x that time to
finally get the error back. Then I go to another mirror member and get my
data from there.

IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read,
get the error back sooner and try the other disk sooner. Only if I know
that there are no other copies to try, then I would use the normal read
with all the retrying.

-- 
Andriy Gapon
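The read policy described here can be sketched in a few lines of user-space
C. This is only a model of the idea, with hypothetical names (member_read,
IOF_NORETRY), not vdev_geom code: try every copy with the fail-fast hint
first, and only fall back to fully retried reads when no copy remains.

    #include <errno.h>
    #include <stdio.h>

    #define IOF_NORETRY 0x01
    #define NMEMBERS    2

    /* Stand-in for a member read: member 0 has a bad block, member 1 is fine. */
    static int member_read(int member, unsigned flags, char *buf)
    {
        (void)flags;
        if (member == 0)
            return EIO;      /* with IOF_NORETRY this comes back quickly */
        buf[0] = 'X';        /* pretend we read the data */
        return 0;
    }

    static int mirror_read(char *buf)
    {
        /* First pass: fail fast, so a healthy member gets a chance sooner. */
        for (int m = 0; m < NMEMBERS; m++)
            if (member_read(m, IOF_NORETRY, buf) == 0)
                return 0;

        /* Last resort: no other copies left, retry each member the slow way. */
        for (int m = 0; m < NMEMBERS; m++)
            if (member_read(m, 0, buf) == 0)
                return 0;
        return EIO;
    }

    int main(void)
    {
        char buf[1] = { 0 };
        printf("mirror_read: %s\n", mirror_read(buf) == 0 ? "ok" : "EIO");
        return 0;
    }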
From owner-freebsd-geom@freebsd.org Fri Nov 24 14:58:05 2017
From: Scott Long
Date: Fri, 24 Nov 2017 07:57:55 -0700
To: Andriy Gapon
Cc: Warner Losh, FreeBSD FS, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

> On Nov 24, 2017, at 6:34 AM, Andriy Gapon wrote:
>
> A simple example. I have a mirror, I issue a read to one of its members.
> Let's assume there is some trouble with that particular block on that
> particular disk. The disk may spend a lot of time trying to read it and
> would still fail. With the current defaults I would wait 5x that time to
> finally get the error back. Then I go to another mirror member and get
> my data from there.

There are many RAID stacks that already solve this problem by having a
policy of always reading all disk members for every transaction, and
throwing away the sub-transactions that arrive late. It's not a policy
that is always desired, but it serves a useful purpose for low-latency
needs.

> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read,
> get the error back sooner and try the other disk sooner. Only if I know
> that there are no other copies to try, then I would use the normal read
> with all the retrying.

I agree with Warner that what you are proposing is not correct. It
weakens the contract between the disk layer and the upper layers, making
it less clear who is responsible for retries and less clear what "EIO"
means. That contract is already weak due to poor design decisions in
VFS-BIO and GEOM, and Warner and I are working on a plan to fix that.

Scott
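The "read every member, keep the first answer" policy Scott mentions can be
modeled with one private buffer per member, so that late completions are
simply discarded. This is a hypothetical user-space sketch, not code from
any existing RAID stack:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NMEMBERS 3
    #define BUFSZ    16

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
    static int  winner = -1;
    static char result[BUFSZ];

    static void *member_read(void *arg)
    {
        int  m = (int)(long)arg;
        char priv[BUFSZ];                           /* private buffer per member */

        usleep((useconds_t)(NMEMBERS - m) * 50000); /* member 2 answers first */
        snprintf(priv, sizeof(priv), "data-from-%d", m);

        pthread_mutex_lock(&lock);
        if (winner < 0) {                           /* first completion wins */
            winner = m;
            memcpy(result, priv, sizeof(result));
            pthread_cond_signal(&done);
        }                                           /* later arrivals are discarded */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NMEMBERS];

        for (long m = 0; m < NMEMBERS; m++)
            pthread_create(&tid[m], NULL, member_read, (void *)m);

        pthread_mutex_lock(&lock);
        while (winner < 0)
            pthread_cond_wait(&done, &lock);
        pthread_mutex_unlock(&lock);
        printf("first answer from member %d: %s\n", winner, result);

        for (int m = 0; m < NMEMBERS; m++)          /* reap the stragglers */
            pthread_join(tid[m], NULL);
        return 0;
    }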
From owner-freebsd-geom@freebsd.org Fri Nov 24 16:33:58 2017
From: Warner Losh
Date: Fri, 24 Nov 2017 09:33:56 -0700
To: Andriy Gapon
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon wrote:
> [...]
> A simple example. I have a mirror, I issue a read to one of its members.
> [...] IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first
> read, get the error back sooner and try the other disk sooner. Only if I
> know that there are no other copies to try, then I would use the normal
> read with all the retrying.

It sounds like you are optimizing the wrong thing and taking an overly
simplistic view of quality of service.

First, failing blocks on a disk is fairly rare. Do you really want to
optimize for that case?

Second, you're really saying 'if you can't read it fast, fail', since we
only control the software side of read retry. There's new op codes being
proposed that say 'read or fail within X ms', which is really what you
want: if it's taking too long on disk A you want to move to disk B. The
notion here was we'd return EAGAIN (or some other error) if it failed
after X ms, and maybe do some emulation in software for drives that don't
support this. You'd tweak this number to control performance. You're
likely to get a much bigger performance win all the time by scheduling
I/O to drives that have the best recent latency.

Third, do you have numbers that show this is actually a win? This is a
terrible thing from an architectural view. Absent numbers that show it's
a big win, I'm very hesitant to say OK.

Fourth, there's a large number of places in the stack today that need to
communicate that their I/O is more urgent, and we don't have any good way
to communicate even that simple concept down the stack.

Finally, the only places that ZFS uses the TRYHARDER flag are for things
like the super block, if I'm reading the code right. It doesn't do it for
normal I/O. There's no code to cope with what would happen if all the
copies of a block couldn't be read with the NORETRY flag. One of them
might contain the data.

Warner
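Warner's point about scheduling reads to the drive with the best recent
latency can be modeled with a decaying average per member. The following toy
sketch is illustrative only and does not correspond to actual ZFS or CAM
code; the names and constants are made up:

    #include <stdio.h>

    #define NMEMBERS 2
    #define ALPHA    0.3     /* weight of the newest latency sample */

    static double ewma_ms[NMEMBERS] = { 1.0, 1.0 };

    /* Fold a measured completion time into the member's running estimate. */
    static void record_latency(int m, double sample_ms)
    {
        ewma_ms[m] = ALPHA * sample_ms + (1.0 - ALPHA) * ewma_ms[m];
    }

    /* Pick the member that has been answering fastest lately. */
    static int pick_member(void)
    {
        int best = 0;
        for (int m = 1; m < NMEMBERS; m++)
            if (ewma_ms[m] < ewma_ms[best])
                best = m;
        return best;
    }

    int main(void)
    {
        /* Member 0 starts misbehaving: a few slow completions shift traffic. */
        double samples0[] = { 1.2, 80.0, 120.0, 150.0 };
        for (int i = 0; i < 4; i++) {
            record_latency(0, samples0[i]);
            record_latency(1, 1.1);
            printf("after sample %d, reads go to member %d\n", i, pick_member());
        }
        return 0;
    }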
From owner-freebsd-geom@freebsd.org Fri Nov 24 17:18:05 2017
From: Andriy Gapon
Date: Fri, 24 Nov 2017 19:17:58 +0200
To: Scott Long
Cc: Warner Losh, FreeBSD FS, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On 24/11/2017 16:57, Scott Long wrote:
>> A simple example. I have a mirror, I issue a read to one of its members.
>> [...] Then I go to another mirror member and get my data from there.
>
> There are many RAID stacks that already solve this problem by having a
> policy of always reading all disk members for every transaction, and
> throwing away the sub-transactions that arrive late. It's not a policy
> that is always desired, but it serves a useful purpose for low-latency
> needs.

That's another possible and useful strategy.

>> IMO, this is not optimal. [...]
>
> I agree with Warner that what you are proposing is not correct. It
> weakens the contract between the disk layer and the upper layers, making
> it less clear who is responsible for retries and less clear what "EIO"
> means. That contract is already weak due to poor design decisions in
> VFS-BIO and GEOM, and Warner and I are working on a plan to fix that.

Well... I do realize now that there is some problem in this area, both
you and Warner mentioned it. But knowing that it exists is not the same
as knowing what it is :-)
I understand that it could be rather complex and not easy to describe in
a short email...

But then, this flag is optional, it's off by default and no one is forced
to use it. If it's used only by ZFS, then it would not be horrible.
Unless it makes things very hard for the infrastructure.
But I am circling back to not knowing what problem(s) you and Warner are
planning to fix.

-- 
Andriy Gapon
From owner-freebsd-geom@freebsd.org Fri Nov 24 17:21:01 2017
From: Andriy Gapon
Date: Fri, 24 Nov 2017 19:20:51 +0200
To: Warner Losh
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On 24/11/2017 18:33, Warner Losh wrote:
> [...]
> It sounds like you are optimizing the wrong thing and taking an overly
> simplistic view of quality of service.
> First, failing blocks on a disk is fairly rare. Do you really want to
> optimize for that case?

If it can be done without any harm to the sunny day scenario, then why
not?
I think that 'robustness' is the word here, not 'optimization'.

> Second, you're really saying 'if you can't read it fast, fail', since we
> only control the software side of read retry.

Am I?
That's not what I wanted to say, really. I just wanted to say, if this
I/O fails, don't retry it, leave it to me.
This is very simple, simplistic as you say, but I like simple.

> There's new op codes being proposed that say 'read or fail within X ms',
> which is really what you want: if it's taking too long on disk A you
> want to move to disk B. The notion here was we'd return EAGAIN (or some
> other error) if it failed after X ms, and maybe do some emulation in
> software for drives that don't support this. You'd tweak this number to
> control performance. You're likely to get a much bigger performance win
> all the time by scheduling I/O to drives that have the best recent
> latency.

ZFS already does some latency based decisions.
The things that you describe are very interesting, but they are for the
future.

> Third, do you have numbers that show this is actually a win?

I do not have any numbers right now.
What kind of numbers would you like? What kind of scenarios?

> This is a terrible thing from an architectural view.

You have said this several times, but unfortunately you haven't explained
it yet.

> Absent numbers that show it's a big win, I'm very hesitant to say OK.
>
> Fourth, there's a large number of places in the stack today that need to
> communicate that their I/O is more urgent, and we don't have any good
> way to communicate even that simple concept down the stack.

That's unfortunate, but my proposal has quite little to do with I/O
scheduling, priorities, etc.

> Finally, the only places that ZFS uses the TRYHARDER flag are for things
> like the super block, if I'm reading the code right. It doesn't do it
> for normal I/O.

Right. But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in
the same way as ZIO_FLAG_TRYHARD.

> There's no code to cope with what would happen if all the copies of a
> block couldn't be read with the NORETRY flag. One of them might contain
> the data.

ZFS is not that fragile :)  see ZIO_FLAG_IO_RETRY above.

-- 
Andriy Gapon
From owner-freebsd-geom@freebsd.org Sat Nov 25 10:54:06 2017
From: Scott Long
Date: Sat, 25 Nov 2017 03:54:01 -0700
To: Andriy Gapon
Cc: Warner Losh, FreeBSD FS, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

> On Nov 24, 2017, at 10:17 AM, Andriy Gapon wrote:
>
>>> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first
>>> read, get the error back sooner and try the other disk sooner. Only
>>> if I know that there are no other copies to try, then I would use the
>>> normal read with all the retrying.
>>
>> I agree with Warner that what you are proposing is not correct. It
>> weakens the contract between the disk layer and the upper layers,
>> making it less clear who is responsible for retries and less clear
>> what "EIO" means. That contract is already weak due to poor design
>> decisions in VFS-BIO and GEOM, and Warner and I are working on a plan
>> to fix that.
>
> Well... I do realize now that there is some problem in this area, both
> you and Warner mentioned it. But knowing that it exists is not the same
> as knowing what it is :-)
> I understand that it could be rather complex and not easy to describe in
> a short email...

There are too many questions to ask, I will do my best to keep the
conversation logical. First, how do you propose to distinguish between
EIO due to a lengthy set of timeouts, vs EIO due to an immediate error
returned by the disk hardware? CAM has an extensive table-driven error
recovery protocol whose purpose is to decide whether or not to do retries
based on hardware state information that is not made available to the
upper layers. Do you have a test case that demonstrates the problem that
you're trying to solve? Maybe the error recovery table is wrong and
you're encountering a case that should not be retried. If that's what's
going on, we should fix CAM instead of inventing a new work-around.

Second, what about disk subsystems that do retries internally, out of the
control of the FreeBSD driver? This would include most hardware RAID
controllers. Should what you are proposing only work for a subset of the
kinds of storage systems that are available and in common use?

Third, let's say that you run out of alternate copies to try, and as you
stated originally, that will force you to retry the copies that had
returned EIO. How will you know when you can retry? How will you know
how many times you will retry? How will you know that a retry is even
possible? Should the retries be able to be canceled?

Why is overloading EIO so bad? brelse() will call bdirty() when a
BIO_WRITE command has failed with EIO. Calling bdirty() has the effect of
retrying the I/O. This disregards the fact that disk drivers only return
EIO when they've decided that the I/O cannot be retried. It has no
termination condition for the retries, and will endlessly retry I/O in
vain; I've seen this quite frequently. It also disregards the fact that
I/O marked as B_PAGING can't be retried in this fashion, and will trigger
a panic. Because we pretend that EIO can be retried, we are left with a
system that is very fragile when I/O actually does fail. Instead of
adding more special cases and blurred lines, I want to go back to
enforcing strict contracts between the layers and force the core parts of
the system to respect those contracts and handle errors properly, instead
of just retrying and hoping for the best.

> But then, this flag is optional, it's off by default and no one is
> forced to use it. If it's used only by ZFS, then it would not be
> horrible.
> Unless it makes things very hard for the infrastructure.
> But I am circling back to not knowing what problem(s) you and Warner are
> planning to fix.
Saying that a feature is optional means nothing; while consumers of the
API might be able to ignore it, the producers of the API cannot ignore
it. It is these producers who are sick right now and should be fixed,
instead of creating new ways to get even more sick.

Scott
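The failure mode Scott describes, where a write that the driver has already
declared unrecoverable is redirtied and retried forever, can be contrasted
with a bounded policy in a small model. The names below are hypothetical and
this is not FreeBSD's buffer-cache code:

    #include <errno.h>
    #include <stdio.h>

    #define MAX_WRITE_RETRIES 3

    /* Stand-in for the driver: it already did its own retries and gave up. */
    static int driver_write(const char *buf)
    {
        (void)buf;
        return EIO;              /* permanent failure, resubmission won't help */
    }

    /* Flush one dirty buffer; redirtying forever would never terminate here. */
    static int flush_buffer(const char *buf)
    {
        for (int attempt = 0; attempt <= MAX_WRITE_RETRIES; attempt++) {
            if (driver_write(buf) == 0)
                return 0;
            fprintf(stderr, "write attempt %d failed\n", attempt + 1);
        }
        return EIO;              /* surface the error so the FS can react */
    }

    int main(void)
    {
        if (flush_buffer("payload") != 0)
            fprintf(stderr, "giving up: data could not be written\n");
        return 0;
    }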
From owner-freebsd-geom@freebsd.org Sat Nov 25 11:37:35 2017
From: "Poul-Henning Kamp"
Date: Sat, 25 Nov 2017 11:37:10 +0000
To: Scott Long
Cc: Andriy Gapon, FreeBSD FS, Warner Losh, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

--------
Scott Long writes:

> Why is overloading EIO so bad? brelse() will call bdirty() when a
> BIO_WRITE command has failed with EIO. Calling bdirty() has the effect
> of retrying the I/O. This disregards the fact that disk drivers only
> return EIO when they've decided that the I/O cannot be retried. It has
> no termination condition for the retries, and will endlessly retry I/O
> in vain; I've seen this quite frequently.

The really annoying thing about this particular class of errors is that
if we propagated them up to the filesystems, very often things could be
relocated to different blocks and we would avoid the unnecessary
filesystem corruption.

The real fundamental deficiency is that we do not have a way to say "give
up if this bio cannot be completed in X time", which is what people
actually want.

That is surprisingly hard to provide, there are far too many corner-cases
for me to enumerate them all, but let me just give one example:

Imagine you issue a deadlined write to a RAID5 thing. Three component
writes happen smoothly, but the last two fail the deadline, with no way
to predict how long it will take before they complete or fail.

* Does the bio write transaction fail ?
* Does the bio write transaction time out ?
* Do you attempt to complete the write to the RAID5 ?
* Where do you store a copy of the data if you do ?
* What happens next time a read happens on this bio's extent ?

Then for an encore, imagine it was a read bio: Three DMAs go smoothly,
two are outstanding and you don't know if/when they will complete/fail.

* If you fail or time out the bio, how do you "taint" the space being
  read into while the two remaining DMAs are outstanding ?
* What if that space is mapped into userland ?
* What if that space is being executed ?
* What if one of the two outstanding DMAs later returns garbage ?

My conclusion back when I did GEOM, was that the only way to do something
like this sanely, is to have a special GEOM do it for you, which always
allocates a temp-space:

	allocate temp buffer
	if (write)
		copy write data to temp buffer
	issue bio downwards on temp buffer
	if timeout
		park temp buffer until biodone
		return(timeout)
	if (read)
		copy temp buffer to read space
	return (ok/error)

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
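The temp-buffer scheme above can be made a little more concrete with a
user-space model: the "device" only ever writes into a privately owned
buffer, so a request that misses its deadline can be abandoned (parked) and
reclaimed by the completion path later, and the caller's memory is never
exposed to a late DMA. This is an illustrative sketch under those
assumptions, using pthreads, not a GEOM class:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define BUFSZ 512

    struct temp_io {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        int             done;        /* completion path finished */
        int             abandoned;   /* caller gave up; completion frees us */
        char            data[BUFSZ]; /* private temp buffer, never the caller's */
    };

    /* Models the device: completes the read into the temp buffer after a delay. */
    static void *device_read(void *arg)
    {
        struct temp_io *io = arg;

        usleep(200 * 1000);                  /* "slow" medium: 200 ms */
        memset(io->data, 'A', BUFSZ);        /* late DMA lands in temp space only */

        pthread_mutex_lock(&io->mtx);
        io->done = 1;
        int freeit = io->abandoned;          /* did the caller already time out? */
        pthread_cond_signal(&io->cv);
        pthread_mutex_unlock(&io->mtx);

        if (freeit)
            free(io);                        /* parked buffer reclaimed here */
        return NULL;
    }

    /* Read with a deadline; on timeout the caller's buffer is left untouched. */
    static int deadline_read(char *caller_buf, long deadline_ms)
    {
        struct temp_io *io = calloc(1, sizeof(*io));
        pthread_mutex_init(&io->mtx, NULL);
        pthread_cond_init(&io->cv, NULL);

        pthread_t t;
        pthread_create(&t, NULL, device_read, io);
        pthread_detach(t);

        struct timespec until;
        clock_gettime(CLOCK_REALTIME, &until);
        until.tv_sec  += deadline_ms / 1000;
        until.tv_nsec += (deadline_ms % 1000) * 1000000L;
        if (until.tv_nsec >= 1000000000L) {
            until.tv_sec++;
            until.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&io->mtx);
        while (!io->done) {
            if (pthread_cond_timedwait(&io->cv, &io->mtx, &until) != 0)
                break;                       /* deadline expired */
        }
        if (!io->done) {
            io->abandoned = 1;               /* park: completion path will free */
            pthread_mutex_unlock(&io->mtx);
            return -1;                       /* "timeout" back to the consumer */
        }
        pthread_mutex_unlock(&io->mtx);
        memcpy(caller_buf, io->data, BUFSZ); /* only now touch the caller buffer */
        free(io);
        return 0;
    }

    int main(void)
    {
        char buf[BUFSZ] = { 0 };
        printf("50 ms deadline: %s\n", deadline_read(buf, 50) ? "timeout" : "ok");
        sleep(1);                            /* let the parked I/O drain */
        return 0;
    }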
From owner-freebsd-geom@freebsd.org Sat Nov 25 16:25:06 2017
From: Warner Losh
Date: Sat, 25 Nov 2017 09:25:04 -0700
To: Andriy Gapon
Cc: Scott Long, FreeBSD FS, freebsd-geom@freebsd.org
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon wrote:
> [...]
> But then, this flag is optional, it's off by default and no one is
> forced to use it. If it's used only by ZFS, then it would not be
> horrible.

Except that it isn't the same flag as what Solaris has (its B_FAILFAST
does something different: it isn't about limiting retries but about
failing ALL the queued I/O for a unit, not just trying one retry), and
the problems that it solves are quite rare. And if you return a different
errno, then the EIO contract is still fulfilled.

> Unless it makes things very hard for the infrastructure.
> But I am circling back to not knowing what problem(s) you and Warner are
> planning to fix.

The middle layers of the I/O system are a bit fragile in the face of I/O
errors. We're fixing that. Of course, you still haven't articulated why
this approach would be better, nor shown any numbers as to how it makes
things better.

Warner
AGs4zMaOq0NE2XAnPb2aSMAenvQYIra9r9I9OqDGjqHPMAzZj/hOPvKqt8rkqyfWTC0aR1H3wfmt/f6FieY5jmzAZpU= X-Received: by 10.107.30.81 with SMTP id e78mr22143577ioe.130.1511627790118; Sat, 25 Nov 2017 08:36:30 -0800 (PST) MIME-Version: 1.0 Sender: wlosh@bsdimp.com Received: by 10.79.108.204 with HTTP; Sat, 25 Nov 2017 08:36:29 -0800 (PST) X-Originating-IP: [2603:300b:6:5100:9579:bb73:7b7f:aadd] In-Reply-To: References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> From: Warner Losh Date: Sat, 25 Nov 2017 09:36:29 -0700 X-Google-Sender-Auth: cPV8WY30lXU33pTYiZKqTUQRXtA Message-ID: Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Andriy Gapon Cc: FreeBSD FS , freebsd-geom@freebsd.org, Scott Long Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.25 X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 16:36:31 -0000 On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon wrote: > On 24/11/2017 18:33, Warner Losh wrote: > > > > > > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon > > wrote: > > > > On 24/11/2017 15:08, Warner Losh wrote: > > > > > > > > > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > > > >> wrote: > > > > > > > > > https://reviews.freebsd.org/D13224 > > D13224 > > > > > > > > > Anyone interested is welcome to join the review. > > > > > > > > > I think it's a really bad idea. It introduces a > 'one-size-fits-all' notion of > > > QoS that seems misguided. It conflates a shorter timeout with > don't retry. And > > > why is retrying bad? It seems more a notion of 'fail fast' or so > other concept. > > > There's so many other ways you'd want to use it. And it uses the > same return > > > code (EIO) to mean something new. It's generally meant 'The lower > layers have > > > retried this, and it failed, do not submit it again as it will not > succeed' with > > > 'I gave it a half-assed attempt, and that failed, but resubmission > might work'. > > > This breaks a number of assumptions in the BUF/BIO layer as well > as parts of CAM > > > even more than they are broken now. > > > > > > So let's step back a bit: what problem is it trying to solve? > > > > A simple example. I have a mirror, I issue a read to one of its > members. Let's > > assume there is some trouble with that particular block on that > particular disk. > > The disk may spend a lot of time trying to read it and would still > fail. With > > the current defaults I would wait 5x that time to finally get the > error back. > > Then I go to another mirror member and get my data from there. > > IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first > read, get > > the error back sooner and try the other disk sooner. Only if I know > that there > > are no other copies to try, then I would use the normal read with > all the > > retrying. > > > > > > It sounds like you are optimizing the wrong thing and taking an overly > > simplistic view of quality of service. > > First, failing blocks on a disk is fairly rare. Do you really want to > optimize > > for that case? > > If it can be done without any harm to the sunny day scenario, then why not? > I think that 'robustness' is the word here, not 'optimization'. I fail to see how it is a robustness issue. You've not made that case. 
You want the I/O to fail fast so you can give another disk a shot sooner. That's optimization. > Second, you're really saying 'If you can't read it fast, fail" since we > only > > control the software side of read retry. > > Am I? > That's not what I wanted to say, really. I just wanted to say, if this I/O > fails, don't retry it, leave it to me. > This is very simple, simplistic as you say, but I like simple. Right. Simple doesn't make it right. In fact, simple often makes it wrong. We have big issues with the nvd device today because it's mindlessly queues all the trim requests to the NVMe device w/o collapsing them, resulting in horrible performance. > There's new op codes being proposed > > that say 'read or fail within Xms' which is really what you want: if > it's taking > > too long on disk A you want to move to disk B. The notion here was we'd > return > > EAGAIN (or some other error) if it failed after Xms, and maybe do some > emulation > > in software for drives that don't support this. You'd tweak this number > to > > control performance. You're likely to get a much bigger performance win > all the > > time by scheduling I/O to drives that have the best recent latency. > > ZFS already does some latency based decisions. > The things that you describe are very interesting, but they are for the > future. > > > Third, do you have numbers that show this is actually a win? > > I do not have any numbers right now. > What kind of numbers would you like? What kind of scenarios? The usual kind. How is latency for I/O improved when you have a disk with a few failing sectors that take a long time to read (which isn't a given: some sectors fail fast). What happens when you have a failed disk? etc. How does this compare with the current system. Basically, how do you know this will really make things better and isn't some kind of 'feel good' thing about 'doing something clever' about the problem that may actually make things worse. > This is a terrible > > thing from an architectural view. > > You have said this several times, but unfortunately you haven't explained > it yet. I have explained it. You weren't listening. 1. It breaks the EIO contract that's currently in place. 2. It presumes to know what kind of retries should be done at the upper layers where today we have a system that's more black and white. You don't know the same info the low layers have to know whether to try another drive, or just retry this one. 3. It assumes that retries are the source of latency in the system. they aren't necessarily. 4. It assumes retries are necessarily slow: they may be, they might not be. All depends on the drive (SSDs repeated I/O are often faster than actual I/O). 5. It's just one bit when you really need more complex nuances to get good QoE out of the I/O system. Retries is an incidental detail that's not that important, while latency is what you care most about minimizing. You wouldn't care if I tried to read the data 20 times if it got the result faster than going to a different drive. 6. It's putting the wrong kind of specific hints into the mix. > Absent numbers that show it's a big win, I'm > > very hesitant to say OK. > > > > Forth, there's a large number of places in the stack today that need to > > communicate their I/O is more urgent, and we don't have any good way to > > communicate even that simple concept down the stack. > > That's unfortunately, but my proposal has quite little to do with I/O > scheduling, priorities, etc. Except it does. 
It dictates error recovery policy which is I/O scheduling. > Finally, the only places that ZFS uses the TRYHARDER flag are for things > like > > the super block if I'm reading the code right. It doesn't do it for > normal I/O. > > Right. But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in > the > same way as ZIO_FLAG_TRYHARD. > > > There's no code to cope with what would happen if all the copies of a > block > > couldn't be read with the NORETRY flag. One of them might contain the > data. > > ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above. > Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue when an I/O fails' It doesn't control retries at all in Solaris. It's a different concept entirely, and one badly thought out. Warner From owner-freebsd-geom@freebsd.org Sat Nov 25 16:58:38 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 49CC2DEADA2; Sat, 25 Nov 2017 16:58:38 +0000 (UTC) (envelope-from agapon@gmail.com) Received: from mail-lf0-f43.google.com (mail-lf0-f43.google.com [209.85.215.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id EF135662E9; Sat, 25 Nov 2017 16:58:37 +0000 (UTC) (envelope-from agapon@gmail.com) Received: by mail-lf0-f43.google.com with SMTP id g35so28372631lfi.13; Sat, 25 Nov 2017 08:58:37 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=sNmBbqSvUdx+P0oBQz+HZ0eKlahQ3p0XcaiUMbppq94=; b=ZbYvg6jV7ktmRiS8wYCaHCHERogGTOY8Npky+qYT3+jp5cvvlWMDTohvMf13T44Fdv k856mLl06msqEevycDCZqvaFy0z5P3D6LMuQ2Rn7DVPWeGsye+F4EWW6zh3Y7gxBsger JFKzBWv/XiRArbdOYW5TR8jZK1Lx1D5qXkXR76XQ58heVIThtOfG7cUcC5zx/5TQeIwQ cp4kxE1gfRKxSKmH/RSi+9hnLBUtUa/xjZEWYEwIFnLTo7pv+BDKliFtJ1YR/B+RCb9j dzz236S/DUb0Eq3E4iKj0A33npuswtiw1G1Gs4F+6Nd3eCdYW2xuzSt1y8Q3X52P2V5P gf6w== X-Gm-Message-State: AJaThX4KWIf7Tpc0YteQA1eyDsSBQ5luze6tAgx0c7ZK5PjfjH7VPL1F Ad3nYuZRatRAfec5FodPk03zwUJoS3g= X-Google-Smtp-Source: AGs4zMaPazeEKorA20LgfmuxaRUD4LZnRTIi/eMwruMrxb9Wv325OBtJQl3mgpdxVZyMEiz1uxJGig== X-Received: by 10.25.20.77 with SMTP id k74mr9606078lfi.80.1511629110179; Sat, 25 Nov 2017 08:58:30 -0800 (PST) Received: from [192.168.0.88] (east.meadow.volia.net. 
[93.72.151.96]) by smtp.googlemail.com with ESMTPSA id v12sm5027560ljd.15.2017.11.25.08.58.28 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 25 Nov 2017 08:58:29 -0800 (PST) Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Warner Losh Cc: Scott Long , FreeBSD FS , freebsd-geom@freebsd.org References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> From: Andriy Gapon Message-ID: <27c9395f-5b3c-a062-3aee-de591770af0b@FreeBSD.org> Date: Sat, 25 Nov 2017 18:58:27 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 16:58:38 -0000 On 25/11/2017 18:25, Warner Losh wrote: > > > On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon > wrote: > > On 24/11/2017 16:57, Scott Long wrote: > > > > > >> On Nov 24, 2017, at 6:34 AM, Andriy Gapon wrote: > >> > >> On 24/11/2017 15:08, Warner Losh wrote: > >>> > >>> > >>> On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > >>> >> wrote: > >>> > >>> > >>>    https://reviews.freebsd.org/D13224 > > > >>> > >>>    Anyone interested is welcome to join the review. > >>> > >>> > >>> I think it's a really bad idea. It introduces a 'one-size-fits-all' > notion of > >>> QoS that seems misguided. It conflates a shorter timeout with don't > retry. And > >>> why is retrying bad? It seems more a notion of 'fail fast' or so other > concept. > >>> There's so many other ways you'd want to use it. And it uses the same return > >>> code (EIO) to mean something new. It's generally meant 'The lower layers > have > >>> retried this, and it failed, do not submit it again as it will not > succeed' with > >>> 'I gave it a half-assed attempt, and that failed, but resubmission might > work'. > >>> This breaks a number of assumptions in the BUF/BIO layer as well as > parts of CAM > >>> even more than they are broken now. > >>> > >>> So let's step back a bit: what problem is it trying to solve? > >> > >> A simple example.  I have a mirror, I issue a read to one of its > members.  Let's > >> assume there is some trouble with that particular block on that > particular disk. > >> The disk may spend a lot of time trying to read it and would still fail.  > With > >> the current defaults I would wait 5x that time to finally get the error back. > >> Then I go to another mirror member and get my data from there. > > > > There are many RAID stacks that already solve this problem by having a policy > > of always reading all disk members for every transaction, and throwing > away the > > sub-transactions that arrive late.  It’s not a policy that is always > desired, but it > > serves a useful purpose for low-latency needs. > > That's another possible and useful strategy. > > >> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first read, get > >> the error back sooner and try the other disk sooner.  Only if I know that there > >> are no other copies to try, then I would use the normal read with all the retrying. > >> > > > > I agree with Warner that what you are proposing is not correct.  
It weakens the > > contract between the disk layer and the upper layers, making it less clear who is > > responsible for retries and less clear what “EIO” means.  That contract is already > > weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I > > are working on a plan to fix that. > > Well...  I do realize now that there is some problem in this area, both you and > Warner mentioned it.  But knowing that it exists is not the same as knowing what > it is :-) > I understand that it could be rather complex and not easy to describe in a short > email... > > But then, this flag is optional, it's off by default and no one is forced to > used it.  If it's used only by ZFS, then it would not be horrible. > > > Except that it isn't the same flag as what Solaris has (its B_FAILFAST does > something different: it isn't about limiting retries but about failing ALL the > queued I/O for a unit, not just trying one retry), and the problems that it > solves are quite rare. And if you return a different errno, then the EIO > contract is still fulfilled.  Yes, it isn't the same. I think that illumos flag does even more. > Unless it makes things very hard for the infrastructure. > But I am circling back to not knowing what problem(s) you and Warner are > planning to fix. > > > The middle layers of the I/O system are a bit fragile in the face of I/O errors. > We're fixing that. What are the middle layers? > Of course, you still haven't articulated why this approach would be better Better than what? > nor > show any numbers as to how it makes things better. By now, I have. See my reply to Scott's email. -- Andriy Gapon From owner-freebsd-geom@freebsd.org Sat Nov 25 17:36:31 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9894ADEBB29; Sat, 25 Nov 2017 17:36:31 +0000 (UTC) (envelope-from agapon@gmail.com) Received: from mail-lf0-f45.google.com (mail-lf0-f45.google.com [209.85.215.45]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 173D96767F; Sat, 25 Nov 2017 17:36:30 +0000 (UTC) (envelope-from agapon@gmail.com) Received: by mail-lf0-f45.google.com with SMTP id k66so28461669lfg.3; Sat, 25 Nov 2017 09:36:30 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=iOl1D78pVkm1p2WO2iSswIdicfhFfzoRB2Bc9x8J0XA=; b=FwAhgIqykPRVE9Q778D+OPfmyn95E5zd8SIsartEGdWg1Ssj9VpFs/Gvcwq5N5rI9H 8jCfpIwC2dfgsmCoT7YdNA79XbZLsWf5M6TxrkJ5wTl7EiCNWYVFQHmxM/tN758mxqGU CXD6NL6lpvPFgTIjKy8/3ruVgDirqO4wOwg+LFRlaXftY4mRdtYkpTaTZGIuE3sk1fRS tMSOrAqrkxaNCqFBIDnxdKwu19Mqw2DTfK9+fcU3dNtKwo/tyRcsZVseXfphZ/t0lpFo uTO7GWHARKCOSkjmUCmBoJwIxNjPyzyNYRU+oq2OoJexdlVuqRKjQrMZexOnAD8HI3ZU Aplg== X-Gm-Message-State: AJaThX4VvLGCfaJtIBGAuRmop0Jb1Xk7CKTLNuZkpUJ67LBcdxotKHfj 520NpS6rN+tiPce5ICUf/OhRAutEV/c= X-Google-Smtp-Source: AGs4zMbBpiuwlbgmOFur1itv4JJ+4S3jA/+BP4yojPXM0feB1c+MAnucw6G9lWk0Z6UcVuGQhWCJQQ== X-Received: by 10.46.84.86 with SMTP id y22mr4011842ljd.89.1511631382512; Sat, 25 Nov 2017 09:36:22 -0800 (PST) Received: from [192.168.0.88] (east.meadow.volia.net. 
[93.72.151.96]) by smtp.googlemail.com with ESMTPSA id f66sm1845239lfl.72.2017.11.25.09.36.20 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 25 Nov 2017 09:36:20 -0800 (PST) Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Scott Long Cc: Warner Losh , FreeBSD FS , freebsd-geom@freebsd.org References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> From: Andriy Gapon Message-ID: <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org> Date: Sat, 25 Nov 2017 19:36:19 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:36:31 -0000 On 25/11/2017 12:54, Scott Long wrote: > > >> On Nov 24, 2017, at 10:17 AM, Andriy Gapon wrote: >> >> >>>> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read, get >>>> the error back sooner and try the other disk sooner. Only if I know that there >>>> are no other copies to try, then I would use the normal read with all the retrying. >>>> >>> >>> I agree with Warner that what you are proposing is not correct. It weakens the >>> contract between the disk layer and the upper layers, making it less clear who is >>> responsible for retries and less clear what “EIO” means. That contract is already >>> weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I >>> are working on a plan to fix that. >> >> Well... I do realize now that there is some problem in this area, both you and >> Warner mentioned it. But knowing that it exists is not the same as knowing what >> it is :-) >> I understand that it could be rather complex and not easy to describe in a short >> email… >> > > There are too many questions to ask, I will do my best to keep the conversation > logical. First, how do you propose to distinguish between EIO due to a lengthy > set of timeouts, vs EIO due to an immediate error returned by the disk hardware? At what layer / component? If I am the issuer of the request then I know how I issued that request and what kind of request it was. If I am an intermediate layer, then what do I care. > CAM has an extensive table-driven error recovery protocol who’s purpose is to > decide whether or not to do retries based on hardware state information that is > not made available to the upper layers. Do you have a test case that demonstrates > the problem that you’re trying to solve? Maybe the error recovery table is wrong > and you’re encountering a case that should not be retried. If that’s what’s going on, > we should fix CAM instead of inventing a new work-around. Let's assume that I am talking about the case of not being able to read an HDD sector that is gone bad. Here is a real world example: Jun 16 10:40:18 trant kernel: ahcich0: NCQ error, slot = 20, port = -1 Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. 
ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): Retrying command Jun 16 10:40:20 trant kernel: ahcich0: NCQ error, slot = 22, port = -1 Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:20 trant kernel: (ada0:ahcich0:0:0:0): Retrying command Jun 16 10:40:22 trant kernel: ahcich0: NCQ error, slot = 24, port = -1 Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:22 trant kernel: (ada0:ahcich0:0:0:0): Retrying command Jun 16 10:40:25 trant kernel: ahcich0: NCQ error, slot = 26, port = -1 Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:25 trant kernel: (ada0:ahcich0:0:0:0): Retrying command Jun 16 10:40:27 trant kernel: ahcich0: NCQ error, slot = 28, port = -1 Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00 Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC ) Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00 Jun 16 10:40:27 trant kernel: (ada0:ahcich0:0:0:0): Error 5, Retries exhausted I do not see anything wrong in what CAM / ahci /ata_da did here. They did what I would expect them to do. They tried very hard to get data that I told them I need. Timestamp of the first error is Jun 16 10:40:18. Timestamp of the last error is Jun 16 10:40:27. So, it took additional 9 seconds to finally produce EIO. That disk is a part of a ZFS mirror. If the request was failed after the first attempt, then ZFS would be able to get the data from a good disk much sooner. And don't take me wrong, I do NOT want CAM or GEOM to make that decision by itself. I want ZFS to be able to tell the lower layers when they should try as hard as they normally do and when they should report an I/O error as soon as it happens without any retries. > Second, what about disk subsystems that do retries internally, out of the control > of the FreeBSD driver? This would include most hardware RAID controllers. > Should what you are proposing only work for a subset of the kinds of storage > systems that are available and in common use? Yes. I do not worry about things that are beyond my control. 
Those subsystems would behave as they do now. So, nothing would get worse. > Third, let’s say that you run out of alternate copies to try, and as you stated > originally, that will force you to retry the copies that had returned EIO. How > will you know when you can retry? How will you know how many times you > will retry? How will you know that a retry is even possible? I am continuing to use ZFS as an example. It already has all the logic built in. If all vdev zio-s (requests to mirror members as an example) fail, then their parent zio (a logical read from the mirror) will be retried (by ZFS) and when ZFS retries it sets a special flag (ZIO_FLAG_IO_RETRY) on that zio and on its future child zio-s. Essentially, my answer is you have to program it correctly, there is no magic. > Should the retries > be able to be canceled? I think that this is an orthogonal question. I do not have any answer and I am not ready to discuss this at all. > Why is overloading EIO so bad? brelse() will call bdirty() when a BIO_WRITE > command has failed with EIO. Calling bdirty() has the effect of retrying the I/O. > This disregards the fact that disk drivers only return EIO when they’ve decided > that the I/O cannot be retried. It has no termination condition for the retries, and > will endlessly retry I/O in vain; I’ve seen this quite frequently. It also disregards > the fact that I/O marked as B_PAGING can’t be retried in this fashion, and will > trigger a panic. Because we pretend that EIO can be retried, we are left with > a system that is very fragile when I/O actually does fail. Instead of adding > more special cases and blurred lines, I want to go back to enforcing strict > contracts between the layers and force the core parts of the system to respect > those contracts and handle errors properly, instead of just retrying and > hoping for the best. So, I suggest that the buffer layer (all the b* functions) does not use the proposed flag. Any problems that exist in it should be resolved first. ZFS does not use that layer. >> But then, this flag is optional, it's off by default and no one is forced to >> used it. If it's used only by ZFS, then it would not be horrible. >> Unless it makes things very hard for the infrastructure. >> But I am circling back to not knowing what problem(s) you and Warner are >> planning to fix. >> > > Saying that a feature is optional means nothing; while consumers of the API > might be able to ignore it, the producers of the API cannot ignore it. It is > these producers who are sick right now and should be fixed, instead of > creating new ways to get even more sick. I completely agree. But which producers of the API do you mean specifically? So far, you mentioned only the consumer level problems with the b-layer. Having said all of the above, I must admit one thing. When I proposed BIO_NORETRY I had only the simplest GEOM topology in mind: ZFS -> [partition] -> disk. Now that I start to think about more complex topologies I am not really sure how the flag should be handled by geom-s with complex internal behavior. If that can be implemented reasonably and clearly, if the flag will create a big mess. E.g., things like gmirrors on top of gmirrors and so on. Maybe the flag, if it ever accepted, should never be propagated automatically. Only geom-s that are aware of it should propagate or request it. That should be safer. 
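As an illustration only of that last point (this is a sketch of the idea just described, not code from the D13224 review): a transformation geom could forward the hint strictly opt-in, so that an unaware class never passes it down by accident. The softc field and the BIO_NORETRY bit itself are assumptions here.

#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

struct g_example_softc {
    int sc_noretry_aware;    /* hypothetical opt-in knob */
};

/*
 * Sketch: clone the bio as usual, but copy the proposed BIO_NORETRY
 * hint only if this class has declared that it understands it.
 */
static void
g_example_start(struct bio *bp)
{
    struct g_example_softc *sc = bp->bio_to->geom->softc;
    struct bio *cbp;

    cbp = g_clone_bio(bp);
    if (cbp == NULL) {
        g_io_deliver(bp, ENOMEM);
        return;
    }
    if (sc->sc_noretry_aware)
        cbp->bio_flags |= bp->bio_flags & BIO_NORETRY;
    cbp->bio_done = g_std_done;
    g_io_request(cbp, LIST_FIRST(&bp->bio_to->geom->consumer));
}

With that convention a gmirror-on-gmirror stack degrades gracefully: any layer that does not know about the hint simply issues ordinary, fully retried I/O downstream.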
-- Andriy Gapon From owner-freebsd-geom@freebsd.org Sat Nov 25 17:38:22 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id F0910DEBBEB for ; Sat, 25 Nov 2017 17:38:22 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: from mail-io0-x230.google.com (mail-io0-x230.google.com [IPv6:2607:f8b0:4001:c06::230]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id BB2B6678DE for ; Sat, 25 Nov 2017 17:38:22 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: by mail-io0-x230.google.com with SMTP id d21so10099575ioe.7 for ; Sat, 25 Nov 2017 09:38:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bsdimp-com.20150623.gappssmtp.com; s=20150623; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=3iyg3cV4DFZHCSkAutindZTnEutdchf5thqOLOgiRDE=; b=sBaoG7R6xwKCU2ZqOf+FmtFLSmDy+IGCKl1rYjgYcQVPD8SX74GnlHXJjXAihqhzxB qYO1Nmusy0hdZBTlqRmtawdcgBucaQkU0Oi4tbIfNvRJu/E2quYUNAGNTvPTvGyMqzqF 0o22Xp9QXupfqN2fY2yqRsZsYLnvRU0yDtOe3kzqC5BNYKWHxtKVL84kJ6Cdtxc0IUaw b2zgXXFmsZ6mo28JPH6uzln2ufZRyYjV2DKL4MjiLRhTsiM1opHMTqoHN+68wIuDXMnW Wb6C83vs2nNGaHHQlk6LuX+AFdQqubx+QyjY6BScJsQNCu/TLG5x/xZaaZgF4FG3KN/4 Wzdg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=3iyg3cV4DFZHCSkAutindZTnEutdchf5thqOLOgiRDE=; b=ryfpG30In2aVWFgFUkX7qEAQVgVRczrsLeTiuU5P7HtgVDJRbocgA56to7OUhLJv1K eW/JnwkOyM/mOKwcDaYBw0ErtaLtpAscCZX4KZkrch9SUpwn9dG20KvFaZ6FOS+B30BO ng6z5stY7nK6/lII0DuUia4xb9vfhi6G8dyt6TN+3uiXBPY80P5ka0IP0Txv4oixuPMs w0khl9tFWgLQSwiQEOWkx1bLUGpO3lM0zKRS7GECb06fyz9JRn0iMEiAX95Cfau8gw/Y A66Fl5eO7oysEUAuKxORtJ2z1T5e5bRm5x9RBWTGTgNEkQRhZTnpIJ1/ERGtD1a91cvn pqlA== X-Gm-Message-State: AJaThX40n+fq+h+BUKF/IwQuHui65gbflv3E2orB4hcRFA7ngp22RYqu h/HbtFHDs2eAavYzodywlgciNqJehq0yoMYFO6j7rg== X-Google-Smtp-Source: AGs4zMZpYS3c6FiEoS1YalzMszvrxFrWG+Ow/PBeOV+/wjBX9x9Kdxe/x6Qd2CetfGxNr8hrU8xe+Vh7WeKktp0Fnq4= X-Received: by 10.107.104.18 with SMTP id d18mr33320248ioc.136.1511631501988; Sat, 25 Nov 2017 09:38:21 -0800 (PST) MIME-Version: 1.0 Sender: wlosh@bsdimp.com Received: by 10.79.108.204 with HTTP; Sat, 25 Nov 2017 09:38:21 -0800 (PST) X-Originating-IP: [2603:300b:6:5100:9579:bb73:7b7f:aadd] In-Reply-To: <27c9395f-5b3c-a062-3aee-de591770af0b@FreeBSD.org> References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <27c9395f-5b3c-a062-3aee-de591770af0b@FreeBSD.org> From: Warner Losh Date: Sat, 25 Nov 2017 10:38:21 -0700 X-Google-Sender-Auth: hRNqrbEqFn4VgCb2ByZQrwSf-oE Message-ID: Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Andriy Gapon Cc: Scott Long , FreeBSD FS , freebsd-geom@freebsd.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.25 X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:38:23 -0000 On Sat, Nov 25, 2017 at 9:58 AM, Andriy 
Gapon wrote: > On 25/11/2017 18:25, Warner Losh wrote: > > > > > > On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon > > wrote: > > > > On 24/11/2017 16:57, Scott Long wrote: > > > > > > > > >> On Nov 24, 2017, at 6:34 AM, Andriy Gapon > wrote: > > >> > > >> On 24/11/2017 15:08, Warner Losh wrote: > > >>> > > >>> > > >>> On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > > > >>> >> wrote: > > >>> > > >>> > > >>> https://reviews.freebsd.org/D13224 > > D13224 > > > > > >>> > > >>> Anyone interested is welcome to join the review. > > >>> > > >>> > > >>> I think it's a really bad idea. It introduces a > 'one-size-fits-all' > > notion of > > >>> QoS that seems misguided. It conflates a shorter timeout with > don't > > retry. And > > >>> why is retrying bad? It seems more a notion of 'fail fast' or so > other > > concept. > > >>> There's so many other ways you'd want to use it. And it uses the > same return > > >>> code (EIO) to mean something new. It's generally meant 'The > lower layers > > have > > >>> retried this, and it failed, do not submit it again as it will > not > > succeed' with > > >>> 'I gave it a half-assed attempt, and that failed, but > resubmission might > > work'. > > >>> This breaks a number of assumptions in the BUF/BIO layer as well > as > > parts of CAM > > >>> even more than they are broken now. > > >>> > > >>> So let's step back a bit: what problem is it trying to solve? > > >> > > >> A simple example. I have a mirror, I issue a read to one of its > > members. Let's > > >> assume there is some trouble with that particular block on that > > particular disk. > > >> The disk may spend a lot of time trying to read it and would > still fail. > > With > > >> the current defaults I would wait 5x that time to finally get the > error back. > > >> Then I go to another mirror member and get my data from there. > > > > > > There are many RAID stacks that already solve this problem by > having a policy > > > of always reading all disk members for every transaction, and > throwing > > away the > > > sub-transactions that arrive late. It’s not a policy that is > always > > desired, but it > > > serves a useful purpose for low-latency needs. > > > > That's another possible and useful strategy. > > > > >> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the > first read, get > > >> the error back sooner and try the other disk sooner. Only if I > know that there > > >> are no other copies to try, then I would use the normal read with > all the retrying. > > >> > > > > > > I agree with Warner that what you are proposing is not correct. > It weakens the > > > contract between the disk layer and the upper layers, making it > less clear who is > > > responsible for retries and less clear what “EIO” means. That > contract is already > > > weak due to poor design decisions in VFS-BIO and GEOM, and Warner > and I > > > are working on a plan to fix that. > > > > Well... I do realize now that there is some problem in this area, > both you and > > Warner mentioned it. But knowing that it exists is not the same as > knowing what > > it is :-) > > I understand that it could be rather complex and not easy to > describe in a short > > email... > > > > But then, this flag is optional, it's off by default and no one is > forced to > > used it. If it's used only by ZFS, then it would not be horrible.
> > > > > > Except that it isn't the same flag as what Solaris has (its B_FAILFAST > does > > something different: it isn't about limiting retries but about failing > ALL the > > queued I/O for a unit, not just trying one retry), and the problems that > it > > solves are quite rare. And if you return a different errno, then the EIO > > contract is still fulfilled. > > Yes, it isn't the same. > I think that illumos flag does even more. Since it isn't the same, and there's not other systems that do a similar thing, that ups the burden of proof that this is a good idea. > Unless it makes things very hard for the infrastructure. > > But I am circling back to not knowing what problem(s) you and Warner > are > > planning to fix. > > > > > > The middle layers of the I/O system are a bit fragile in the face of I/O > errors. > > We're fixing that. > > What are the middle layers? The buffer cache and lower layers of the UFS code is where the problems chiefly lie. > Of course, you still haven't articulated why this approach would be better > > Better than what? > > Well, anything? > > nor > > show any numbers as to how it makes things better. > > By now, I have. See my reply to Scott's email. I just checked my email, I've seen no such reply. I checked it before I replied. Maybe it's just delayed. Warner From owner-freebsd-geom@freebsd.org Sat Nov 25 17:41:03 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id EB5D7DEBD5E; Sat, 25 Nov 2017 17:41:03 +0000 (UTC) (envelope-from agapon@gmail.com) Received: from mail-lf0-f42.google.com (mail-lf0-f42.google.com [209.85.215.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 6A68067AB8; Sat, 25 Nov 2017 17:41:03 +0000 (UTC) (envelope-from agapon@gmail.com) Received: by mail-lf0-f42.google.com with SMTP id o41so28431705lfi.2; Sat, 25 Nov 2017 09:41:03 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=XPeLSjy+VQezLXaWyLqVGxRiCz8i9BQr86mV3Lj2YAY=; b=P5aLhmuYvRBS2YzstZB/VysjO8tDdJJbkA4wc5dEAqR37aBMeuaP2jmTHwBSSXeZoZ 0/2t++942PSBvM4SrNJidCkQSPuCaHdvGBF5xAkDIaPTmKg6msEv9PINiWbmNcYkLE3i vmXz5BwDSRT/pGxmmHsim4Zlm6/ifAOlBZAJRArWDncGHwtgtTrM/BjaFwrPg+9suF2q gUxNmXczOtc90V3bTXcDD6x8RCsqpVPSX2NIw8eolApVVfBsl/wF+oKJF8hpowlPprof E341dmmpia2l/OkYqRHuWxAPA/S3nnZCG8TfSC2fGc6RsaWZ8/3BeIyYuYb9VLIXjrs3 J0ug== X-Gm-Message-State: AJaThX5VQgIz99QNQimDK5dBVUgDFPyYvPs1xAHfwmF285C9KdPkF+Db 8WvrI088SaihgKke3mGbprI= X-Google-Smtp-Source: AGs4zMaKaiPdOiBetpyS/MEYxVNA+8qROI/0uCcumggNbeOONXIBzJLKs7oDi2qoq0qJD4SHl4FwIg== X-Received: by 10.46.99.211 with SMTP id s80mr11250056lje.7.1511631660987; Sat, 25 Nov 2017 09:41:00 -0800 (PST) Received: from [192.168.0.88] (east.meadow.volia.net.
[93.72.151.96]) by smtp.googlemail.com with ESMTPSA id m26sm5121410ljb.61.2017.11.25.09.40.59 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 25 Nov 2017 09:41:00 -0800 (PST) Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Warner Losh Cc: FreeBSD FS , freebsd-geom@freebsd.org, Scott Long References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> From: Andriy Gapon Message-ID: <9f23f97d-3614-e4d2-62fe-99723c5e3879@FreeBSD.org> Date: Sat, 25 Nov 2017 19:40:59 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:41:04 -0000 Before anything else, I would like to say that I got an impression that we speak from so different angles that we either don't understand each other's words or, even worse, misinterpret them. On 25/11/2017 18:36, Warner Losh wrote: > > > On Fri, Nov 24, 2017 at 10:20 AM, Andriy Gapon > wrote: > > On 24/11/2017 18:33, Warner Losh wrote: > > > > > > On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon > > >> wrote: > > > >     On 24/11/2017 15:08, Warner Losh wrote: > >     > > >     > > >     > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > > >     > >>> wrote: > >     > > >     > > >     >     https://reviews.freebsd.org/D13224 > >      > > >     >> > >     > > >     >     Anyone interested is welcome to join the review. > >     > > >     > > >     > I think it's a really bad idea. It introduces a 'one-size-fits-all' > notion of > >     > QoS that seems misguided. It conflates a shorter timeout with don't > retry. And > >     > why is retrying bad? It seems more a notion of 'fail fast' or so > other concept. > >     > There's so many other ways you'd want to use it. And it uses the > same return > >     > code (EIO) to mean something new. It's generally meant 'The lower > layers have > >     > retried this, and it failed, do not submit it again as it will not > succeed' with > >     > 'I gave it a half-assed attempt, and that failed, but resubmission > might work'. > >     > This breaks a number of assumptions in the BUF/BIO layer as well as > parts of CAM > >     > even more than they are broken now. > >     > > >     > So let's step back a bit: what problem is it trying to solve? > > > >     A simple example.  I have a mirror, I issue a read to one of its > members.  Let's > >     assume there is some trouble with that particular block on that > particular disk. > >      The disk may spend a lot of time trying to read it and would still > fail.  With > >     the current defaults I would wait 5x that time to finally get the > error back. > >     Then I go to another mirror member and get my data from there. > >     IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first > read, get > >     the error back sooner and try the other disk sooner.  Only if I know > that there > >     are no other copies to try, then I would use the normal read with all the > >     retrying. > > > > > > It sounds like you are optimizing the wrong thing and taking an overly > > simplistic view of quality of service. 
> > First, failing blocks on a disk is fairly rare. Do you really want to optimize > > for that case? > > If it can be done without any harm to the sunny day scenario, then why not? > I think that 'robustness' is the word here, not 'optimization'. > > > I fail to see how it is a robustness issue. You've not made that case. You want > the I/O to fail fast so you can give another disk a shot sooner. That's > optimization. Then you can call a protection against denial-of-service an optimization too. You want to do things faster, right? > > Second, you're really saying 'If you can't read it fast, fail" since we only > > control the software side of read retry. > > Am I? > That's not what I wanted to say, really.  I just wanted to say, if this I/O > fails, don't retry it, leave it to me. > This is very simple, simplistic as you say, but I like simple. > > > Right. Simple doesn't make it right. In fact, simple often makes it wrong. I agree. The same applies to complex well. Let's stop at this. > We > have big issues with the nvd device today because it's mindlessly queues all the > trim requests to the NVMe device w/o collapsing them, resulting in horrible > performance. > > > There's new op codes being proposed > > that say 'read or fail within Xms' which is really what you want: if it's taking > > too long on disk A you want to move to disk B. The notion here was we'd return > > EAGAIN (or some other error) if it failed after Xms, and maybe do some emulation > > in software for drives that don't support this. You'd tweak this number to > > control performance. You're likely to get a much bigger performance win all the > > time by scheduling I/O to drives that have the best recent latency. > > ZFS already does some latency based decisions. > The things that you describe are very interesting, but they are for the future. > > > Third, do you have numbers that show this is actually a win? > > I do not have any numbers right now. > What kind of numbers would you like?  What kind of scenarios? > > > The usual kind. How is latency for I/O improved when you have a disk with a few > failing sectors that take a long time to read (which isn't a given: some sectors > fail fast). Today I gave an example of how four retries added about 9 seconds of additional delay. I think that that is significant. > What happens when you have a failed disk? etc. How does this compare > with the current system. I haven't done such an experiment. I guess it depends on how exactly the disk fails. There is a big difference between a disk dropping a link and a disk turning into a black hole. > Basically, how do you know this will really make things better and isn't some > kind of 'feel good' thing about 'doing something clever' about the problem that > may actually make things worse. > > > This is a terrible > > thing from an architectural view. > > You have said this several times, but unfortunately you haven't explained it > yet. > > > I have explained it. You weren't listening. This is the first time I see the below list or anything like it. > 1. It breaks the EIO contract that's currently in place. This needs further explanation. > 2. It presumes to know what kind of retries should be done at the upper layers > where today we have a system that's more black and white. I don't understand this argument. If your upper level code does not know how to do retries, then it should not concern itself with that and should not use the flag. 
> You don't know the > same info the low layers have to know whether to try another drive, or just > retry this one. Eh? Either we have different definitions of upper and lower layers or I don't understand how lower layers (e.g. CAM) can know about another drive. > 3. It assumes that retries are the source of latency in the system. they aren't > necessarily. I am not assuming that at all for the general case. > 4. It assumes retries are necessarily slow: they may be, they might not be. All > depends on the drive (SSDs repeated I/O are often faster than actual I/O). Of course. But X plus epsilon is always greater than X. And we know than in many practical cases epsilon can be rather large. > 5. It's just one bit when you really need more complex nuances to get good QoE > out of the I/O system. Retries is an incidental detail that's not that > important, while latency is what you care most about minimizing. You wouldn't > care if I tried to read the data 20 times if it got the result faster than going > to a different drive. That's a good point. But then again, it's the upper layers that have a better chance of predicting this kind of thing. That is, if I know that my backup storage is extremely slow, then I will allow the fast primary storage do all retries it wants to do. It's not CAM nor scsi_da nor a specific SIM that can make those decisions. It's an issuer of the I/O request [or an intermediate geom that encapsulates that knowledge and effectively acts as an issuer of I/O-s to the lower geoms]. > 6. It's putting the wrong kind of specific hints into the mix. This needs further explanation. > > Absent numbers that show it's a big win, I'm > > very hesitant to say OK. > > > > Forth, there's a large number of places in the stack today that need to > > communicate their I/O is more urgent, and we don't have any good way to > > communicate even that simple concept down the stack. > > That's unfortunately, but my proposal has quite little to do with I/O > scheduling, priorities, etc. > > > Except it does. It dictates error recovery policy which is I/O scheduling. > > > Finally, the only places that ZFS uses the TRYHARDER flag are for things like > > the super block if I'm reading the code right. It doesn't do it for normal I/O. > > Right.  But for normal I/O there is ZIO_FLAG_IO_RETRY which is honored in the > same way as ZIO_FLAG_TRYHARD. > > > There's no code to cope with what would happen if all the copies of a block > > couldn't be read with the NORETRY flag. One of them might contain the data. > > ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above. > > > Except TRYHARD in ZFS means 'don't fail ****OTHER**** I/O in the queue when an > I/O fails' It doesn't control retries at all in Solaris. It's a different > concept entirely, and one badly thought out. I think that it does control retries. And it does even more. My understanding is that bio-s with B_FAILFAST can be failed immediately in the situation roughly equivalent to a CAM devq (or simq) being frozen. 
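To make the intended division of labour concrete, here is a rough sketch of the issuer-side policy discussed above. Every name in it is made up for illustration; it is neither the ZFS vdev_mirror code nor anything from D13224.

#include <sys/types.h>
#include <errno.h>

#define RD_FLAG_FAILFAST 0x01    /* hypothetical "don't try hard" hint */

struct mirror;                   /* opaque for this sketch */

/* Hypothetical helpers: read one copy with the given hint, count copies. */
int mirror_read_child(struct mirror *m, int child, off_t off, void *buf,
    size_t len, int flags);
int mirror_child_count(struct mirror *m);

/*
 * The issuer owns the retry policy: ask each copy for a quick answer
 * while other copies remain, and fall back to the slow, fully retried
 * read only after every cheap option has been exhausted.
 */
int
mirror_read(struct mirror *m, off_t off, void *buf, size_t len)
{
    int c, n, error;

    n = mirror_child_count(m);
    error = EIO;
    for (c = 0; c < n; c++) {
        error = mirror_read_child(m, c, off, buf, len, RD_FLAG_FAILFAST);
        if (error == 0)
            return (0);     /* a healthy copy answered quickly */
    }
    for (c = 0; c < n; c++) {
        error = mirror_read_child(m, c, off, buf, len, 0);
        if (error == 0)
            return (0);     /* the full retry effort paid off */
    }
    return (error);         /* definitive failure from the last copy */
}

The decision to retry stays with the code that actually knows whether other copies exist; the layers below only learn how hard they are being asked to try.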
-- Andriy Gapon From owner-freebsd-geom@freebsd.org Sat Nov 25 17:50:08 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DE8B2DEBFAC; Sat, 25 Nov 2017 17:50:08 +0000 (UTC) (envelope-from agapon@gmail.com) Received: from mail-lf0-f43.google.com (mail-lf0-f43.google.com [209.85.215.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 6AC8C67DE5; Sat, 25 Nov 2017 17:50:08 +0000 (UTC) (envelope-from agapon@gmail.com) Received: by mail-lf0-f43.google.com with SMTP id x68so28511429lff.0; Sat, 25 Nov 2017 09:50:08 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=5ApCUC5DjZ1viDCnCrzu033PbzMQh+JO2M+WEA6YFh8=; b=CquXPToyf3XEhqCcFIHTuvpNPrIFqKCBFokh83IxQ+oYN3aSuSdvXrSOHqZFFQ84vz vhu8Qjz8JZoSka/xW6NTMm5ECJKoABu1aaHXhgjJqgcp3rByR/DgtG30q8O9RX7msDqq g4IxSrPcWYCMoQMA0gs8OANCF/PfAxxCjaYiAdRk+Plp0zYLU7NLFwgTEnzRddWj9gtO t0iSe/1/no3htvTsYhNsjvOqt2a2gXCfAX8aKDwO/Nq7RU4KOjWEv2GXSbtlfAt3Rc0P 5nxbcVEZzaf3uI04FEwySbDCnxaOce4fW/LRQ++dZCVhVjtLXMckjLh2bYT4lMQwlqGk 8Ggw== X-Gm-Message-State: AJaThX7sPSpEO7PZLcrL+gIQxz0jcCZMwX4PaulyjEzVMvq9IhRng6Bd 2RqYuI2wnS8U4nnsCkpfjxWlvMbVagw= X-Google-Smtp-Source: AGs4zMZZPtEaOC5dWapQhfIZTZBuOiIG6iLS8nMAw2SiAkHolrWh3Xnzl0XQnyoZVX4HQjvhBQqODQ== X-Received: by 10.46.77.148 with SMTP id c20mr13420749ljd.156.1511632205845; Sat, 25 Nov 2017 09:50:05 -0800 (PST) Received: from [192.168.0.88] (east.meadow.volia.net. 
[93.72.151.96]) by smtp.googlemail.com with ESMTPSA id a9sm4184731lfg.12.2017.11.25.09.50.04 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 25 Nov 2017 09:50:05 -0800 (PST) Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Warner Losh Cc: Scott Long , FreeBSD FS , freebsd-geom@freebsd.org References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <27c9395f-5b3c-a062-3aee-de591770af0b@FreeBSD.org> From: Andriy Gapon Message-ID: Date: Sat, 25 Nov 2017 19:50:03 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:50:09 -0000 On 25/11/2017 19:38, Warner Losh wrote: > > > On Sat, Nov 25, 2017 at 9:58 AM, Andriy Gapon > wrote: > > On 25/11/2017 18:25, Warner Losh wrote: > > > > > > On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon > > >> wrote: > > > >     On 24/11/2017 16:57, Scott Long wrote: > >     > > >     > > >     >> On Nov 24, 2017, at 6:34 AM, Andriy Gapon wrote: > >     >> > >     >> On 24/11/2017 15:08, Warner Losh wrote: > >     >>> > >     >>> > >     >>> On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon > >     > > >     >>> >>> wrote: > >     >>> > >     >>> > >     >>>    https://reviews.freebsd.org/D13224 > >      > > >     >> > >     >>> > >     >>>    Anyone interested is welcome to join the review. > >     >>> > >     >>> > >     >>> I think it's a really bad idea. It introduces a 'one-size-fits-all' > >     notion of > >     >>> QoS that seems misguided. It conflates a shorter timeout with don't > >     retry. And > >     >>> why is retrying bad? It seems more a notion of 'fail fast' or so other > >     concept. > >     >>> There's so many other ways you'd want to use it. And it uses the > same return > >     >>> code (EIO) to mean something new. It's generally meant 'The lower > layers > >     have > >     >>> retried this, and it failed, do not submit it again as it will not > >     succeed' with > >     >>> 'I gave it a half-assed attempt, and that failed, but resubmission > might > >     work'. > >     >>> This breaks a number of assumptions in the BUF/BIO layer as well as > >     parts of CAM > >     >>> even more than they are broken now. > >     >>> > >     >>> So let's step back a bit: what problem is it trying to solve? > >     >> > >     >> A simple example.  I have a mirror, I issue a read to one of its > >     members.  Let's > >     >> assume there is some trouble with that particular block on that > >     particular disk. > >     >> The disk may spend a lot of time trying to read it and would still > fail.  > >     With > >     >> the current defaults I would wait 5x that time to finally get the > error back. > >     >> Then I go to another mirror member and get my data from there. > >     > > >     > There are many RAID stacks that already solve this problem by having > a policy > >     > of always reading all disk members for every transaction, and throwing > >     away the > >     > sub-transactions that arrive late.  
It’s not a policy that is always > >     desired, but it > >     > serves a useful purpose for low-latency needs. > > > >     That's another possible and useful strategy. > > > >     >> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first > read, get > >     >> the error back sooner and try the other disk sooner.  Only if I > know that there > >     >> are no other copies to try, then I would use the normal read with > all the retrying. > >     >> > >     > > >     > I agree with Warner that what you are proposing is not correct.  It > weakens the > >     > contract between the disk layer and the upper layers, making it less > clear who is > >     > responsible for retries and less clear what “EIO” means.  That > contract is already > >     > weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I > >     > are working on a plan to fix that. > > > >     Well...  I do realize now that there is some problem in this area, > both you and > >     Warner mentioned it.  But knowing that it exists is not the same as > knowing what > >     it is :-) > >     I understand that it could be rather complex and not easy to describe > in a short > >     email... > > > >     But then, this flag is optional, it's off by default and no one is > forced to > >     used it.  If it's used only by ZFS, then it would not be horrible. > > > > > > Except that it isn't the same flag as what Solaris has (its B_FAILFAST does > > something different: it isn't about limiting retries but about failing ALL the > > queued I/O for a unit, not just trying one retry), and the problems that it > > solves are quite rare. And if you return a different errno, then the EIO > > contract is still fulfilled.  > > Yes, it isn't the same. > I think that illumos flag does even more. > > > Since it isn't the same, and there's not other systems that do a similar thing, > that ups the burden of proof that this is a good idea. > > >     Unless it makes things very hard for the infrastructure. > >     But I am circling back to not knowing what problem(s) you and Warner are > >     planning to fix. > > > > > > The middle layers of the I/O system are a bit fragile in the face of I/O errors. > > We're fixing that. > > What are the middle layers? > > > The buffer cache and lower layers of the UFS code is where the problems chiefly lie. Those are the upper layers from the point of view of GEOM and things below it. If they don't set that flag on the bio, then it is not going to magically appear in their I/O path. > > Of course, you still haven't articulated why this approach would be better > > Better than what? > > > Well, anything? I think that I have described how it is better than what we have now, which I think is a part of 'anything'. > > > nor > > show any numbers as to how it makes things better. > > By now, I have.  See my reply to Scott's email. > > > I just checked my email, I've seen no such reply. I checked it before I > replied.  Maybe it's just delayed. Sorry, my mistake. I thought that I sent that email in the morning, but I didn't. I have just sent it. Apologies again. 
-- Andriy Gapon From owner-freebsd-geom@freebsd.org Sat Nov 25 17:57:40 2017 Return-Path: Delivered-To: freebsd-geom@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3B7CEDEC22E for ; Sat, 25 Nov 2017 17:57:40 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: from mail-it0-x22c.google.com (mail-it0-x22c.google.com [IPv6:2607:f8b0:4001:c0b::22c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id F16BF681C7 for ; Sat, 25 Nov 2017 17:57:39 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: by mail-it0-x22c.google.com with SMTP id x13so16485614iti.4 for ; Sat, 25 Nov 2017 09:57:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bsdimp-com.20150623.gappssmtp.com; s=20150623; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=sDDg+58CGPKx0+tTc1ED8m6HEe5BrIsIIQB/kA3ChwU=; b=qPFfmzjRNBbP3DmUvRWgC5jMjFTpdNySHNEs8H/xdoUucHPhTJiEXi40KmJeeH77Ey sY1c6XJ+8SS4Z0oqTJQ9XQn1y9K6v09yaqrPtqKmfWVtforMcMiq7TYqaT9DLUuNZA2A WiGcKFBDw7GXRV9Z3D6RsIJ4XMyMU7TZNOuYra1Q8c9dnWl5m3DqvlaeK1OeayjCg8jB wIXzdVHAxvnLU05sVIu+j+ECf9iI8/rotfDVUVQOtJmW77SwHCZBIYDUAwBkj7yosMCR 56TfIrZhYVrxHbWJQFjg/YMXIHP/MXKrfEAcRxZ78HvVBH5c6Vqc1OvzZKJVarY8JVdX zAsw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=sDDg+58CGPKx0+tTc1ED8m6HEe5BrIsIIQB/kA3ChwU=; b=Wkas7BByVOu9fOxwWH80k1VVi0R6Af2UvuneONPYVeP9GQJ5z2Ycy1blOMhUZP+Nsy 8YNHB2vnHS6LgkoE7ynSodZSLqQkeLwQ2gRJExfzJw8lGDIRB32GxUD879P7Q6XDKK5B oMUFNgr6DPFRQa0VuuQJY+BF1DWC8vgriOGLXLjxtJzC9ZgXeO7M9CLkFgJHF+FAjz38 wgMEl+38RE9jqCu4lyJP93YT5thDSigJNFQVsjzXaOdKp62OdyP/RKXZTYt2ZTW+6vDj RWMAMO3PzQK1rS1364V9WGmQ1wL4ORhQOQKLG2WQC9O3Y3qmJbScYZVzoK8IcKO2dPVU gzBQ== X-Gm-Message-State: AJaThX691pSDEbiNbam1lowG1waMomVGhTosBuZaQ9W8rAScs8utnECc aGTe+R6r5dTOo4hhyhANeDNW/HpuZv04Ad5RMYSUOA== X-Google-Smtp-Source: AGs4zMYazlz2WrPFSx5PeIGxX4EhK1qMgTHVpcFmS/KlPnGtbEi0uWZuo3DIO5vwYgUNvIHh2OtB3xdssoaEXn1dta8= X-Received: by 10.36.164.13 with SMTP id z13mr22202600ite.115.1511632659185; Sat, 25 Nov 2017 09:57:39 -0800 (PST) MIME-Version: 1.0 Sender: wlosh@bsdimp.com Received: by 10.79.108.204 with HTTP; Sat, 25 Nov 2017 09:57:38 -0800 (PST) X-Originating-IP: [2603:300b:6:5100:9579:bb73:7b7f:aadd] In-Reply-To: <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org> References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org> <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org> <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org> <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org> From: Warner Losh Date: Sat, 25 Nov 2017 10:57:38 -0700 X-Google-Sender-Auth: 6c28FYqH0hmmAfPKJFmQ73oXmro Message-ID: Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom To: Andriy Gapon Cc: Scott Long , FreeBSD FS , freebsd-geom@freebsd.org Content-Type: text/plain; charset="UTF-8" X-Content-Filtered-By: Mailman/MimeDel 2.1.25 X-BeenThere: freebsd-geom@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: GEOM-specific discussions and implementations List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 25 Nov 2017 17:57:40 -0000 On Sat, Nov 25, 2017 at 10:36 AM, Andriy Gapon wrote: > > Timestamp of the first 
> error is Jun 16 10:40:18.
> Timestamp of the last error is Jun 16 10:40:27.
> So, it took an additional 9 seconds to finally produce EIO.
> That disk is a part of a ZFS mirror.  If the request was failed after the first
> attempt, then ZFS would be able to get the data from a good disk much sooner.
>
> And don't take me wrong, I do NOT want CAM or GEOM to make that decision by
> itself.  I want ZFS to be able to tell the lower layers when they should try as
> hard as they normally do and when they should report an I/O error as soon as it
> happens without any retries.

Let's walk through this. You see that it takes a long time to fail an I/O.
Perfectly reasonable observation. There are two reasons for this. One is that
the disks take a while to make an attempt to get the data. The second is that
the system has a global policy that's biased towards 'recover the data' over
'fail fast'. These can be fixed by reducing the timeouts, or lowering the
read-retry count for a given drive, or globally as a policy decision made by the
system administrator.

It may be perfectly reasonable to ask the lower layers to 'fail fast' and have
either a hard or a soft deadline on the I/O for a subset of I/O. A hard deadline
would return ETIMEDOUT or something when it's passed and cancel the I/O. This
gives better determinism in the system, but some systems can't cancel just one
I/O (like SATA drives), so we have to flush the whole queue. If we get a lot of
these, performance suffers. However, for some classes of drives, you know that
if it doesn't succeed in 1s after you submit it to the drive, it's unlikely to
complete successfully and it's worth the performance hit on a drive that's
already acting up.

You could have a soft timeout, which says 'don't take any additional action
after X time has elapsed and you get word about this I/O'. This is similar to
the hard timeout, but just stops retrying after the deadline has passed. This
scenario is better for the other users of the drive, assuming that the
read-recovery operations aren't starving them. It's also easier to implement,
but has worse worst-case performance characteristics.

You aren't asking to limit retries. You're really asking the I/O subsystem to
limit, where it can, the amount of time spent on an I/O so you can try another
one. Your means of doing this is to tell it not to retry. That's the wrong
means. It shouldn't be listed in the API that it's a 'NO RETRY' request. It
should be a QoS request flag: fail fast.

Part of why I'm being so difficult is that you don't understand this and are
proposing a horrible API. It should have a different name. The other reason is
that I absolutely do not want to overload EIO. You must return a different error
back up the stack. You've shown no interest in this in the past, which is also a
needless argument. We've given good reasons, and you've pooh-poohed them with
bad arguments.

Also, this isn't the data I asked for. I know things can fail slowly. I was
asking for how it would improve systems running like this. As in "I implemented
it, and was able to fail over to this other drive faster" or something like
that. Actual drive failure scenarios vary widely, and optimizing for this one
failure is unwise. It may be the right optimization, but it may not. There are
lots of tricky edges in this space.
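To make the hard/soft deadline distinction concrete, here is a minimal sketch of
the soft variant, assuming a per-request context that a driver would keep.  The
io_ctx structure and its fields are hypothetical bookkeeping, not part of any
existing FreeBSD driver; only sbinuptime() and SBT_1S are existing kernel
interfaces.  A hard deadline would additionally have to abort the command
already in flight, which, as noted above, SATA cannot do for a single I/O
without flushing the whole queue.

/*
 * Sketch of the "soft deadline" idea: record when the request was
 * submitted and stop retrying once a time budget is spent.
 */
#include <sys/param.h>
#include <sys/time.h>

struct io_ctx {				/* hypothetical per-request bookkeeping */
	sbintime_t	submit_time;	/* taken at submission via sbinuptime() */
	sbintime_t	soft_deadline;	/* e.g. SBT_1S; 0 means "retry as usual" */
	int		retries_left;
};

/* Returns 1 if another retry should be attempted, 0 if the I/O should fail now. */
static int
io_should_retry(const struct io_ctx *ctx)
{
	if (ctx->retries_left <= 0)
		return (0);
	/* Soft deadline: no further recovery attempts once the budget is gone. */
	if (ctx->soft_deadline != 0 &&
	    sbinuptime() - ctx->submit_time > ctx->soft_deadline)
		return (0);
	return (1);
}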
Warner

From owner-freebsd-geom@freebsd.org Sat Nov 25 22:17:51 2017
From: Warner Losh
Date: Sat, 25 Nov 2017 15:17:49 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long

On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon wrote:
>
> Before anything else, I would like to say that I got an impression that we speak
> from so different angles
> that we either don't understand each other's words or, even worse,
> misinterpret them.

I understand what you are suggesting. Don't take my disagreement with your
proposal as willful misinterpretation. You are proposing something that's a
quick hack. Maybe a useful one, but it's still problematic because it has the
upper layers telling the lower layers what to do (don't do your retry), rather
than what service to provide (I prefer a fast error exit over every effort to
recover the data). And it also does it by overloading the meaning of EIO, which
has real problems that you've not been open to hearing about, I assume due to
your narrow use case apparently blinding you to the bigger-picture issues with
that route.

However, there's a way forward which I think will solve these objections.

First, designate that I/O that fails due to short-circuiting the normal recovery
process returns ETIMEDOUT. The I/O stack currently doesn't use this at all (it
was introduced for the network side of things). This is a general catch-all for
an I/O that we complete before the lower layers have given it the maximum amount
of effort to recover the data, at the user's request.

Next, don't use a flag. Instead add a 32-bit field that is called bio_qos for
quality-of-service hints and another 32-bit field for bio_qos_param. This allows
us to pass down specific quality-of-service desires from the filesystem to the
lower layers. The parameter will be unused in your proposal. BIO_QOS_FAIL_EARLY
may be a good name for a value to set it to (at the moment, just use 1). We'll
assign the other QOS values later for other things. It would allow us to
implement the other sorts of QoS things I talked about as well.

As for B_FAILFAST, it's quite unlike what you're proposing, except in one
incidental detail. It's a complicated state machine that the sd driver in
Solaris implemented. It's an entire protocol. When the device gets errors, it
goes into this failfast state machine. The state machine makes a determination
that the errors are indicators the device is GONE, at least for the moment, and
it will fail I/Os in various ways from there. Any new I/Os that are submitted
will be failed (there's conditional behavior here: depending on a global setting
it's either all I/O or just B_FAILFAST I/O). ZFS appears to set this bit for its
discovery code only, when a device not being there would significantly delay
things. Anyway, when the device returns (basically an I/O gets through or maybe
some other event happens), the driver exits this mode and returns to normal
operation. It appears to be designed not for the use case that you described,
but rather for a drive that's failing all over the place, so that any pending
I/Os get out of the way quickly. Your use case is only superficially similar to
that use case, so the Solaris / Illumos experiences are mildly interesting, but
due to the differences not a strong argument for doing this. This facility in
Illumos is interesting, but would require significantly more retooling of the
lower I/O layers in FreeBSD to implement fully. Plus Illumos (or maybe just
Solaris) has a daemon that looks at failures to manage them at a higher level,
which might make for a better user experience for FreeBSD, so that's something
that needs to be weighed as well.

We've known for some time that HDD retry algorithms take a long time. The same
is true of some SSD or NVMe algorithms, but not all.
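A rough sketch of what the bio_qos idea could look like, purely for
illustration: the field names and the BIO_QOS_FAIL_EARLY value are taken from
the paragraphs above, the qos_error_disposition() helper is invented here, and
none of this exists in the tree.

#include <sys/param.h>
#include <sys/errno.h>

#define	BIO_QOS_FAIL_EARLY	1	/* value suggested above */

struct bio_qos_sketch {			/* stand-in for new struct bio members */
	uint32_t	bio_qos;	/* QoS hint from the upper layer */
	uint32_t	bio_qos_param;	/* unused for FAIL_EARLY */
};

/*
 * Hypothetical driver error path, called after the first failed attempt:
 * 0 means "keep retrying", otherwise the errno to complete the bio with.
 */
static int
qos_error_disposition(const struct bio_qos_sketch *qos, int retries_left)
{
	if (qos->bio_qos == BIO_QOS_FAIL_EARLY)
		return (ETIMEDOUT);	/* gave up early: distinct from EIO */
	if (retries_left > 0)
		return (0);		/* normal policy: let the driver retry */
	return (EIO);			/* retries exhausted: a real I/O error */
}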
The other objection I have to the 'noretry' naming is that it bakes the
currently observed HDD behavior and recovery into the API. This is undesirable,
as other storage technologies have retry mechanisms that happen quite quickly
(and sometimes in the drive itself). The cutoff between fast and slow recovery
is device-specific, as are the methods used. For example, there are new
proposals out in NVMe (and maybe T10/T13 land) to have new types of READ
commands that specify the quality of service you expect, including providing
some sort of deadline hint to clip how much effort is expended in trying to
recover the data. It would be nice to design a mechanism that allows us to start
using these commands when drives are available with them, and possibly to use
timeouts to allow for a faster abort.

Most of your HDD I/O will complete within maybe ~150ms, with a long tail out to
maybe as long as ~400ms. It might be desirable to set a policy that says 'don't
let any I/Os remain in the device longer than a second' and use this mechanism
to enforce that. Or don't let any I/Os last more than 20x the most recent median
I/O time. A single bit is insufficiently expressive to allow these sorts of
things, which is another reason for my objection to your proposal. With the QOS
fields being independent, the clone routines just copy them and make no
judgement about them.

So, those are my problems with your proposal, and also some hopefully useful
ways to move forward. I've chatted with others for years about introducing QoS
things into the I/O stack, so I know most of the above won't be too contentious
(though ETIMEDOUT I haven't socialized, so that may be an area of concern for
people).

Warner
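For what it's worth, the pass-through point mentioned above would look roughly
like a standard GEOM start routine.  This is a sketch, not code from the thread:
g_clone_bio(), g_io_deliver(), g_std_done, and g_io_request() are real GEOM
routines, while the bio_qos/bio_qos_param members are the hypothetical fields
from this discussion and do not exist in struct bio today (hence the
"#ifdef notyet" guard).

/*
 * Sketch: a GEOM class cloning a bio would carry the QoS hint down
 * unchanged rather than interpret it.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/errno.h>
#include <geom/geom.h>

static void
example_geom_start(struct bio *bp, struct g_consumer *cp)
{
	struct bio *cbp;

	cbp = g_clone_bio(bp);
	if (cbp == NULL) {
		g_io_deliver(bp, ENOMEM);
		return;
	}
	cbp->bio_done = g_std_done;
#ifdef notyet
	/* With QoS fields in struct bio, cloning copies them verbatim. */
	cbp->bio_qos = bp->bio_qos;
	cbp->bio_qos_param = bp->bio_qos_param;
#endif
	g_io_request(cbp, cp);
}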