From owner-freebsd-geom@freebsd.org Fri Nov 24 16:33:58 2017
From: Warner Losh <wlosh@bsdimp.com>
Date: Fri, 24 Nov 2017 09:33:56 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
List-Id: GEOM-specific discussions and implementations

On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon wrote:

> On 24/11/2017 15:08, Warner Losh wrote:
> > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon wrote:
> >
> >     https://reviews.freebsd.org/D13224
> >
> >     Anyone interested is welcome to join the review.
> >
> > I think it's a really bad idea. It introduces a 'one-size-fits-all' notion
> > of QoS that seems misguided. It conflates a shorter timeout with don't
> > retry. And why is retrying bad? It seems more a notion of 'fail fast' or
> > some other concept. There are so many other ways you'd want to use it.
> > And it uses the same return code (EIO) to mean something new. It has
> > generally meant 'the lower layers have retried this, and it failed; do not
> > submit it again, as it will not succeed', not 'I gave it a half-assed
> > attempt, and that failed, but resubmission might work'. This breaks a
> > number of assumptions in the BUF/BIO layer, as well as parts of CAM, even
> > more than they are broken now.
> >
> > So let's step back a bit: what problem is it trying to solve?
>
> A simple example. I have a mirror and I issue a read to one of its members.
> Let's assume there is some trouble with that particular block on that
> particular disk. The disk may spend a lot of time trying to read it and
> would still fail. With the current defaults I would wait 5x that time to
> finally get the error back. Then I would go to another mirror member and get
> my data from there. IMO, this is not optimal. I'd rather pass BIO_NORETRY to
> the first read, get the error back sooner, and try the other disk sooner.
> Only if I knew that there were no other copies to try would I use the normal
> read with all the retrying.

It sounds like you are optimizing the wrong thing and taking an overly
simplistic view of quality of service.

First, failing blocks on a disk are fairly rare. Do you really want to
optimize for that case?

Second, you're really saying 'if you can't read it fast, fail', since we only
control the software side of read retry. There are new op codes being proposed
that say 'read or fail within X ms', which is really what you want: if it's
taking too long on disk A, you want to move to disk B. The notion here was
that we'd return EAGAIN (or some other error) if it failed after X ms, and
maybe do some emulation in software for drives that don't support this. You'd
tweak this number to control performance. You're likely to get a much bigger
performance win, all the time, by scheduling I/O to drives that have the best
recent latency.
Third, do you have numbers that show this is actually a win? This is a
terrible thing from an architectural view. Absent numbers that show it's a big
win, I'm very hesitant to say OK.

Fourth, there are a large number of places in the stack today that need to
communicate that their I/O is more urgent, and we don't have any good way to
communicate even that simple concept down the stack.

Finally, the only places that ZFS uses the TRYHARDER flag are for things like
the superblock, if I'm reading the code right. It doesn't do it for normal
I/O. There's no code to cope with what would happen if none of the copies of a
block could be read with the NORETRY flag, even though one of them might
contain the data.

Warner