From: Warner Losh <wlosh@bsdimp.com>
Date: Fri, 24 Nov 2017 09:33:56 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: FreeBSD FS <freebsd-fs@freebsd.org>, freebsd-geom@freebsd.org, Scott Long

On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon wrote:

> On 24/11/2017 15:08, Warner Losh wrote:
> > On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon wrote:
> > >
> > > https://reviews.freebsd.org/D13224
> > >
> > > Anyone interested is welcome to join the review.
> >
> > I think it's a really bad idea. It introduces a 'one-size-fits-all'
> > notion of QoS that seems misguided. It conflates a shorter timeout with
> > "don't retry." And why is retrying bad? It seems more a notion of 'fail
> > fast' or some other concept. There are so many other ways you'd want to
> > use it.
> > And it uses the same return code (EIO) to mean something new. EIO has
> > generally meant 'the lower layers have retried this, and it failed; do
> > not submit it again, as it will not succeed', not 'I gave it a
> > half-assed attempt, and that failed, but resubmission might work'. This
> > breaks a number of assumptions in the BUF/BIO layer, as well as parts of
> > CAM, even more than they are broken now.
> >
> > So let's step back a bit: what problem is it trying to solve?
>
> A simple example. I have a mirror, and I issue a read to one of its
> members. Let's assume there is some trouble with that particular block on
> that particular disk. The disk may spend a lot of time trying to read it
> and would still fail. With the current defaults I would wait 5x that time
> to finally get the error back. Then I go to another mirror member and get
> my data from there. IMO, this is not optimal. I'd rather pass BIO_NORETRY
> to the first read, get the error back sooner, and try the other disk
> sooner. Only if I knew that there were no other copies to try would I use
> the normal read with all the retrying.

It sounds like you are optimizing the wrong thing and taking an overly
simplistic view of quality of service.

First, failing blocks on a disk are fairly rare. Do you really want to
optimize for that case?

Second, you're really saying "if you can't read it fast, fail", since we
only control the software side of read retry. There are new op codes being
proposed that say "read or fail within X ms", which is really what you
want: if it's taking too long on disk A, you want to move to disk B. The
notion here was that we'd return EAGAIN (or some other error) if the read
failed after X ms, and maybe do some emulation in software for drives that
don't support this. You'd tweak this number to control performance. You're
likely to get a much bigger performance win all the time by scheduling I/O
to the drives that have the best recent latency.
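That latency-steering idea could be sketched roughly like this: each mirror
member keeps an exponentially weighted moving average (EWMA) of its recent
read latencies, and new reads go to the member with the lowest average. All
of the names below are invented for illustration; this is a userspace
sketch, not FreeBSD or ZFS code.

```c
/*
 * Sketch: steer reads to the mirror member with the best recent latency.
 * Hypothetical types and functions, for illustration only.
 */
struct member_stats {
	unsigned long ewma_latency_us;	/* smoothed recent read latency */
};

/* Fold one completed read's latency into the running average (1/8 weight). */
static void
member_record_latency(struct member_stats *m, unsigned long sample_us)
{
	if (m->ewma_latency_us == 0)
		m->ewma_latency_us = sample_us;	/* first sample seeds the average */
	else
		m->ewma_latency_us =
		    (m->ewma_latency_us * 7 + sample_us) / 8;
}

/* Pick the member with the lowest smoothed latency for the next read. */
static int
member_pick(const struct member_stats *members, int n)
{
	int i, best = 0;

	for (i = 1; i < n; i++)
		if (members[i].ewma_latency_us <
		    members[best].ewma_latency_us)
			best = i;
	return (best);
}
```

A slow or struggling drive then naturally sheds read traffic as its average
climbs, without any per-request flag in the I/O path.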
Third, do you have numbers that show this is actually a win? This is a
terrible thing from an architectural view. Absent numbers that show it's a
big win, I'm very hesitant to say OK.

Fourth, there are a large number of places in the stack today that need to
communicate that their I/O is more urgent, and we don't have any good way
to communicate even that simple concept down the stack.

Finally, if I'm reading the code right, the only places ZFS uses the
TRYHARDER flag are for things like the superblock. It doesn't do it for
normal I/O. There's no code to cope with what would happen if all the
copies of a block couldn't be read with the NORETRY flag, even though one
of them might contain the data.

Warner
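The two-phase read Andriy describes could be sketched like this: a "fail
fast" pass over the mirror members first, and only if every fast attempt
fails, a second pass with full driver-level retries. The read_fast/read_retry
callbacks and the simulated drives are invented stand-ins for a driver
honoring (or ignoring) a BIO_NORETRY-style hint; this is not the actual
GEOM/CAM interface.

```c
#include <string.h>

/* Hypothetical callback: 0 on success, -1 on failure. */
typedef int (*read_fn)(int member, char *buf);

/*
 * Two-phase mirror read: try each copy without retries first, then
 * fall back to full retry/recovery reads only as a last resort.
 */
static int
mirror_read(read_fn read_fast, read_fn read_retry, int nmembers, char *buf)
{
	int m;

	/* Phase 1: fast attempts, no driver-level retries. */
	for (m = 0; m < nmembers; m++)
		if (read_fast(m, buf) == 0)
			return (0);
	/* Phase 2: every copy failed fast; now retry hard on each. */
	for (m = 0; m < nmembers; m++)
		if (read_retry(m, buf) == 0)
			return (0);
	return (-1);	/* all copies failed even with retries */
}

/* Simulated drives: member 0 has a bad block, member 1 is healthy. */
static int
fast_read(int member, char *buf)
{
	if (member == 0)
		return (-1);	/* fails quickly, no retries attempted */
	strcpy(buf, "data");
	return (0);
}

static int
retry_read(int member, char *buf)
{
	/* Pretend even long retries can't recover member 0. */
	return (fast_read(member, buf));
}
```

Note this sketch sidesteps Warner's objection only partially: phase 2 exists
precisely because a copy that fails the fast pass might still be readable
with full retries.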