From owner-freebsd-geom@freebsd.org  Fri Nov 24 17:21:01 2017
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
From: Andriy Gapon <agapon@gmail.com>
To: Warner Losh
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Date: Fri, 24 Nov 2017 19:20:51 +0200
List-Id: GEOM-specific discussions and implementations

On 24/11/2017 18:33, Warner Losh wrote:
> On Fri, Nov 24, 2017 at 6:34 AM, Andriy Gapon wrote:
>> On 24/11/2017 15:08, Warner Losh wrote:
>>> On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon wrote:
>>>>     https://reviews.freebsd.org/D13224
>>>>
>>>>     Anyone interested is welcome to join the review.
>>>
>>> I think it's a really bad idea.  It introduces a 'one-size-fits-all'
>>> notion of QoS that seems misguided.  It conflates a shorter timeout
>>> with 'don't retry'.  And why is retrying bad?  It seems more a notion
>>> of 'fail fast' or some other concept.  There are so many other ways
>>> you'd want to use it.  And it uses the same return code (EIO) to mean
>>> something new: EIO has generally meant 'the lower layers have retried
>>> this, and it failed; do not submit it again as it will not succeed',
>>> and this overloads it with 'I gave it a half-assed attempt, and that
>>> failed, but resubmission might work'.  This breaks a number of
>>> assumptions in the BUF/BIO layer, as well as parts of CAM, even more
>>> than they are broken now.
>>>
>>> So let's step back a bit: what problem is it trying to solve?
>>
>> A simple example.  I have a mirror and I issue a read to one of its
>> members.  Let's assume there is some trouble with that particular
>> block on that particular disk.  The disk may spend a lot of time
>> trying to read it and would still fail.  With the current defaults I
>> would wait 5x that time before finally getting the error back.  Then
>> I go to another mirror member and get my data from there.
>> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first
>> read, get the error back sooner and try the other disk sooner.  Only
>> if I know that there are no other copies to try would I use the
>> normal read with all the retrying.
>
> It sounds like you are optimizing the wrong thing and taking an overly
> simplistic view of quality of service.
> First, failing blocks on a disk is fairly rare.  Do you really want to
> optimize for that case?

If it can be done without any harm to the sunny-day scenario, then why
not?  I think that 'robustness' is the word here, not 'optimization'.

> Second, you're really saying 'if you can't read it fast, fail', since
> we only control the software side of read retry.

Am I?  That's not what I wanted to say, really.  I just wanted to say:
if this I/O fails, don't retry it, leave it to me.  This is very
simple, simplistic as you say, but I like simple.
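To illustrate, here is roughly the kind of consumer code I have in
mind.  This is only a sketch, not the actual patch: BIO_NORETRY is the
flag proposed in D13224, while the helper below, its name included, is
hypothetical and much simpler than real mirror code.

#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

/*
 * Read 'length' bytes at 'offset' from the first mirror member with
 * fast-fail semantics; on error, fall back to a normal (retrying)
 * read from the second member.  'length' is assumed to satisfy the
 * usual GEOM constraints (sector-aligned, not larger than MAXPHYS).
 */
static int
mirror_read_fastfail(struct g_consumer *cp1, struct g_consumer *cp2,
    off_t offset, void *data, off_t length)
{
	struct bio *bp;
	int error;

	bp = g_new_bio();
	if (bp == NULL)
		return (ENOMEM);
	bp->bio_cmd = BIO_READ;
	bp->bio_offset = offset;
	bp->bio_data = data;
	bp->bio_length = length;
	bp->bio_done = NULL;		/* we biowait() instead */
	bp->bio_flags |= BIO_NORETRY;	/* proposed flag: fail fast */
	g_io_request(bp, cp1);
	error = biowait(bp, "mirrd1");
	g_destroy_bio(bp);
	if (error == 0)
		return (0);

	/*
	 * The first member failed quickly; this may be the last copy,
	 * so read the second member with full retry semantics.
	 */
	bp = g_new_bio();
	if (bp == NULL)
		return (ENOMEM);
	bp->bio_cmd = BIO_READ;
	bp->bio_offset = offset;
	bp->bio_data = data;
	bp->bio_length = length;
	bp->bio_done = NULL;
	g_io_request(bp, cp2);
	error = biowait(bp, "mirrd2");
	g_destroy_bio(bp);
	return (error);
}

The point is only the shape of the logic: fail fast on the first copy,
try hard on the last one.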
> There are new op codes being proposed that say 'read or fail within X
> ms', which is really what you want: if it's taking too long on disk A
> you want to move to disk B.  The notion here was we'd return EAGAIN
> (or some other error) if it failed after X ms, and maybe do some
> emulation in software for drives that don't support this.  You'd
> tweak this number to control performance.  You're likely to get a
> much bigger performance win all the time by scheduling I/O to drives
> that have the best recent latency.

ZFS already makes some latency-based decisions.  The things that you
describe are very interesting, but they are for the future.

> Third, do you have numbers that show this is actually a win?

I do not have any numbers right now.  What kind of numbers would you
like?  What kind of scenarios?

> This is a terrible thing from an architectural view.

You have said this several times, but unfortunately you haven't
explained it yet.

> Absent numbers that show it's a big win, I'm very hesitant to say OK.
>
> Fourth, there's a large number of places in the stack today that need
> to communicate that their I/O is more urgent, and we don't have any
> good way to communicate even that simple concept down the stack.

That's unfortunate, but my proposal has quite little to do with I/O
scheduling, priorities, etc.

> Finally, the only places where ZFS uses the TRYHARD flag are for
> things like the super block, if I'm reading the code right.  It
> doesn't do it for normal I/O.

Right.  But for normal I/O there is ZIO_FLAG_IO_RETRY, which is
honored in the same way as ZIO_FLAG_TRYHARD.

> There's no code to cope with what would happen if all the copies of a
> block couldn't be read with the NORETRY flag.  One of them might
> contain the data.

ZFS is not that fragile :) see ZIO_FLAG_IO_RETRY above.

-- 
Andriy Gapon
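P.S.  To make the ZIO_FLAG_IO_RETRY point concrete, this is roughly
the mapping I have in mind on the vdev_geom side.  Again only a
sketch: the helper and its name are mine, the ZIO_* flags are the real
ZFS ones, and BIO_NORETRY is the proposed flag from D13224.

#include <sys/bio.h>
#include <sys/zio.h>

/*
 * A first-try read gets fast-fail semantics; once ZFS itself has
 * decided to retry the I/O (ZIO_FLAG_IO_RETRY) or to try hard
 * (ZIO_FLAG_TRYHARD), the normal full-retry path is used.
 */
static void
vdev_geom_set_read_policy(zio_t *zio, struct bio *bp)
{
	if (zio->io_type == ZIO_TYPE_READ &&
	    (zio->io_flags & (ZIO_FLAG_IO_RETRY | ZIO_FLAG_TRYHARD)) == 0)
		bp->bio_flags |= BIO_NORETRY;	/* fail fast; ZFS retries */
}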