Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
From: Andriy Gapon
To: Warner Losh
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long
Date: Tue, 12 Dec 2017 18:36:49 +0200

On 26/11/2017 00:17, Warner Losh wrote:
> On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon wrote:
>
>> Before anything else, I would like to say that I got the impression that
>> we speak from such different angles that we either don't understand each
>> other's words or, even worse, misinterpret them.
>
> I understand what you are suggesting. Don't take my disagreement with your
> proposal as willful misinterpretation. You are proposing something that's a
> quick hack.

Very true.
> Maybe a useful one, but it's still problematic because it has the upper
> layers telling the lower layers what to do (don't do your retry), rather
> than what service to provide (I prefer a fast error exit over every effort
> to recover the data).

Also true.

> And it also does it by overloading the meaning of EIO, which has real
> problems that you've not been open to hearing about, I assume due to your
> narrow use case apparently blinding you to the bigger-picture issues with
> that route.

Quite likely.

> However, there's a way forward which I think will solve these objections.
> First, designate that I/O which fails because the normal recovery process
> was short-circuited returns ETIMEDOUT. The I/O stack currently doesn't use
> this at all (it was introduced for the network side of things). This is a
> general catch-all for an I/O that we complete, at the user's request,
> before the lower layers have given it the maximum amount of effort to
> recover the data. Next, don't use a flag. Instead, add a 32-bit field
> called bio_qos for quality-of-service hints and another 32-bit field,
> bio_qos_param. This allows us to pass specific quality-of-service desires
> down from the filesystem to the lower layers. The parameter will be unused
> in your proposal. BIO_QOS_FAIL_EARLY may be a good name for a value to set
> it to (at the moment, just use 1). We'll assign the other QOS values later
> for other things. It would allow us to implement the other sorts of QoS
> things I talked about as well.

That's a very interesting and workable suggestion. I will try to work on it.
(To be sure we mean the same thing, I put a small sketch of how I read it
a bit further down.)

> As for B_FAILFAST, it's quite unlike what you're proposing, except in one
> incidental detail. It's a complicated state machine that the sd driver in
> Solaris implemented. It's an entire protocol. When the device gets errors,
> it goes into this failfast state machine. The state machine makes a
> determination that the errors are indicators the device is GONE, at least
> for the moment, and it will fail I/Os in various ways from there. Any new
> I/Os that are submitted will be failed (there's conditional behavior here:
> depending on a global setting it's either all I/O or just B_FAILFAST I/O).

Yeah, I realized that B_FAILFAST was quite different from the first
impression that I got from its name. Thank you for doing and sharing your
analysis of how it actually works.

> ZFS appears to set this bit for its discovery code only, when a device not
> being there would significantly delay things.

I think that ZFS sets the bit for all 'first-attempt' I/O. It's the various
retries / recovery where this bit is not set.

> Anyway, when the device returns (basically an I/O gets through, or maybe
> some other event happens), the driver exits this mode and returns to normal
> operation. It appears to be designed not for the use case that you
> described, but rather for a drive that's failing all over the place, so
> that any pending I/Os get out of the way quickly. Your use case is only
> superficially similar to that use case, so the Solaris / Illumos
> experiences are mildly interesting but, due to the differences, not a
> strong argument for doing this. This facility in Illumos is interesting,
> but would require significantly more retooling of the lower I/O layers in
> FreeBSD to implement fully. Plus Illumos (or maybe just Solaris) has a
> daemon that looks at failures to manage them at a higher level, which might
> make for a better user experience for FreeBSD, so that's something that
> needs to be weighed as well.

Okay.
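
To be sure we mean the same thing about bio_qos, here is roughly what I
picture. This is only a sketch: the two fields and BIO_QOS_FAIL_EARLY are
your proposal, and the vdev_geom and ata_da call sites are my guess at how
the consumers would look.

    /* sys/sys/bio.h -- proposed additions, these fields do not exist yet */
    uint32_t bio_qos;        /* quality-of-service hint */
    uint32_t bio_qos_param;  /* per-hint parameter, unused for this hint */

    #define BIO_QOS_FAIL_EARLY  1  /* provisional value, per above */

    /* vdev_geom.c -- my guess: tag first-attempt I/O only, leave the
     * retry / recovery passes untagged */
    bp->bio_qos = BIO_QOS_FAIL_EARLY;
    g_io_request(bp, cp);

    /* ata_da.c error path -- my guess: skip the retry loop and complete
     * early with the agreed-upon errno instead of overloading EIO */
    if (bp->bio_qos == BIO_QOS_FAIL_EARLY) {
            bp->bio_error = ETIMEDOUT;
            bp->bio_flags |= BIO_ERROR;
            biodone(bp);
            return;
    }

If that matches your intent, then the clone routines indeed only need to
copy the two fields verbatim, as you say below.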
> We've known for some time that HDD retry algorithms take a long time. The
> same is true of some SSD or NVMe algorithms, but not all. The other
> objection I have to the 'noretry' naming is that it bakes the currently
> observed HDD behavior and recovery into the API. This is undesirable, as
> other storage technologies have retry mechanisms that happen quite quickly
> (and sometimes in the drive itself). The cutoff between fast and slow
> recovery is device specific, as are the methods used. For example, there
> are new proposals out in NVMe (and maybe T10/T13 land) to have new types of
> READ commands that specify the quality of service you expect, including
> providing some sort of deadline hint to clip how much effort is expended in
> trying to recover the data. It would be nice to design a mechanism that
> allows us to start using these commands when drives are available with
> them, and possibly using timeouts to allow for a faster abort. Most of your
> HDD I/O will complete within maybe ~150ms, with a long tail out to maybe as
> long as ~400ms. It might be desirable to set a policy that says 'don't let
> any I/Os remain in the device longer than a second' and use this mechanism
> to enforce that. Or don't let any I/Os last more than 20x the most recent
> median I/O time. A single bit is insufficiently expressive to allow these
> sorts of things, which is another reason for my objection to your proposal.
> With the QOS fields being independent, the clone routines just copy them
> and pass no judgment on them.

I now agree with this. Thank you for the detailed explanation.

> So, those are my problems with your proposal, and also some hopefully
> useful ways to move forward. I've chatted with others for years about
> introducing QoS things into the I/O stack, so I know most of the above
> won't be too contentious (though ETIMEDOUT I haven't socialized, so that
> may be an area of concern for people).

Thank you!

-- 
Andriy Gapon
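
P.S. To convince myself that the '20x the most recent median I/O time'
policy is cheap to compute, I wrote the toy below. It is purely
illustrative userland C, not kernel code, and every name in it is made up:

    /* Illustrative only: computing a '20x recent median' I/O deadline. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define LAT_WINDOW 128  /* window of recent latencies, in microseconds */

    static uint64_t lat_ring[LAT_WINDOW];
    static int lat_count, lat_head;

    static int
    cmp_u64(const void *a, const void *b)
    {
            uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

            return ((x > y) - (x < y));
    }

    /* Record the latency of a completed I/O. */
    void
    lat_record(uint64_t us)
    {
            lat_ring[lat_head] = us;
            lat_head = (lat_head + 1) % LAT_WINDOW;
            if (lat_count < LAT_WINDOW)
                    lat_count++;
    }

    /* Deadline for the next I/O: 20x the median of the recent window,
     * falling back to one second until we have any samples. */
    uint64_t
    io_deadline_us(void)
    {
            uint64_t tmp[LAT_WINDOW];

            if (lat_count == 0)
                    return (1000 * 1000);
            memcpy(tmp, lat_ring, lat_count * sizeof(tmp[0]));
            qsort(tmp, lat_count, sizeof(tmp[0]), cmp_u64);
            return (20 * tmp[lat_count / 2]);
    }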