From owner-freebsd-geom@freebsd.org Sat Nov 25 22:17:51 2017
From: Warner Losh
Date: Sat, 25 Nov 2017 15:17:49 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: FreeBSD FS, freebsd-geom@freebsd.org, Scott Long

On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon wrote:

> Before anything else, I would like to say that I got an impression that
> we speak from so different angles that we either don't understand each
> other's words or, even worse, misinterpret them.

I understand what you are suggesting. Don't take my disagreement with your proposal as willful misinterpretation. You are proposing something that's a quick hack. Maybe a useful one, but it's still problematic because it has the upper layers telling the lower layers what to do (don't do your retry) rather than what service to provide (I prefer a fast error exit over every effort to recover the data). It also does this by overloading the meaning of EIO, which has real problems that you've not been open to hearing about, I assume because your narrow use case is blinding you to the bigger-picture issues with that route.

However, there's a way forward that I think will address these objections.

First, designate that an I/O which fails because the normal recovery process was short-circuited returns ETIMEDOUT. The I/O stack currently doesn't use this errno at all (it was introduced for the network side of things). It's a general catch-all for an I/O that we complete, at the user's request, before the lower layers have made the maximum effort to recover the data.

Next, don't use a flag. Instead, add a 32-bit field called bio_qos for quality-of-service hints and another 32-bit field, bio_qos_param, for a per-hint parameter. This lets us pass specific quality-of-service desires from the filesystem down to the lower layers. The parameter would be unused in your proposal. BIO_QOS_FAIL_EARLY may be a good name for the value to set it to (for the moment, just use 1). We'd assign other QOS values later for other things, which would let us implement the other sorts of QoS features I've talked about as well.
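To make the shape of this concrete, here's a rough sketch. Only the two fields and BIO_QOS_FAIL_EARLY are what I'm actually proposing; struct bio_sketch, BIO_QOS_NONE and the helper functions are placeholders for illustration, not a patch against struct bio.

/*
 * Sketch only: bio_qos, bio_qos_param and BIO_QOS_FAIL_EARLY do not exist
 * today; names, widths and values here are placeholders.
 */
#include <stdint.h>

#define	BIO_QOS_NONE		0
#define	BIO_QOS_FAIL_EARLY	1	/* prefer a fast error over full recovery */

struct bio_sketch {			/* stand-in for struct bio */
	/* ... existing fields: bio_cmd, bio_offset, bio_length, ... */
	uint32_t	bio_qos;	/* QoS hint from the upper layer */
	uint32_t	bio_qos_param;	/* per-hint parameter; unused for FAIL_EARLY */
};

/* Upper layer (e.g. vdev probe code) asks for a fast error exit. */
static void
mark_fail_early(struct bio_sketch *bp)
{
	bp->bio_qos = BIO_QOS_FAIL_EARLY;
	bp->bio_qos_param = 0;
}

/* Lower layer: on error, decide whether to skip the long recovery path
 * and complete the I/O with ETIMEDOUT instead. */
static int
want_full_recovery(const struct bio_sketch *bp)
{
	return (bp->bio_qos != BIO_QOS_FAIL_EARLY);
}

The point is that the hint expresses what service the upper layer wants; the driver still decides how to honor it for its device.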
As for B_FAILFAST, it's quite unlike what you're proposing, except in one incidental detail. It's a complicated state machine that the sd driver in Solaris implemented; it's an entire protocol. When the device gets errors, it goes into this failfast state machine, which decides that the errors indicate the device is GONE, at least for the moment, and it fails I/Os in various ways from there. Any new I/Os that are submitted will be failed (there's conditional behavior here: depending on a global setting, it's either all I/O or just B_FAILFAST I/O). ZFS appears to set this bit only in its discovery code, where a device not being there would significantly delay things. When the device returns (basically, an I/O gets through, or maybe some other event happens), the driver exits this mode and returns to normal operation. It appears to be designed not for the use case you described, but for a drive that's failing all over the place, so that any pending I/Os get out of the way quickly. Your use case is only superficially similar, so the Solaris / Illumos experience is mildly interesting but, due to the differences, not a strong argument for doing this. This facility in Illumos is interesting, but it would require significantly more retooling of the lower I/O layers in FreeBSD to implement fully. Plus, Illumos (or maybe just Solaris) has a daemon that looks at failures and manages them at a higher level, which might make for a better user experience for FreeBSD, so that's something that needs to be weighed as well.

We've known for some time that HDD retry algorithms take a long time. The same is true of some SSD and NVMe algorithms, but not all. The other objection I have to the 'noretry' naming is that it bakes the currently observed HDD behavior and recovery into the API. This is undesirable because other storage technologies have retry mechanisms that happen quite quickly (and sometimes in the drive itself). The cutoff between fast and slow recovery is device specific, as are the methods used. For example, there are new proposals out in NVMe (and maybe T10/T13 land) for new types of READ commands that specify the quality of service you expect, including some sort of deadline hint that clips how much effort is expended in trying to recover the data. It would be nice to design a mechanism that lets us start using these commands when drives that support them become available, and possibly to use timeouts to allow for a faster abort. Most of your HDD I/O will complete within maybe ~150ms, with a long tail out to maybe as long as ~400ms. It might be desirable to set a policy that says 'don't let any I/O remain in the device longer than a second' and use this mechanism to enforce it. Or: don't let any I/O last more than 20x the most recent median I/O time. A single bit is insufficiently expressive to allow these sorts of things, which is another reason for my objection to your proposal. With the QOS fields being independent, the clone routines just copy them and make no judgment about them.
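To sketch why the parameter field matters (BIO_QOS_DEADLINE, the 20x-median policy and the helpers below are invented purely for illustration, using the same stand-in struct as the earlier sketch):

/* Sketch only: a QoS value whose parameter carries a per-I/O time budget. */
#include <stdint.h>

#define	BIO_QOS_DEADLINE	2	/* bio_qos_param = time budget in ms */

struct bio_sketch {			/* stand-in for struct bio, as above */
	uint32_t	bio_qos;
	uint32_t	bio_qos_param;
	/* ... the rest of struct bio ... */
};

/* Upper layer: allow roughly 20x the recently observed median service time. */
static void
mark_deadline(struct bio_sketch *bp, uint32_t median_ms)
{
	bp->bio_qos = BIO_QOS_DEADLINE;
	bp->bio_qos_param = 20 * median_ms;
}

/* Clone path: no policy is applied here; the hint is copied verbatim. */
static void
clone_qos(struct bio_sketch *dst, const struct bio_sketch *src)
{
	dst->bio_qos = src->bio_qos;
	dst->bio_qos_param = src->bio_qos_param;
}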
So, those are my problems with your proposal, and also some hopefully useful ways to move forward. I've chatted with others for years about introducing QoS concepts into the I/O stack, so I know most of the above won't be too contentious (though I haven't socialized the ETIMEDOUT idea, so that may be an area of concern for some people).

Warner