From: Warner Losh
Date: Sat, 25 Nov 2017 10:57:38 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: Scott Long, FreeBSD FS, freebsd-geom@freebsd.org

On Sat, Nov 25, 2017 at 10:36 AM, Andriy Gapon wrote:
> Timestamp of the first error is Jun 16 10:40:18.
> Timestamp of the last error is Jun 16 10:40:27.
> So, it took an additional 9 seconds to finally produce EIO.
> That disk is a part of a ZFS mirror. If the request was failed after the
> first attempt, then ZFS would be able to get the data from a good disk
> much sooner.
>
> And don't take me wrong, I do NOT want CAM or GEOM to make that decision
> by itself. I want ZFS to be able to tell the lower layers when they
> should try as hard as they normally do and when they should report an
> I/O error as soon as it happens, without any retries.

Let's walk through this. You see that it takes a long time to fail an I/O. That's a perfectly reasonable observation, and there are two reasons for it. One is that the disk takes a while on each attempt to get the data. The other is that the system has a global policy that's biased towards 'recover the data' over 'fail fast'. Both can be addressed by reducing the timeouts, or by lowering the read-retry count for a given drive or globally, as a policy decision made by the system administrator.

It may be perfectly reasonable to ask the lower layers to 'fail fast' by putting either a hard or a soft deadline on a subset of I/Os. A hard deadline would return ETIMEDOUT or something like it when the deadline passes and cancel the I/O. That gives better determinism in the system, but some devices can't cancel just one I/O (SATA drives, for example), so we have to flush the whole queue; if we get a lot of these, performance suffers. However, for some classes of drives you know that if a request hasn't succeeded within, say, 1s of being submitted to the drive, it's unlikely to complete successfully at all, and the performance hit is worth taking on a drive that's already acting up.

Alternatively, you could have a soft deadline, which says 'take no additional recovery action for this I/O once X time has elapsed.' It's similar to the hard deadline, but it simply stops retrying after the deadline has passed. That's better for the other users of the drive, assuming the read-recovery operations aren't starving them, and it's easier to implement, but it has worse worst-case performance characteristics.

You aren't really asking to limit retries. You're asking the I/O subsystem to limit, where it can, the amount of time spent on one I/O so you can try another one. Your means of doing that is to tell it not to retry, and that's the wrong means. It shouldn't be expressed in the API as a 'NO RETRY' request; it should be a QoS request flag: fail fast (a rough sketch of what that could look like is at the end of this message). Part of why I'm being so difficult is that you don't understand this and are proposing a horrible API. It should have a different name. The other reason is that I absolutely do not want to overload EIO: you must return a different error back up the stack. You've shown no interest in this in the past, which also makes this a needless argument. We've given good reasons, and you've pooh-poohed them with bad ones.

Also, this isn't the data I asked for. I know things can fail slowly. I was asking how the change would improve systems running like this, as in "I implemented it, and was able to fail over to the other drive faster," or something like that. Actual drive failure scenarios vary widely, and optimizing for this one failure mode is unwise. It may be the right optimization, but it may not. There are lots of tricky edges in this space.

Warner
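
P.S. To make the hard/soft deadline split above concrete, here is a rough sketch of the decision logic as plain C. It is not kernel code and none of these names exist anywhere; a real hard deadline would also have to cancel the in-flight command, which is the part SATA can't do per-I/O.

	#include <errno.h>
	#include <stdbool.h>
	#include <time.h>

	enum io_deadline_kind { DL_NONE, DL_SOFT, DL_HARD };

	struct io_req {
		struct timespec deadline;	/* absolute deadline, if any */
		enum io_deadline_kind kind;
		int retries_left;		/* the drive's normal retry budget */
	};

	static bool
	past_deadline(const struct io_req *req, const struct timespec *now)
	{
		if (req->kind == DL_NONE)
			return (false);
		if (now->tv_sec != req->deadline.tv_sec)
			return (now->tv_sec > req->deadline.tv_sec);
		return (now->tv_nsec >= req->deadline.tv_nsec);
	}

	/*
	 * Decide what to do after an attempt completes.  Returns 0 to deliver
	 * the result, -1 to schedule another attempt, or an errno to fail the
	 * request now.
	 */
	static int
	next_action(struct io_req *req, int attempt_error,
	    const struct timespec *now)
	{
		if (attempt_error == 0)
			return (0);		/* success, deliver it */
		if (req->kind == DL_HARD && past_deadline(req, now))
			return (ETIMEDOUT);	/* gave up on time, not on the media */
		if (req->kind == DL_SOFT && past_deadline(req, now))
			return (attempt_error);	/* no more retries, report what we have */
		if (req->retries_left-- > 0)
			return (-1);		/* retry within the normal budget */
		return (attempt_error);		/* retry budget exhausted */
	}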
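And the 'fail fast' QoS hint itself might take a shape like this. BIO_QOS_FAILFAST, its value, and both helper functions are made up for the example; bio_flags, g_io_request() and biofinish() are the existing pieces they would hang off of, and the errno mapping is only meant to show "report the give-up with something other than EIO".

	/*
	 * Sketch only: the consumer states a latency requirement, and the
	 * lower layer both honours it and reports an early give-up with an
	 * error distinct from EIO.
	 */
	#include <sys/param.h>
	#include <sys/errno.h>
	#include <sys/bio.h>
	#include <geom/geom.h>

	#define	BIO_QOS_FAILFAST	0x0100	/* hypothetical bio_flags bit */

	/* Consumer side (e.g. a mirror/raidz vdev that has another copy). */
	static void
	submit_latency_sensitive_read(struct g_consumer *cp, struct bio *bp)
	{
		bp->bio_flags |= BIO_QOS_FAILFAST;	/* "fail fast", not "no retry" */
		g_io_request(bp, cp);
	}

	/*
	 * Lower-layer side, wherever the retry decision is made: give up
	 * early on a marked request and report it distinctly, so callers can
	 * tell a fast give-up from a fully retried, genuine EIO.
	 */
	static void
	fail_or_finish(struct bio *bp, int error)
	{
		if (error == EIO && (bp->bio_flags & BIO_QOS_FAILFAST) != 0)
			error = ETIMEDOUT;	/* deliberately not EIO */
		biofinish(bp, NULL, error);
	}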