From: Scott Long
Date: Sat, 25 Nov 2017 03:54:01 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: Warner Losh, FreeBSD FS, freebsd-geom@freebsd.org

> On Nov 24, 2017, at 10:17 AM, Andriy Gapon wrote:
>
>
>>> IMO, this is not optimal. I'd rather pass BIO_NORETRY to the first read, get
>>> the error back sooner and try the other disk sooner. Only if I know that there
>>> are no other copies to try, then I would use the normal read with all the retrying.
>>>
>>
>> I agree with Warner that what you are proposing is not correct. It weakens the
>> contract between the disk layer and the upper layers, making it less clear who is
>> responsible for retries and less clear what “EIO” means. That contract is already
>> weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I
>> are working on a plan to fix that.
>
> Well... I do realize now that there is some problem in this area, both you and
> Warner mentioned it. But knowing that it exists is not the same as knowing what
> it is :-)
> I understand that it could be rather complex and not easy to describe in a short
> email…
>

There are too many questions to ask, so I will do my best to keep the
conversation logical.

First, how do you propose to distinguish between EIO due to a lengthy set
of timeouts, vs EIO due to an immediate error returned by the disk
hardware? CAM has an extensive table-driven error recovery protocol whose
purpose is to decide whether or not to do retries based on hardware state
information that is not made available to the upper layers. Do you have a
test case that demonstrates the problem that you’re trying to solve? Maybe
the error recovery table is wrong and you’re encountering a case that
should not be retried. If that’s what’s going on, we should fix CAM
instead of inventing a new work-around.

Second, what about disk subsystems that do retries internally, out of the
control of the FreeBSD driver? This would include most hardware RAID
controllers. Should what you are proposing only work for a subset of the
kinds of storage systems that are available and in common use?

Third, let’s say that you run out of alternate copies to try, and as you
stated originally, that will force you to retry the copies that had
returned EIO. How will you know when you can retry? How will you know how
many times you will retry? How will you know that a retry is even
possible? Should the retries be able to be canceled?

Why is overloading EIO so bad? brelse() will call bdirty() when a
BIO_WRITE command has failed with EIO. Calling bdirty() has the effect of
retrying the I/O. This disregards the fact that disk drivers only return
EIO when they’ve decided that the I/O cannot be retried.
It has no termination condition for the retries, and will endlessly retry
I/O in vain; I’ve seen this quite frequently. It also disregards the fact
that I/O marked as B_PAGING can’t be retried in this fashion, and will
trigger a panic. Because we pretend that EIO can be retried, we are left
with a system that is very fragile when I/O actually does fail. Instead of
adding more special cases and blurred lines, I want to go back to
enforcing strict contracts between the layers and force the core parts of
the system to respect those contracts and handle errors properly, instead
of just retrying and hoping for the best.

> But then, this flag is optional, it's off by default and no one is forced to
> use it. If it's used only by ZFS, then it would not be horrible.
> Unless it makes things very hard for the infrastructure.
> But I am circling back to not knowing what problem(s) you and Warner are
> planning to fix.
>

Saying that a feature is optional means nothing; while consumers of the
API might be able to ignore it, the producers of the API cannot ignore it.
It is these producers who are sick right now and should be fixed, instead
of creating new ways to get even more sick.

Scott
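
To make the contract argued for above concrete, here is a minimal
sketch in C of a table-driven retry policy in which the lower layer owns
all retries, every retry loop has a termination condition, and EIO is
returned only once the layer has given up. It is an illustration only and
is not CAM's code: the status codes, the table contents, the retry limits,
and the names submit_io(), always_timeout(), busy_then_ok() and
always_medium_error() are all hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical device status codes; these are not CAM's real values. */
enum dev_status { DEV_OK, DEV_BUSY, DEV_TIMEOUT, DEV_MEDIUM_ERROR };

enum recovery_action { ACT_DONE, ACT_RETRY, ACT_FAIL };

/* One row of the (assumed) recovery table: what to do for a status. */
struct recovery_entry {
    enum dev_status status;
    enum recovery_action action;
    int max_retries;            /* upper bound on retries for this status */
};

static const struct recovery_entry recovery_table[] = {
    { DEV_OK,           ACT_DONE,  0 },
    { DEV_BUSY,         ACT_RETRY, 4 },
    { DEV_TIMEOUT,      ACT_RETRY, 2 },
    { DEV_MEDIUM_ERROR, ACT_FAIL,  0 },
};

static const struct recovery_entry *lookup(enum dev_status s)
{
    size_t n = sizeof(recovery_table) / sizeof(recovery_table[0]);
    for (size_t i = 0; i < n; i++)
        if (recovery_table[i].status == s)
            return &recovery_table[i];
    return NULL;
}

/* Callback that issues the I/O once and reports the device status. */
typedef enum dev_status (*issue_fn)(void *arg);

/*
 * Issue an I/O, retrying only as the table allows. Returns 0 on success.
 * Returns EIO only after the table says the error is not retryable or the
 * retry budget is spent, so a caller can treat EIO as final.
 */
int submit_io(issue_fn issue, void *arg, int *attempts)
{
    int retries = 0;

    for (;;) {
        enum dev_status st = issue(arg);
        (*attempts)++;

        const struct recovery_entry *e = lookup(st);
        if (e == NULL)
            return EIO;         /* unknown status: fail, do not guess */
        if (e->action == ACT_DONE)
            return 0;
        if (e->action == ACT_FAIL || retries >= e->max_retries)
            return EIO;         /* terminal: the caller must not retry */
        retries++;
    }
}

/* Example device stub: always reports a timeout. */
enum dev_status always_timeout(void *arg)
{
    (void)arg;
    return DEV_TIMEOUT;
}

/* Example device stub: reports a medium error immediately. */
enum dev_status always_medium_error(void *arg)
{
    (void)arg;
    return DEV_MEDIUM_ERROR;
}

/* Example device stub: busy for *arg calls, then succeeds. */
enum dev_status busy_then_ok(void *arg)
{
    int *remaining = arg;
    if (*remaining > 0) {
        (*remaining)--;
        return DEV_BUSY;
    }
    return DEV_OK;
}
```

In this sketch a caller sees the same EIO whether the device failed at
once (always_medium_error) or after the retry budget ran out
(always_timeout). That is the property the strict contract relies on: the
upper layer never needs to know why, only that retrying the same request
is pointless.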