From owner-freebsd-fs@freebsd.org  Mon Nov 27 15:29:22 2017
From: Scott Long <scottl@samsco.org>
Date: Mon, 27 Nov 2017 08:29:11 -0700
Subject: Re: add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
To: Andriy Gapon
Cc: Warner Losh, FreeBSD FS, freebsd-geom@freebsd.org
Message-Id: <783CA790-C0D3-442D-A5A2-4CB424FCBD62@samsco.org>
In-Reply-To: <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org>
References: <391f2cc7-0036-06ec-b6c9-e56681114eeb@FreeBSD.org>
 <64f37301-a3d8-5ac4-a25f-4f6e4254ffe9@FreeBSD.org>
 <39E8D9C4-6BF3-4844-85AD-3568A6D16E64@samsco.org>
 <33101e6c-0c74-34b7-ee92-f9c4a11685d5@FreeBSD.org>

> On Nov 25, 2017, at 10:36 AM, Andriy Gapon wrote:
>
> On 25/11/2017 12:54, Scott Long wrote:
>>
>>> On Nov 24, 2017, at 10:17 AM, Andriy Gapon wrote:
>>>
>>>>> IMO, this is not optimal.
>>>>> I'd rather pass BIO_NORETRY to the first read, get the error back sooner,
>>>>> and try the other disk sooner.  Only if I know that there are no other
>>>>> copies to try would I use the normal read with all the retrying.
>>>>
>>>> I agree with Warner that what you are proposing is not correct.  It weakens
>>>> the contract between the disk layer and the upper layers, making it less
>>>> clear who is responsible for retries and less clear what "EIO" means.  That
>>>> contract is already weak due to poor design decisions in VFS-BIO and GEOM,
>>>> and Warner and I are working on a plan to fix that.
>>>
>>> Well... I do realize now that there is some problem in this area; both you
>>> and Warner mentioned it.  But knowing that it exists is not the same as
>>> knowing what it is :-)
>>> I understand that it could be rather complex and not easy to describe in a
>>> short email...
>>
>> There are too many questions to ask, so I will do my best to keep the
>> conversation logical.  First, how do you propose to distinguish between EIO
>> due to a lengthy set of timeouts and EIO due to an immediate error returned
>> by the disk hardware?
>
> At what layer / component?
> If I am the issuer of the request, then I know how I issued that request and
> what kind of request it was.  If I am an intermediate layer, then why should
> I care?

I don't understand your response; I think we are talking about two different
things.

>> CAM has an extensive table-driven error recovery protocol whose purpose is
>> to decide whether or not to do retries based on hardware state information
>> that is not made available to the upper layers.  Do you have a test case
>> that demonstrates the problem that you're trying to solve?  Maybe the error
>> recovery table is wrong and you're encountering a case that should not be
>> retried.  If that's what's going on, we should fix CAM instead of inventing
>> a new work-around.
>
> Let's assume that I am talking about the case of not being able to read an
> HDD sector that has gone bad.
> Here is a real world example:
>
> Jun 16 10:40:18 trant kernel: ahcich0: NCQ error, slot = 20, port = -1
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 58 62 40 2c 00 00 08 00 00
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): CAM status: ATA Status Error
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC)
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): RES: 41 40 68 58 62 40 2c 00 00 00 00
> Jun 16 10:40:18 trant kernel: (ada0:ahcich0:0:0:0): Retrying command
> Jun 16 10:40:20 trant kernel: ahcich0: NCQ error, slot = 22, port = -1
...
> I do not see anything wrong in what CAM / ahci / ata_da did here.
> They did what I would expect them to do.  They tried very hard to get the
> data that I told them I need.

Two things I see here.  The first is that the drive spent about 2 seconds
trying to get good data off of the media.  The second is that it is failing
and reporting the error as uncorrectable.  I think that retries at the
OS/driver layer are therefore useless; the drive is already doing a bunch of
its own retries, it is failing, and it is telling you that it's failing.
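To make that concrete, here is a small, standalone illustration of what those
two bytes in the log are saying.  This is not driver code; the macros are just
the standard ATA status/error register bits, and the values are copied from
the log above:

#include <stdint.h>
#include <stdio.h>

/* Standard ATA register bits, spelled out for illustration. */
#define	ATA_S_ERROR	0x01	/* status bit 0: ERR, command failed */
#define	ATA_S_READY	0x40	/* status bit 6: DRDY, device ready */
#define	ATA_E_UNC	0x40	/* error bit 6: uncorrectable data error */

int
main(void)
{
	uint8_t status = 0x41;	/* "ATA status: 41 (DRDY ERR)" from the log */
	uint8_t error = 0x40;	/* "error: 40 (UNC)" from the log */

	if ((status & ATA_S_READY) && (status & ATA_S_ERROR) &&
	    (error & ATA_E_UNC))
		printf("drive is responsive, but reports the data for this "
		    "LBA as uncorrectable after its own retries and ECC\n");
	return (0);
}

In other words, this is not a timeout or a transport problem; the device is
positively asserting that the sector cannot be recovered.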
In the past, the conventional wisdom was to do retries, because 30 years ago
drives had minimal firmware and didn't do a good job of managing data
integrity.  Now they do an extensive amount of analysis-driven error
correction and retrying on their own, so I think it's time to change the
conventional wisdom.  I'd propose that for direct-attach SSDs and HDDs we
treat this error as non-retriable.  Normally this would be a one-line change,
but I think it needs to be nuanced to distinguish between optical drives,
simple flash media drives, and regular HDDs and SSDs.

An interim solution would be to just set the kern.cam.ada.retry_count sysctl
to 0.

> Timestamp of the first error is Jun 16 10:40:18.
> Timestamp of the last error is Jun 16 10:40:27.
> So, it took an additional 9 seconds to finally produce EIO.
> That disk is part of a ZFS mirror.  If the request had failed after the
> first attempt, then ZFS would have been able to get the data from a good
> disk much sooner.

It's the fault of ZFS for not reading all members in parallel.  I ask people
to consider, and possibly accept, that ZFS is not perfect.

> And don't take me wrong, I do NOT want CAM or GEOM to make that decision by
> itself.  I want ZFS to be able to tell the lower layers when they should try
> as hard as they normally do and when they should report an I/O error as soon
> as it happens, without any retries.

Again, I'm not sure I understand your answer.  It is the job of the disk
drivers to reliably complete storage requests, not to just try once and then
assume that an upper layer will figure things out.  As I said in my previous
email, the upper layers, and by that I mean everything above the drivers and
CAM, do not have the error information to know why something failed, and thus
have only limited knowledge of how to remediate a failure.  The error recovery
knowledge is in CAM and the drivers.

It sounds like you're proposing to make these simple, dumb layers and move the
error recovery up into ZFS.  I understand what you're saying about wanting
this to be opt-in and not affect other filesystems, but I think that you are
missing the point of what CAM does and provides.  By definition, the 'da' and
'ada' drivers provide reliable I/O transactions to the drive.  That means that
they handle error recovery, retries, etc.  If you do not want reliable I/O
transactions, then you should not use the da/ada drivers, and instead should
either write your own periph driver (let's call it "zfsda") or have ZFS
communicate with the disks via the pass driver.  That is what userland tools
like cdparanoia do when they want to bypass the traditional operational
characteristics and constraints of the standard periphs.

Think of it this way: would you propose a flag on a TCP socket that tells TCP
not to do retransmits?  No!  If you wanted that semantic, then you'd use UDP
and invent your own retry/reliability layer.  I think it's a big job to
re-invent error recovery in ZFS, while it would be relatively easy to adapt
the CAM retry policy to the behavior of modern disks, i.e. believe them when
they say that there's an uncorrectable read error and don't blindly attempt a
retry.
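To be clear, the following is not a patch, just a sketch of the shape of the
policy I have in mind.  The function, the enum, and where the check would live
are all made up for illustration; the real logic belongs in the CAM/periph
error-recovery path and would need exactly the media-type nuance I mentioned
above:

#include <stdbool.h>
#include <stdint.h>

#define	ATA_S_ERROR	0x01	/* status: ERR bit */
#define	ATA_E_UNC	0x40	/* error: uncorrectable data error */

enum media_type { MEDIA_HDD, MEDIA_SSD, MEDIA_OPTICAL, MEDIA_FLASH_KEY };

/*
 * Illustrative policy only.  A modern direct-attach HDD or SSD that
 * completes a read with ERR+UNC has already exhausted its internal
 * retries and ECC, so host-side retries only add seconds of latency.
 * Older or simpler device classes keep the historical retry behavior.
 */
bool
media_error_worth_retrying(enum media_type media, uint8_t status,
    uint8_t error)
{
	if ((media == MEDIA_HDD || media == MEDIA_SSD) &&
	    (status & ATA_S_ERROR) && (error & ATA_E_UNC))
		return (false);
	return (true);
}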
>> Second, what about disk subsystems that do retries internally, out of the
>> control of the FreeBSD driver?  This would include most hardware RAID
>> controllers.  Should what you are proposing only work for a subset of the
>> kinds of storage systems that are available and in common use?
>
> Yes.  I do not worry about things that are beyond my control.
> Those subsystems would behave as they do now.  So, nothing would get worse.

It gets worse because it builds inconsistency into the system and makes
expectations for behavior unclear.  That turns into more hacky and incorrect
code over time.

>> Third, let's say that you run out of alternate copies to try, and as you
>> stated originally, that will force you to retry the copies that had
>> returned EIO.  How will you know when you can retry?  How will you know how
>> many times you will retry?  How will you know that a retry is even possible?
>
> I am continuing to use ZFS as an example.  It already has all the logic
> built in.  If all vdev zio-s (requests to mirror members, as an example)
> fail, then their parent zio (a logical read from the mirror) will be retried
> by ZFS, and when ZFS retries it sets a special flag (ZIO_FLAG_IO_RETRY) on
> that zio and on its future child zio-s.
>
> Essentially, my answer is that you have to program it correctly; there is no
> magic.
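For anyone following the thread who does not know the ZFS side, the mechanism
being described is roughly the following shape.  This is a simplified sketch,
not actual OpenZFS code: ZIO_FLAG_IO_RETRY is the real flag name, but the
structure layout, the bit value, and the helper functions here are stand-ins.

#include <stdbool.h>
#include <stdint.h>

typedef struct zio {
	int		io_error;
	uint64_t	io_flags;
	/* ... many other fields in the real structure ... */
} zio_t;

#define	ZIO_FLAG_IO_RETRY	(1ULL << 0)	/* placeholder bit value */

bool all_children_failed(zio_t *zio);	/* stand-in helper */
void zio_reexecute(zio_t *zio);		/* modeled on the real re-issue path */

void
mirror_read_done(zio_t *zio)
{
	if (all_children_failed(zio)) {
		/*
		 * Every mirror member returned an error, so retry the
		 * logical read; the flag is inherited by the child zios
		 * issued during the retry.
		 */
		zio->io_flags |= ZIO_FLAG_IO_RETRY;
		zio_reexecute(zio);
		return;
	}
	/* Normal completion: hand the data up to the parent. */
}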
Again, I think that you're missing the point of what I'm asking.  All that ZFS
can see is "EIO".  Maybe the EIO was generated because of an uncorrectable
media error, and all attempts to retry will be futile.  Maybe it was generated
because of a correctable error, and a retry might be useful.  There are many,
many, many error cases that are tested for and handled in CAM.  You are
collapsing them all down to one case, and I don't agree with that.

>> Should the retries be able to be canceled?
>
> I think that this is an orthogonal question.
> I do not have any answer, and I am not ready to discuss this at all.

Please be ready to discuss this.

>> Why is overloading EIO so bad?  brelse() will call bdirty() when a
>> BIO_WRITE command has failed with EIO.  Calling bdirty() has the effect of
>> retrying the I/O.  This disregards the fact that disk drivers only return
>> EIO when they've decided that the I/O cannot be retried.  It has no
>> termination condition for the retries, and will endlessly retry I/O in
>> vain; I've seen this quite frequently.  It also disregards the fact that
>> I/O marked as B_PAGING can't be retried in this fashion, and will trigger a
>> panic.  Because we pretend that EIO can be retried, we are left with a
>> system that is very fragile when I/O actually does fail.  Instead of adding
>> more special cases and blurred lines, I want to go back to enforcing strict
>> contracts between the layers and force the core parts of the system to
>> respect those contracts and handle errors properly, instead of just
>> retrying and hoping for the best.
>
> So, I suggest that the buffer layer (all the b* functions) not use the
> proposed flag.  Any problems that exist in it should be resolved first.
> ZFS does not use that layer.

I think it's foolish to propose a change in the disk API and then say "only
ZFS should use this."

>>> But then, this flag is optional; it's off by default and no one is forced
>>> to use it.  If it's used only by ZFS, then it would not be horrible.
>>> Unless it makes things very hard for the infrastructure.
>>> But I am circling back to not knowing what problem(s) you and Warner are
>>> planning to fix.
>>
>> Saying that a feature is optional means nothing; while consumers of the API
>> might be able to ignore it, the producers of the API cannot ignore it.  It
>> is these producers who are sick right now and should be fixed, instead of
>> creating new ways to get even more sick.
>
> I completely agree.
> But which producers of the API do you mean specifically?
> So far, you have mentioned only the consumer-level problems with the b-layer.
>
> Having said all of the above, I must admit one thing.
> When I proposed BIO_NORETRY I had only the simplest GEOM topology in mind:
> ZFS -> [partition] -> disk.
> Now that I start to think about more complex topologies, I am not really
> sure whether the flag can be handled reasonably and clearly by geom-s with
> complex internal behavior, or whether it will create a big mess.
> E.g., things like gmirrors on top of gmirrors and so on.
> Maybe the flag, if it is ever accepted, should never be propagated
> automatically.  Only geom-s that are aware of it should propagate or request
> it.  That should be safer.

Yes, it gets chaotic and unreliable once you have multiple layers each
thinking that they know better than the previous layer.  That is the situation
that FreeBSD is currently in with CAM, GEOM, and VFS-BIO, and I do not propose
to make it even more chaotic.  Through basic rules of abstraction and basic
computer science principles, it follows that the layers of software closest to
the operation (in this case, the hardware) have the least abstraction and the
most information about how to act.  Saying that ZFS knows best how the
hardware works, when it has no access to the actual control-channel
information of the hardware, is confusing and illogical to me.

I think that there's a legitimate problem with CAM error recovery trying too
hard to do retries when the disk is clearly telling it that retrying is
futile.  I don't have a quick patch for that, because I would like to think
about how to make it dependent on the media type.  However, it's something
that I'm willing to discuss further and work on.

Scott