Message-Id: <200103301847.KAA10189@beastie.mckusick.com>
To: Terry Lambert
Subject: Re: Background Fsck
Cc: arch@freebsd.org
In-Reply-To: Your message of "Fri, 30 Mar 2001 12:52:29 GMT." <200103301252.FAA06540@usr05.primenet.com>
Date: Fri, 30 Mar 2001 10:47:46 -0800
From: Kirk McKusick

    From: Terry Lambert
    Subject: Re: Background Fsck
    To: mckusick@mckusick.com (Kirk McKusick)
    Date: Fri, 30 Mar 2001 12:52:29 +0000 (GMT)
    Cc: arch@FreeBSD.ORG

    I have a question about the safety of this approach:

    You don't seem to be able to distinguish between:

    1) Hardware crash without data corruption - e.g. power failure

    2) Hardware crash with data corruption - e.g. disk/controller/memory failure

    3) Software crash without data corruption - e.g. resource availability failure, or panic as a result of coding error

    4) Software crash with data corruption - e.g. a panic resulting from kernel data becoming corrupt, with an unknown interval preceding the crash in which some of these structures might have had FS data in them, or such a crash in the FS code path itself, where the data corruption was a primary effect instead of a side effect

    It seems to me that background checking is only safe in cases 1 and 3, and (the current California power grid reliability notwithstanding) that these cases are not provably the statistically most common cases.

    The reason Whistle did not do this work earlier was that we were unable to address this concern adequately without non-volatile RAM to store the failure reason and the disk write cache status. Since panic reasons are mathematically indistinguishable in the limit, we were also unable to address differentiating 3 and 4 without placing the FS and I/O subsystem into a separate protection domain. Even doing this, we would only gain some statistical protection against #4, which means the only value which we could add was to case #1, were we to invest in the additional hardware.

    In other words, it was not speed of fsck which drove Whistle to soft updates.

    My question is this: how were you able to address these issues in your implementation?

                    Terry Lambert
                    terry@lambert.org
    ---
    Any opinions in this posting are my own and not those of my
    present or previous employers.

In general, your observations are correct. In the current framework it is not possible to guarantee that you can always sort out which of the four cases above you are in and then take the correct action. Whistle needed to make those sorts of guarantees, and consequently could not fall back to something like background fsck. I do not purport to make this sort of guarantee. I say only that I will do the right thing in cases #1 and #3 and that I will do my best to detect that I am in cases #2 and #4 and exit gracefully after logging a message saying that an unexpected inconsistency has arisen and that manual intervention is needed. For systems where this is not good enough, the system administrator has the option of forcing foreground checks or not using soft updates at all.

        Kirk
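
The policy described in the reply can be sketched roughly as follows. This is only an illustration of the decision logic, not code from fsck or the kernel: the names `Action`, `choose_action`, `force_foreground`, and `unexpected_inconsistency` are hypothetical, and how an "unexpected inconsistency" would actually be detected is not specified in the message.

```python
import logging
from enum import Enum

log = logging.getLogger("fsck-policy-sketch")  # hypothetical logger name


class Action(Enum):
    FOREGROUND_CHECK = "check the file system before it is used"
    BACKGROUND_CHECK = "check the file system while it is in use"
    MANUAL_INTERVENTION = "stop and ask the administrator to step in"


def choose_action(force_foreground: bool, unexpected_inconsistency: bool) -> Action:
    """Pick an action after a crash, following the policy in the message.

    force_foreground: the administrator has opted out of background checks.
    unexpected_inconsistency: the checker found damage that should not occur
        in cases #1 and #3, suggesting case #2 or #4. (Assumption: some
        detector exists that reports this; it is not modelled here.)
    """
    if force_foreground:
        # Administrator's choice takes priority over everything else.
        return Action.FOREGROUND_CHECK
    if unexpected_inconsistency:
        # Best-effort detection of cases #2 and #4: log and exit gracefully.
        log.error("unexpected inconsistency; manual intervention is needed")
        return Action.MANUAL_INTERVENTION
    # Cases #1 and #3: background checking is expected to do the right thing.
    return Action.BACKGROUND_CHECK


if __name__ == "__main__":
    print(choose_action(force_foreground=False, unexpected_inconsistency=False).name)
```

Note that the sketch takes the detection result as an input rather than the crash case itself, since the point of the exchange is that the system cannot reliably know which of the four cases it is in.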