From: Terry Lambert
Message-Id: <200103301252.FAA06540@usr05.primenet.com>
Subject: Re: Background Fsck
To: mckusick@mckusick.com (Kirk McKusick)
Date: Fri, 30 Mar 2001 12:52:29 +0000 (GMT)
Cc: arch@FreeBSD.ORG
In-Reply-To: <200103290522.VAA06966@beastie.mckusick.com> from "Kirk McKusick" at Mar 28, 2001 09:22:10 PM

I have a question about the safety of this approach:

You don't seem to be able to distinguish between:

1) Hardware crash without data corruption - e.g. power failure

2) Hardware crash with data corruption - e.g. disk/controller/memory
   failure

3) Software crash without data corruption - e.g. resource availability
   failure, or panic as a result of coding error

4) Software crash with data corruption - e.g.
   a panic resulting from kernel data becoming corrupt, with an
   unknown interval preceding the crash in which some of these
   structures might have had FS data in them, or such a crash in the
   FS code path itself, where the data corruption was a primary
   effect instead of a side effect

It seems to me that background checking is only safe in cases 1 and 3,
and (the current California power grid reliability notwithstanding)
that these cases are not provably the statistically most common cases.

The reason Whistle did not do this work earlier was that we were
unable to address this concern adequately without non-volatile RAM to
store the failure reason and the disk write cache status.

Since panic reasons are mathematically indistinguishable in the limit,
we were also unable to address differentiating 3 and 4 without placing
the FS and I/O subsystem into a separate protection domain. Even doing
this, we would only gain some statistical protection against #4, which
means the only value we could add was to case #1, were we to invest in
the additional hardware.

In other words, it was not speed of fsck which drove Whistle to soft
updates.

My question is this: how were you able to address these issues in your
implementation?

                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.