From owner-freebsd-arch  Fri Mar 30  4:52:53 2001
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <200103301252.FAA06540@usr05.primenet.com>
Subject: Re: Background Fsck
To: mckusick@mckusick.com (Kirk McKusick)
Date: Fri, 30 Mar 2001 12:52:29 +0000 (GMT)
Cc: arch@FreeBSD.ORG
In-Reply-To: <200103290522.VAA06966@beastie.mckusick.com> from "Kirk McKusick" at Mar 28, 2001 09:22:10 PM

I have a question about the safety of this approach:

You don't seem to be able to distinguish between:

1)	Hardware crash without data corruption -
		e.g. power failure

2)	Hardware crash with data corruption -
		e.g. disk/controller/memory failure

3)	Software crash without data corruption -
		e.g. resource availability failure, or panic
		     as a result of coding error

4)	Software crash with data corruption -
		e.g. a panic resulting from kernel data
		     becoming corrupt, with an unknown
		     interval preceding the crash in which
		     some of these structures might have
		     had FS data in them, or such a crash in
		     the FS code path itself, where the data
		     corruption was a primary effect instead
		     of a side effect

It seems to me that background checking is only safe in cases
1 and 3, and (the current California power grid reliability
notwithstanding) that these cases are not provably the statistically
most common cases.
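
To make the concern concrete, here is a minimal sketch in C of the
decision described above.  The enum values, the function name, and the
class_known flag are all hypothetical names chosen for illustration;
nothing here is taken from the actual fsck or soft updates code.  The
sketch assumes that a background check is acceptable only when the
crash is positively known to fall into case 1 or case 3, and that an
undetermined cause must be treated as unsafe.

```c
#include <stdbool.h>

/* Hypothetical labels for the four crash cases listed above. */
enum crash_class {
	CRASH_HW_CLEAN = 1,	/* 1: hardware crash, no data corruption */
	CRASH_HW_CORRUPT,	/* 2: hardware crash, data corruption */
	CRASH_SW_CLEAN,		/* 3: software crash, no data corruption */
	CRASH_SW_CORRUPT	/* 4: software crash, data corruption */
};

/*
 * Return true only when a background check would be safe under the
 * argument above.  class_known is an assumption of this sketch: it
 * stands for "something recorded the crash cause reliably".  When the
 * cause cannot be determined, fall back to a foreground check.
 */
static bool
background_check_safe(enum crash_class c, bool class_known)
{
	if (!class_known)
		return false;
	return c == CRASH_HW_CLEAN || c == CRASH_SW_CLEAN;
}
```

The point of the class_known argument is that, without some reliable
record of the failure reason, the function can never return true.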

The reason Whistle did not do this work earlier was that we were
unable to address this concern adequately without non-volatile
RAM to store the failure reason and the disk write cache status.
Since panic reasons are mathematically indistinguishable in the
limit, we were also unable to address differentiating 3 and 4
without placing the FS and I/O subsystem into a separate
protection domain.  Even doing this, we would only gain some
statistical protection against #4, which means the only value
which we could add was to case #1, were we to invest in the
additional hardware.

In other words, it was not speed of fsck which drove Whistle to
soft updates.

My question is this: how were you able to address these issues
in your implementation?


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
