From owner-freebsd-current@FreeBSD.ORG  Tue Mar 25 17:42:54 2003
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 9BFE337B401
	for <current@freebsd.org>; Tue, 25 Mar 2003 17:42:54 -0800 (PST)
Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net
	[207.217.120.188])
	by mx1.FreeBSD.org (Postfix) with ESMTP id E06AE43FA3
	for <current@freebsd.org>; Tue, 25 Mar 2003 17:42:53 -0800 (PST)
	(envelope-from tlambert2@mindspring.com)
Received: from pool0212.cvx21-bradley.dialup.earthlink.net ([209.179.192.212]
	helo=mindspring.com)
	by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128)
	(Exim 3.33 #1)	id 18xzwD-0005zS-00; Tue, 25 Mar 2003 17:42:46 -0800
Message-ID: <3E810547.3653FFEA@mindspring.com>
Date: Tue, 25 Mar 2003 17:41:27 -0800
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: The Anarcat <anarcat@anarcat.ath.cx>
References: <20030324215712.GA844@fump.kawo2.rwth-aachen.de>
	<3E7FE3CE.ECD2775F@mindspring.com>
	<20030325110843.GF1700@fump.kawo2.rwth-aachen.de>
	<3E804392.40844D63@mindspring.com> <20030325161632.GB600@lenny.anarcat.ath.cx>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4198e648d972572b02390b181c5f52680a2d4e88014a4647c350badd9bab72f9c350badd9bab72f9c
X-Spam-Status: No, hits=-22.2 required=5.0
	tests=EMAIL_ATTRIBUTION,QUOTED_EMAIL_TEXT,RCVD_IN_OSIRUSOFT_COM,
	      REFERENCES,REPLY_WITH_QUOTES
	autolearn=ham	version=2.50
X-Spam-Level: 
X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp)
cc: current@FreeBSD.org
cc: Alexander Langer <alex@big.endian.de>
Subject: Re: several background fsck panics
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Mar 2003 01:42:59 -0000

The Anarcat wrote:
> > When you killed the power on your system and reset it, you
> > lost the cached data sitting in the ATA disk.  This is due
> > to the fact that the ATA disk lied, and claimed that it had
> > committed some writes to stable storage, when in fact it had
> > only copied them to the disk cache.  As a result, when the
> > device reset happened, you lost some writes which were in
> > progress.  Therefore your disk image was corrupt, and so your
> > FS was *not* in a self-consistent state.
> 
> Shouldn't fsck run in the foreground for disks set up with WC? That
> would be a quick hack solving this issue altogether.

There are a lot of "quick hacks" that can be done to solve the
issue.  There are also real fixes:

o	Disable BG fsck if WC is on; I dislike this hack,
	mostly because of postings by drive engineers to
	FreeBSD lists, indicating a willingness to address
	ATA issues like this, and the fact that most SCSI
	drives don't actually have this issue.

o	Put a counter in the first superblock; it would be
	incremented when the BG fsck is started, and reset
	to zero when it completes.  If the counter reaches
	3 (or some command line specified number), then the
	BG flagging is ignored, and a full FG fsck is then
	performed instead.  I like this idea because it will
	always work, and it's not actually a hack, it's a
	correct solution.

o	Implement "soft read-only".  The place that most of
	the complaints are coming from is desktop users, with
	relatively quiescent machines.  Though swap is used,
	it does not occur in an FS partition.  As a result,
	the FS could be marked "read-only" for long periods of
	time.  This marking would be in memory.  The clean bit
	would be set on the superblock.  When a write occurs,
	the clean bit would be reset to "dirty", and committed
	to disk prior to the write operation being permitted
	to proceed (a stall barrier).  I like this idea because,
	for the most part, it eliminates fsck, both BG and FG,
	on systems that crash while it's in effect.  The net
	result is a system that is statistically much more
	tolerant of failures, but which still requires another
	safety net, such as the previous solution.

o	Disk manufacturers could fix the ATA write caching
	problem.  I think this will happen eventually, so the
	first "solution" is out.

o	PC manufacturers could provide OS-usable NVRAM scratch
	areas, which would permit an OS to allocate a section,
	and use it.  The OS would then write the FreeBSD marker
	into an area to allocate it, and then write "power fail"
	as the failure code into the allocated area.  When a
	panic or hardware failure occurred, it could write "panic"
	or "hardware fail" as the failure code.  When the system
	came back up, it would be able to distinguish which type
	of failure by reading the NVRAM area.  If it was something
	like "panic with sync", it could run the BG fsck, otherwise
	it would run the FG fsck.  I really like this idea, too.  I
	believe that more modern systems have this capability, but
	it has not yet been standardized.  Therefore we should take
	a "wait and see" attitude towards it.

o	Disk manufacturers could provide a Lithium battery on board
	disks.  This would not only bound their "planned obsolescence"
	curve to 5 years or so (lifetime of the battery), it would
	give them an aftermarket.  The battery would trickle-charge
	from the disk drive power, and would be used to commit the
	write cache in the event of power failure.  I like this too; it
	makes disk drives obsolete at about 2X the interval at which they
	now become obsolete, it gives the drive manufacturers a bone for
	playing along, and it actually solves the problem at its
	source.  People might not like "your disk lasts 5 years" vs.
	"your warranty is one year", but smoothing the market demand
	function is probably worth more, in terms of lower cost to
	consumers and assured profit to disk manufacturers, and it
	can be billed as a marketing checkbox item, to force all the
	other disk manufacturers into implementing it, too, so there
	should be no downside.

o	We can change our file system structure to "journalled"; I like
	this as well, but there are some issues with manufacturers who
	do not provide track boundary information, so you can assure
	yourselves that a track boundary doesn't span a corruption
	boundary, in the event of a power failure.  If you can do this,
	journalling actually becomes incredibly fast, since you know
	the disk writes backwards on a given track, so you can just
	implement the "completed write" datestamp, and perform a single
	write, instead of two writes, in order to get a track on the
	disk.
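
As a rough illustration of the superblock counter item above, here is
a minimal sketch in C.  Everything in it is hypothetical: the struct,
the field name bgfsck_attempts, and the function names are invented
for the example and do not exist in the FFS superblock or in fsck.
It only models the decision logic; the caller would still have to
commit the superblock to disk after each change.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the one superblock field this sketch
 * needs; the real FFS superblock is assumed NOT to have it today.
 */
struct sb_sketch {
	uint32_t bgfsck_attempts;	/* BG fscks started, not completed */
};

/* Default limit from the text: "3 (or some command line number)". */
enum { BGFSCK_MAX_ATTEMPTS = 3 };

/* At boot: may the BG flagging still be honored? */
static bool
bgfsck_allowed(const struct sb_sketch *sb, uint32_t max_attempts)
{
	return (sb->bgfsck_attempts < max_attempts);
}

/*
 * Call just before a BG fsck starts.  The caller must write the
 * superblock to disk before the fsck itself begins.
 */
static void
bgfsck_begin(struct sb_sketch *sb)
{
	sb->bgfsck_attempts++;
}

/* Call when a fsck (BG or FG) runs to completion. */
static void
fsck_complete(struct sb_sketch *sb)
{
	sb->bgfsck_attempts = 0;
}
```

With this shape, a machine that panics during three consecutive BG
fsck runs finds bgfsck_allowed() false on the fourth boot and falls
back to a full FG fsck, which resets the counter when it finishes.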

There are other approaches that I'm not prepared to share in a forum
where they might be made public, but you get the idea.  Several of the
above are implementable now, particularly the counter and the soft
read-only, with a day or less of effort.
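
For the soft read-only idea, the ordering is the important part: the
"dirty" marking has to reach the disk before the first write is let
through.  The following is a minimal sketch of that state machine in
C, under stated assumptions: the struct, its fields, and the softro_*
function names are all made up for illustration, and the superblock
write is modeled by a counter rather than by real I/O.

```c
#include <stdbool.h>

/* Hypothetical in-memory mount state; these are not FreeBSD names. */
struct softro_mount {
	bool soft_ro;		/* in-memory only: FS is quiescent */
	bool ondisk_clean;	/* models the superblock clean bit */
	int  sb_flushes;	/* synchronous superblock writes so far */
};

/* Models a superblock write that must reach stable storage. */
static void
softro_flush_sb(struct softro_mount *mp, bool clean)
{
	mp->ondisk_clean = clean;
	mp->sb_flushes++;
}

/*
 * Enter soft read-only.  Assumes the caller has already flushed all
 * dirty buffers, so the on-disk image is self-consistent.
 */
static void
softro_enter(struct softro_mount *mp)
{
	if (mp->soft_ro)
		return;
	softro_flush_sb(mp, true);
	mp->soft_ro = true;
}

/*
 * Stall barrier: every write path calls this first, and the write
 * may proceed only after it returns.  Only the first write after a
 * quiescent period pays for the extra superblock update.
 */
static void
softro_write_barrier(struct softro_mount *mp)
{
	if (!mp->soft_ro)
		return;
	softro_flush_sb(mp, false);
	mp->soft_ro = false;
}
```

A crash while soft_ro is set leaves the clean bit set on disk, so no
fsck is needed at all; a crash at any other time looks exactly like
it does today.  Note that this still depends on the drive honestly
committing the superblock write, which is why it needs a second
safety net such as the counter.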

-- Terry