From owner-freebsd-current@FreeBSD.ORG  Fri Mar 28 21:47:37 2003
Message-ID: <3E853324.16550524@mindspring.com>
Date: Fri, 28 Mar 2003 21:46:12 -0800
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: David Schultz <das@FreeBSD.ORG>
References: <20030324215712.GA844@fump.kawo2.rwth-aachen.de>
	<3E7FE3CE.ECD2775F@mindspring.com>
	<20030325110843.GF1700@fump.kawo2.rwth-aachen.de>
	<3E804392.40844D63@mindspring.com> <20030325161632.GB600@lenny.anarcat.ath.cx>
	<3E810547.3653FFEA@mindspring.com>
	<20030328235250.GA22044@HAL9000.homeunix.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
cc: current@FreeBSD.ORG
cc: Alexander Langer <alex@big.endian.de>
Subject: Re: [Re: several background fsck panics

David Schultz wrote:
> Thus spake Terry Lambert <tlambert2@mindspring.com>:
> > o     Put a counter in the first superblock; it would be
> >       incremented when the BG fsck is started, and reset
> >       to zero when it completes.  If the counter reaches
> >       3 (or some command line specified number), then the
> >       BG flagging is ignored, and a full FG fsck is then
> >       performed instead.  I like this idea because it will
> >       always work, and it's not actually a hack, it's a
> >       correct solution.
> 
> I'm glad you like it because AFAIK, it is already implemented.  ;-)

Nope.  What's implemented is the FS_NEEDSFSCK flag.  But that
flag is not set in the superblock flags field as *the very first
thing done*.

Thus a failure that results in a panic never sets the flag, since
the code never gets as far as pfatal().

Probably the correct thing to do is to set the flag as the very
first operation, and then it will work as expected.
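A minimal sketch of that ordering, in C. Everything here is hypothetical: the trimmed superblock struct, SKETCH_FS_NEEDSFSCK (a stand-in for the real FS_NEEDSFSCK in <ufs/ffs/fs.h>; its value is not assumed to match), and the write and check helpers, which only simulate disk I/O. A failure return stands in for the panic case.

```c
#include <stdint.h>

/* Stand-in for FS_NEEDSFSCK from <ufs/ffs/fs.h>; the value is
   chosen for this sketch only. */
#define SKETCH_FS_NEEDSFSCK 0x04

/* Hypothetical, heavily trimmed superblock. */
struct sketch_superblock {
    int32_t fs_flags;
};

/* Simulated synchronous superblock write; counts calls so the
   ordering can be checked. */
int sketch_sb_writes = 0;
static int sketch_sbwrite(struct sketch_superblock *fs)
{
    (void)fs;
    sketch_sb_writes++;
    return 0;
}

/* Simulated background pass over the CG bitmaps.  Setting
   sketch_fail_check makes it fail, standing in for a crash. */
int sketch_fail_check = 0;
static int sketch_check_cg_bitmaps(struct sketch_superblock *fs)
{
    (void)fs;
    return sketch_fail_check ? -1 : 0;
}

int sketch_bgfsck(struct sketch_superblock *fs)
{
    /* Very first thing: mark the FS as needing a full check and
       push that to disk before touching anything else. */
    fs->fs_flags |= SKETCH_FS_NEEDSFSCK;
    if (sketch_sbwrite(fs) != 0)
        return -1;

    /* The real work; a panic in here leaves the flag on disk. */
    if (sketch_check_cg_bitmaps(fs) != 0)
        return -1;          /* flag stays set */

    /* Only a clean finish clears the flag again. */
    fs->fs_flags &= ~SKETCH_FS_NEEDSFSCK;
    return sketch_sbwrite(fs);
}
```

The point is only the ordering: once the flag has reached the disk, a panic anywhere later leaves it set for the next boot.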

FWIW, it looks like the code in pfatal() wanted to be in main(),
since it complains about not being able to run in the background,
the same way main() does.

However, this still leaves a race window.

The reason the panic happens is that FreeBSD is running processes
on a corrupt FS.

Even in the best case, this panic may occur when anything is
loaded off the FS, so it could happen on init, or on fsck
itself, etc.

So really, the only solution is a counter that the kernel FS code
increments, and which is reset to zero when a BG fsck completes
successfully.  For example, it could use the first byte of
fs_sparecon32[].

BTW: This still leaves a failure case.  The BG fsck has to be
able to complete successfully, but even that is not enough to
stave off a future panic from an error the fsck never detected,
because it was only pruning the cylinder group (CG) bitmaps.

So the correct place to zero the counter is, once again, in the
kernel: as part of a successful unmount during a non-panic
shutdown.

This does mean that three (or "count") consecutive power failures
get you a FG fsck, but that's probably livable (if you were that
certain there was no corruption, you could boot to a shell and
override the "count" parameter that sets the FG fsck threshold).
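A sketch of the counter scheme, assuming the kernel bumps the counter on every mount and zeroes it only on a clean unmount. The struct, the function names, and the use of byte 0 of fs_sparecon32[0] are illustrative assumptions, not existing FreeBSD code.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Default trigger; fsck would accept an override on its command line. */
#define SKETCH_BGCOUNT_THRESHOLD 3

/* Hypothetical, heavily trimmed superblock: only the spare area. */
struct sketch_superblock {
    int32_t fs_sparecon32[4];
};

/* Borrow the first byte of the spare area as the counter.  Accessing
   an object through an unsigned char pointer is well defined in C. */
static unsigned char *sketch_bgcount(struct sketch_superblock *fs)
{
    return (unsigned char *)&fs->fs_sparecon32[0];
}

/* Kernel side, at mount: count up, saturating so it never wraps to 0. */
void sketch_mount(struct sketch_superblock *fs)
{
    unsigned char *c = sketch_bgcount(fs);
    if (*c < UCHAR_MAX)
        (*c)++;
    /* ...the superblock would be written to disk here... */
}

/* Kernel side, on a successful unmount during a non-panic shutdown. */
void sketch_unmount_clean(struct sketch_superblock *fs)
{
    *sketch_bgcount(fs) = 0;
}

/* fsck side: ignore the BG flagging and do a full FG check once the
   count reaches the threshold. */
bool sketch_needs_fg_fsck(struct sketch_superblock *fs, unsigned threshold)
{
    return *sketch_bgcount(fs) >= threshold;
}
```

A completed BG fsck does not touch the counter here; per the argument above, only a clean unmount resets it, so three unclean shutdowns in a row force the FG check.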


> > o     Implement "soft read-only".  The place that most of
> >       the complaints are coming from is desktop users, with
> >       relatively quiescent machines.  Though swap is used,
> >       it does not occur in an FS partition.  As a result,
> >       the FS could be marked "read-only" for long periods of
> >       time.  This marking would be in memory.  The clean bit
> >       would be set on the superblock.  When a write occurs,
> >       the clean bit would be reset to "dirty", and committed
> >       to disk prior to the write operation being permitted
> >       to proceed (a stall barrier).  I like this idea because,
> >       for the most part, it eliminates fsck, both BG and FG,
> >       on systems that crash while it's in effect.  The net
> >       result is a system that is statistically much more
> >       tolerant of failures, but which still requires another
> >       safety net, such as the previous solution.
> 
> I was thinking of doing something like this myself as part of an
> ``idle timeout'' for disks.  (Marking the filesystem clean after a
> period of quiescence would actually interfere with ATA disks'
> built-in mechanism for spinning down after a timeout, which is
> important for laptops, so the OS would have to track the true
> amount of idle time.)  Annoyingly, I can never get the disk
> containing /var to remain quiescent for long while cron is running
> (even without any crontabs), and I hope this can be solved without
> disabling cron or adding a nontrivial hack to bio.

We implemented this when we implemented soft updates in FFS under
Windows at Artisoft.  That was back before ATX power supplies were
widespread, and we needed to be tolerant of users who simply
turned off the power switch without running the Windows 95
shutdown sequence.
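A minimal sketch of the "soft read-only" state machine from the quoted proposal, assuming a periodic idle tick (for example from the syncer) and assuming all dirty buffers have been flushed before the clean bit is written. The struct and function names are hypothetical, the superblock write is simulated, and the idle timeout is measured from the last write seen in memory.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-mount state. */
struct sketch_mount {
    bool   ondisk_clean;  /* what the superblock clean bit says */
    time_t last_write;    /* time of the last write, memory only */
    int    sb_writes;     /* counts simulated superblock writes */
};

/* Simulated synchronous superblock write of the clean bit. */
static void sketch_sb_sync(struct sketch_mount *mp, bool clean)
{
    mp->ondisk_clean = clean;
    mp->sb_writes++;
}

/* Called before any write may proceed: the stall barrier.  If the
   disk says "clean", "dirty" must be committed first. */
void sketch_before_write(struct sketch_mount *mp, time_t now)
{
    if (mp->ondisk_clean)
        sketch_sb_sync(mp, false);
    mp->last_write = now;
}

/* Periodic tick: after idle_timeout seconds with no writes, mark the
   FS clean on disk again (assumes buffers are already flushed). */
void sketch_idle_tick(struct sketch_mount *mp, time_t now,
                      time_t idle_timeout)
{
    if (!mp->ondisk_clean && now - mp->last_write >= idle_timeout)
        sketch_sb_sync(mp, true);
}
```

The write that flips the bit pays for one synchronous superblock write; later writes in the same busy period pay nothing, and a crash while the bit is clean needs no fsck at all.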

I dunno about cron.  I think having it "notice" crontab changes
"automatically" may have made it too smart for its own good.

Cron updates the "access" time on the crontab file every time it
runs, which is once a second.  If you disabled the access time
update for that fstat, the problem would go away.  I'm not sure
the semantics would be OK, though.

The old pre-"smarter" cron would not have this problem, as it
would run on intervals, and sleep for long periods (until the
next job was scheduled to run), and you had to hit it over the
head with "kill -HUP" to tell it the file changed.

Probably the correct thing to do is to use old-style long delta
intervals, and register a kevent interest in file modifications.
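A sketch of that approach, in C. sleep_interval() is plain logic; the kqueue(2)/kevent(2) part uses the real EVFILT_VNODE filter and NOTE_* flags but is compiled only on FreeBSD and macOS. The function names and the max_sleep cap are assumptions for illustration, not how any cron actually does it.

```c
#include <time.h>

#if defined(__FreeBSD__) || defined(__APPLE__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#define SKETCH_HAVE_KQUEUE 1
#endif

/* How long to sleep: until the next scheduled job, but never past
   max_sleep (a hypothetical safety cap). */
time_t sleep_interval(time_t now, time_t next_job, time_t max_sleep)
{
    if (next_job <= now)
        return 0;
    time_t delta = next_job - now;
    return delta > max_sleep ? max_sleep : delta;
}

#ifdef SKETCH_HAVE_KQUEUE
/* Register interest in writes, deletes and renames of `path`.  The
   fd must stay open for the watch to stay active, so it is handed
   back in *fd_out.  Returns the kqueue descriptor, or -1. */
int watch_crontab(const char *path, int *fd_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int kq = kqueue();
    if (kq < 0) {
        close(fd);
        return -1;
    }
    struct kevent change;
    EV_SET(&change, fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
           NOTE_WRITE | NOTE_EXTEND | NOTE_DELETE | NOTE_RENAME, 0, 0);
    if (kevent(kq, &change, 1, NULL, 0, NULL) < 0) {
        close(kq);
        close(fd);
        return -1;
    }
    *fd_out = fd;
    return kq;
}

/* Sleep up to `seconds`.  Returns 1 if the crontab changed, 0 on
   timeout (time to run the next job), -1 on error. */
int wait_job_or_change(int kq, time_t seconds)
{
    struct kevent ev;
    struct timespec ts = { seconds, 0 };
    int n = kevent(kq, NULL, 0, &ev, 1, &ts);
    if (n < 0)
        return -1;
    return n > 0 ? 1 : 0;
}
#endif
```

No periodic stat() is needed: the process sleeps until the next job or until the kernel reports a change. After a delete or rename the caller would have to re-open the file and re-register; that bookkeeping is omitted.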

The cruddy thing is, if it were really read-only, then the access
time update wouldn't happen.  Catch-22.

I think maybe it's useful to distinguish the POSIX semantics here:
"shall be scheduled for update" is not the same thing, really, as
"shall be updated".  So, in practice, you could cache the access
time update for long periods, as long as the correct time was
marked in memory, and the write is scheduled to occur "eventually".
So it's possible there is an "out", without having to worry about
fixing cron so it's not so darn aggressive.
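A minimal sketch of the "scheduled for update" reading: the correct access time lives in memory and is what stat() reports, while the disk copy is written lazily. The struct, the helpers, and the flush policy (at most once per `delay` seconds, plus a forced flush on sync or unmount) are assumptions for illustration; the disk write is simulated.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical in-memory inode, trimmed to the access time. */
struct sketch_inode {
    time_t atime_mem;   /* correct access time, kept in memory */
    time_t atime_disk;  /* access time as last written to disk */
    bool   atime_dirty; /* update "scheduled" but not yet written */
    time_t last_flush;  /* when atime was last written out */
    int    disk_writes; /* counts simulated inode writes */
};

/* On read access: record the time in memory only, no I/O. */
void sketch_mark_access(struct sketch_inode *ip, time_t now)
{
    ip->atime_mem = now;
    ip->atime_dirty = true;
}

/* stat() reports the in-memory value, so callers see the right time
   even while the on-disk copy is stale. */
time_t sketch_stat_atime(const struct sketch_inode *ip)
{
    return ip->atime_mem;
}

/* Called periodically, and with force=true on sync or unmount.  A
   pending update is written at most once per `delay` seconds. */
void sketch_flush_atime(struct sketch_inode *ip, time_t now,
                        time_t delay, bool force)
{
    if (!ip->atime_dirty)
        return;
    if (!force && now - ip->last_flush < delay)
        return;
    ip->atime_disk = ip->atime_mem;   /* simulated disk write */
    ip->atime_dirty = false;
    ip->last_flush = now;
    ip->disk_writes++;
}
```

With cron touching the crontab once a second, this collapses up to `delay` seconds of access time updates into a single inode write.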

Gotta wonder how much frequent rewriting of one area of the disk
you can handle before the wear is enough to shorten the MTBF.  8-(.

-- Terry