From owner-freebsd-current@FreeBSD.ORG Fri Mar 28 21:47:37 2003
Message-ID: <3E853324.16550524@mindspring.com>
Date: Fri, 28 Mar 2003 21:46:12 -0800
From: Terry Lambert
To: David Schultz
cc: current@FreeBSD.ORG
cc: Alexander Langer
Subject: Re: [Re: several background fsck panics

David Schultz wrote:
> Thus spake Terry Lambert:
> > o Put a counter in the first superblock; it would be
> >   incremented when the BG fsck is started, and reset to zero
> >   when it completes.  If the counter reaches
> >   3 (or some command line specified number), then the
> >   BG flagging is ignored, and a full FG fsck is then
> >   performed instead.  I like this idea because it will
> >   always work, and it's not actually a hack, it's a
> >   correct solution.
>
> I'm glad you like it because AFAIK, it is already implemented.  ;-)

Nope.  What's implemented is the FS_NEEDSFSCK flag.  But that flag
is not set in the superblock flags field as *the very first thing
done*.  Thus a failure that results in a panic will not set the
flag in pfatal(), since it never gets there.

Probably the correct thing to do is to set the flag as the very
first operation, and then it will work as expected.  FWIW, it
looks like the code in pfatal() wanted to be in main(), since it
complains about not being able to run in the background, the
same way main() does.

However, this still leaves a race window.  The reason the panic
happens is that FreeBSD is running processes on a corrupt FS.
Even in the best case, this panic may occur when anything is
loaded off the FS, so it could happen on init, or on fsck itself,
etc.

So really, the only solution is a counter that the FS kernel code
counts up, which is reset to zero when a BG fsck completes
successfully.  Say grabbing the first byte of fs_sparecon32[].

BTW: This still leaves a failure case: the BG fsck has to be able
to complete successfully... but that's not enough to stave off a
future panic from an undetected error that the fsck didn't see,
because it was only pruning CG bitmaps.

So the correct place to zero the counter is, once again, in the
kernel.  As a result of a successful unmount, from a non-panic
shutdown.

This does mean that three (or "count") consecutive power failures
gets you a FG fsck, but that's probably livable (if you were that
certain there was no corruption, you could boot to a shell and
override the "count" parameter to the FG fsck trigger threshold).

> > o Implement "soft read-only".
> >   The place that most of
> >   the complaints are coming from is desktop users, with
> >   relatively quiescent machines.  Though swap is used,
> >   it does not occur in an FS partition.  As a result,
> >   the FS could be marked "read-only" for long periods of
> >   time.  This marking would be in memory.  The clean bit
> >   would be set on the superblock.  When a write occurs,
> >   the clean bit would be reset to "dirty", and committed
> >   to disk prior to the write operation being permitted
> >   to proceed (a stall barrier).  I like this idea because,
> >   for the most part, it eliminates fsck, both BG and FG,
> >   on systems that crash while it's in effect.  The net
> >   result is a system that is statistically much more
> >   tolerant of failures, but which still requires another
> >   safety net, such as the previous solution.
>
> I was thinking of doing something like this myself as part of an
> ``idle timeout'' for disks.  (Marking the filesystem clean after a
> period of quiescence would actually interfere with ATA disks'
> built-in mechanism for spinning down after a timeout, which is
> important for laptops, so the OS would have to track the true
> amount of idle time.)  Annoyingly, I can never get the disk
> containing /var to remain quiescent for long while cron is running
> (even without any crontabs), and I hope this can be solved without
> disabling cron or adding a nontrivial hack to bio.

We implemented this when we implemented soft updates in FFS under
Windows at Artisoft.  That was back before ATX power supplies were
widespread, and we needed to be tolerant of users who simply
turned off the power switch, without running the Windows95
shutdown sequence.

I dunno about cron.  I think it "noticing" crontab changes
"automatically" has maybe made it too smart for its own good.
Cron updates the "access" time on the crontab file every time it
runs, which is once a second.  If you disabled this for fstat, the
problem would go away.  I'm not sure the semantics are OK, though.
The old pre-"smarter" cron would not have this problem, as it
would run on intervals, and sleep for long periods (until the next
job was scheduled to run), and you had to hit it over the head
with "kill -HUP" to tell it the file changed.  Probably the correct
thing to do is to use old-style long delta intervals, and register
a kevent interest in file modifications.

The cruddy thing is, if it were really read-only, then the access
time update wouldn't happen.  Catch-22.

I think maybe it's useful to distinguish the POSIX semantics here:
"shall be scheduled for update" is not the same thing, really, as
"shall be updated".  So, in practice, you could cache the access
time update for long periods, as long as the correct time was
marked in memory, and the write is scheduled to occur "eventually".

So it's possible there is an "out", without having to worry about
fixing cron so it's not so darn aggressive.

Gotta wonder how much rewriting of one area of the disk with great
frequency you can handle, before it becomes a cause of disk wear
enough to shorten the MTBF.  8-(.

-- Terry
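
For reference, the counter policy discussed above can be sketched as a
small user-space simulation.  This is only an illustration, not kernel
or fsck code.  The names Superblock, unclean_boots, kernel_mount(),
kernel_clean_unmount(), choose_fsck() and boot() are all hypothetical,
and the ordering (fsck decision first, then the kernel counts up at
mount) is an assumption about how the pieces would fit together.

```python
from dataclasses import dataclass

FG_THRESHOLD = 3  # the default "count" from the message; assumed overridable


@dataclass
class Superblock:
    """Toy stand-in for the on-disk superblock (not the real struct fs)."""
    # Hypothetical counter, e.g. the first byte of fs_sparecon32[].
    unclean_boots: int = 0


def kernel_mount(sb: Superblock) -> None:
    """Kernel side: count up at mount, so a later panic cannot skip it."""
    sb.unclean_boots += 1


def kernel_clean_unmount(sb: Superblock) -> None:
    """Kernel side: only a successful unmount on a non-panic shutdown
    zeroes the counter."""
    sb.unclean_boots = 0


def choose_fsck(sb: Superblock, threshold: int = FG_THRESHOLD) -> str:
    """Boot-time policy: ignore the BG flagging once the counter has
    reached the threshold, and run a full foreground fsck instead."""
    return "foreground" if sb.unclean_boots >= threshold else "background"


def boot(sb: Superblock, threshold: int = FG_THRESHOLD) -> str:
    """One boot: decide the fsck mode, then let the kernel mount the FS."""
    mode = choose_fsck(sb, threshold)
    kernel_mount(sb)
    return mode


# Four boots in a row, each ending in a power failure (no clean unmount).
sb = Superblock()
print([boot(sb) for _ in range(4)])
```

In this sketch the first three boots pick a background fsck and the
fourth, coming after three consecutive unclean shutdowns, picks a
foreground one.  Calling kernel_clean_unmount() at any point puts the
counter back to zero.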