From: Terry Lambert
Date: Tue, 25 Mar 2003 17:41:27 -0800
To: The Anarcat
Cc: current@FreeBSD.org, Alexander Langer
Subject: Re: several background fsck panics
Message-ID: <3E810547.3653FFEA@mindspring.com>
List-Id: Discussions about the use of FreeBSD-current

The Anarcat wrote:
> > When you killed the power on your system and reset it, you
> > lost the cached data sitting in the ATA disk.  This is due
> > to the fact that the ATA disk lied, and claimed that it had
> > committed some writes to stable storage, when in fact it had
> > only copied them to the disk cache.  As a result, when the
> > device reset happened, you lost some writes which were in
> > progress.  Therefore your disk image was corrupt, and so your
> > FS was *not* in a self-consistent state.
>
> Shouldn't fsck run in the foreground for disks set up with WC?  That
> would be a quick hack solving this issue altogether.

There are a lot of "quick hacks" that can be done to solve the
issue.  There are also real fixes:

o   Disable BG fsck if WC is on; I dislike this hack, mostly
    because of postings by drive engineers to FreeBSD lists,
    indicating a willingness to address ATA issues like this,
    and the fact that most SCSI drives don't actually have this
    issue.

o   Put a counter in the first superblock; it would be incremented
    when the BG fsck is started, and reset to zero when it
    completes.  If the counter reaches 3 (or some command line
    specified number), then the BG flagging is ignored, and a
    full FG fsck is performed instead.

    I like this idea because it will always work, and it's not
    actually a hack, it's a correct solution.

o   Implement "soft read-only".  The place that most of the
    complaints are coming from is desktop users, with relatively
    quiescent machines.  Though swap is used, it does not occur
    in an FS partition.  As a result, the FS could be marked
    "read-only" for long periods of time.  This marking would be
    in memory.  The clean bit would be set on the superblock.
    When a write occurs, the clean bit would be reset to "dirty",
    and committed to disk prior to the write operation being
    permitted to proceed (a stall barrier).

    I like this idea because, for the most part, it eliminates
    fsck, both BG and FG, on systems that crash while it's in
    effect.  The net result is a system that is statistically
    much more tolerant of failures, but which still requires
    another safety net, such as the previous solution.

o   Disk manufacturers could fix the ATA write caching problem.

    I think this will happen eventually, so the first "solution"
    is out.

o   PC manufacturers could provide OS-usable NVRAM scratch areas,
    which would permit an OS to allocate a section, and use it.
    The OS would then write the FreeBSD marker into an area to
    allocate it, and then write "power fail" as the failure code
    into the allocated area.  When a panic or hardware failure
    occurred, it could write "panic" or "hardware fail" as the
    failure code.  When the system came back up, it would be
    able to distinguish which type of failure had occurred by
    reading the NVRAM area.  If it was something like "panic
    with sync", it could run the BG fsck, otherwise it would run
    the FG fsck.

    I really like this idea, too.  I believe that more modern
    systems have this capability, but it has not yet been
    standardized.  Therefore we should take a "wait and see"
    attitude towards it.

o   Disk manufacturers could provide a Lithium battery on board
    disks.  This would not only bound their "planned obsolescence"
    curve to 5 years or so (lifetime of the battery), it would
    give them an aftermarket.  The battery would trickle-charge
    from the disk drive power, and would be used to commit the
    write cache in the event of power failure.

    I like this too; it puts the obsolescence point at about 2X
    the interval at which drives become obsolete today, it gives
    the drive manufacturers a bone for playing along, and it
    actually solves the problem at its source.  People might not
    like "your disk lasts 5 years" vs. "your warranty is one
    year", but smoothing the market demand function is probably
    worth more, in terms of lower cost to consumers and assured
    profit to disk manufacturers, and it can be billed as a
    marketing checkbox item, to force all the other disk
    manufacturers into implementing it, too, so there should be
    no downside.

o   We can change our file system structure to "journalled".

    I like this as well, but there are some issues with
    manufacturers who do not provide track boundary information,
    which you need in order to assure yourself that a track
    boundary doesn't span a corruption boundary, in the event of
    a power failure.  If you can do this, journalling actually
    becomes incredibly fast, since you know the disk writes
    backwards on a given track, so you can just implement the
    "completed write" datestamp, and perform a single write,
    instead of two writes, in order to get a track on the disk.

There are other approaches that I'm not prepared to share in a
forum where they might be made public, but you get the idea.

Several of the above are implementable now, particularly the
counter and the soft read-only, with a day or less of effort.

-- 
Terry