Date: Tue, 25 Mar 2003 03:54:58 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Alexander Langer <alex@big.endian.de>
Cc: current@FreeBSD.org
Subject: Re: several background fsck panics
Message-ID: <3E804392.40844D63@mindspring.com>
References: <20030324215712.GA844@fump.kawo2.rwth-aachen.de> <3E7FE3CE.ECD2775F@mindspring.com> <20030325110843.GF1700@fump.kawo2.rwth-aachen.de>

Alexander Langer wrote:
> Thus spake Terry Lambert (tlambert2@mindspring.com):
> > Disable write caching on your ATA drive.  You should be able to
> > "safely" reset after that.
>
> Good idea, thanks.  Nevertheless: I don't think the system should
> panic on background fsck's, while a manual fsck works.

A manual fsck can deal with corrupt data.  A background fsck can
only deal with invalid cylinder group bitmaps, and operates on a
snapshot.  For a background fsck to be feasible, the FS has to be
in a self-consistent state already, which it wasn't.

When you killed the power on your system and reset it, you lost the
cached data sitting in the ATA disk.  This is because the ATA disk
lied, and claimed that it had committed some writes to stable
storage, when in fact it had only copied them to the disk cache.
As a result, when the device reset happened, you lost some writes
which were in progress.  Therefore your disk image was corrupt, and
so your FS was *not* in a self-consistent state.

This type of error happens on ATA disks because they do not permit
disconnects during writes, only during reads.

If you want to be able to reset your machine out from under your
disk, with caching turned on, buy SCSI hardware instead of ATA
hardware: it does not lie to the host system and claim that tagged
writes have been committed to stable storage when they are in fact
only in (volatile) cache RAM.

The panic was not, in fact, a result of the background fsck itself:
it was a result of an attempt by the kernel to access FS structures
through the FS, assuming -- incorrectly -- that the FS structures
were in a self-consistent state.

This assumption was bogus, but there was no way for the kernel to
know this, because the failure state was not recovered, and that
happened because PC hardware is bogus.

This happened because you had background fsck enabled, and it was
unable to tell the difference between a power failure, a panic, and
some other cause for a system crash (hardware or other failure).
This is because the PC hardware itself doesn't record these types of
events in NVRAM (e.g. CMOS), nor does it have sufficient DC holdup
time to write a failure code to NVRAM before losing its marbles.

Hope this explains why you had the problem, and why real servers
tend to specify SCSI hardware, and tend not to be PC-class hardware
(e.g. an RS/6000 would have known the failure cause when it came
back up, from reading its NVRAM, and performed a full recovery
appropriate to the failure).

PS: Unfortunately, this will not change on PCs any time soon,
because people have been trained by computer vendors, disk vendors,
and OS vendors that it's OK for PCs to "need" rebooting, and/or to
crash unexpectedly in catastrophic ways that require reinstalling
the OS.  So people tolerate hardware that has ambiguous failure
modes, as long as it costs less.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message
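
The lost-write scenario described in the message can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of real drive firmware or of the FreeBSD kernel: the `ToyDisk` class, its methods, the block names, and the `acknowledged_but_lost` helper are all hypothetical names invented for this illustration. It shows that with a volatile write cache enabled, the host can be told a write succeeded even though the data disappears on power loss.

```python
class ToyDisk:
    """Toy model of a drive with an optional volatile write cache.

    Hypothetical illustration only; not modelled on any real firmware.
    """

    def __init__(self, write_cache_enabled):
        self.write_cache_enabled = write_cache_enabled
        self.platter = {}  # stable storage: survives power loss
        self.cache = {}    # volatile cache RAM: lost on power loss

    def write(self, block, data):
        """Accept a write and report success to the host."""
        if self.write_cache_enabled:
            # Acknowledged, but the data is only in volatile RAM.
            self.cache[block] = data
        else:
            # Acknowledged only once the data is on stable storage.
            self.platter[block] = data
        return True

    def power_loss(self):
        """Drop everything that never reached stable storage."""
        self.cache.clear()

    def read(self, block):
        """Return cached data if present, else what is on the platter."""
        if block in self.cache:
            return self.cache[block]
        return self.platter.get(block)


def acknowledged_but_lost(write_cache_enabled):
    """Return the blocks the host believes were written but that are gone."""
    disk = ToyDisk(write_cache_enabled)
    acked = []
    # Two made-up, mutually dependent metadata writes.
    for block, data in [("dir_entry", "points at inode 7"),
                        ("inode_7", "initialized")]:
        if disk.write(block, data):
            acked.append(block)
    disk.power_loss()
    return [b for b in acked if disk.read(b) is None]


if __name__ == "__main__":
    print("cache on :", acknowledged_but_lost(True))
    print("cache off:", acknowledged_but_lost(False))
```

In this model, every write acknowledged with the cache enabled is lost, while none is lost with it disabled; a real drive would usually lose only some of them, which is what leaves the on-disk structures mutually inconsistent.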