Date: Wed, 27 Oct 2004 03:56:21 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: freebsd-fs@merdin.com
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Re[4]: panic again
Message-ID: <200410271056.i9RAuLcT020382@gw.catspoiler.org>
In-Reply-To: <766160464.20041027132419@merdin.com>

On 27 Oct, Pavel Merdine wrote:
> Hello ,
>
> Wednesday, October 27, 2004, 12:25:33 AM, you wrote:
>
>> On 26 Oct, Pavel Merdine wrote:
>
>>> Again, somehow after a panic on ONE file system, other filesystems are
>>> not fully synced. The system conplaints that they are dirty after
>>> restart. So it seems like one panic lead to corruption of another
>>> systems. Maybe I'm wrong here too. But I dont see any good in fsck-ing
>>> each time.
>
>> When the OS detects these types of problems, then something (we don't
>> know what) unexpected has happened, so we can no longer trust the state
>> of the machine. If we can't trust the state of the machine, then it is
>> dangerous to sync any of the file systems, because doing so could damage
>> them with corrupt data.
>
> I'm right then. Number of panic()s should be minimum. Because
> currently one error in one partition leads to corruption of other
> immediately (providing they do writes often). I think that is not
> acceptable. I just didn't make fsck, don't shoot me!

The panics only happen when a problem is detected that should never
happen.  In normal operation, certain operations on a file system may
place it temporarily in an inconsistent state, but the data on the disk
is changed in a particular order.  If the system crashes in the middle
of an operation due to a power failure or system panic, the resulting
inconsistencies therefore have certain known properties: they can be
anticipated and repaired by fsck, and the file system can be safely
accessed even before they are repaired.

It is possible for a file system to sustain types of damage that are not
anticipated in case of a power failure.
If the disk does write caching, data is likely to be written to the
platters in a different order than the file system code expects, so a
power failure during a sequence of writes may result in a partial set of
writes that leaves the file system in a corrupt state that cannot be
repaired automatically.  It is also possible for the disk to corrupt
data other than what is being written.

The other file systems will be marked as dirty, but they should not be
corrupt.  If softupdates is in use, the only inconsistency should be
that some blocks and/or inodes may be marked as allocated when they are
actually not in use.  In this case, the background fsck is able to
detect the inconsistency and mark these blocks and/or inodes as being
free so that they can be reused.

>>> Background fsck does not work in reality as well, because the system
>>> can panic thousand times before errors are fixed.
>
>> It might be a good idea to force a foreground fsck if the system panics
>> before a background fsck has marked a dirty filesystem clean.
>
> What I mean is there is no point having background fsck which can lead
> to corruption of all system partitions. Explanation: there is not
> guarantee that panic will not occur before fsck is done; that panic
> leads to reboot without other filesystems sync, so it'll lead the
> their corruption.

If all file systems except one were initially in a valid and consistent
state and one file system had some sort of damage that caused a system
panic, they would all be marked as dirty when the system crashed and
rebooted.  The only file system that could cause another panic would be
the one that was originally corrupt.  The only possible inconsistencies
in all the other file systems would be those that can be repaired by a
background fsck, and accessing these file systems before they have been
marked as clean by the background fsck should not result in a panic.
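
As a rough illustration of the softupdates point above, here is a toy
model in Python.  It is not FFS or fsck code.  The names crash_states
and background_fsck are made up for this sketch, and the two-step write
order (allocation bitmap first, inode reference second) is a simplifying
assumption about what "a particular order" means.  Under that
assumption, a crash at any step can only leave a block marked allocated
but unused, which is exactly the kind of inconsistency that can be
cleaned up later.

```python
def crash_states(block):
    """Yield the (bitmap, refs) state on disk after each ordered write.

    Assumed order for this sketch: the block is first marked allocated
    in the bitmap, and only afterwards does an inode reference it.
    """
    bitmap, refs = set(), set()
    yield set(bitmap), set(refs)   # crash before either write
    bitmap.add(block)
    yield set(bitmap), set(refs)   # crash after the bitmap write only
    refs.add(block)
    yield set(bitmap), set(refs)   # both writes reached the disk


def background_fsck(bitmap, refs):
    """Return the repaired bitmap: free blocks that nothing references.

    A referenced block that is not marked allocated is damage the write
    ordering should have made impossible, so it is reported, not fixed.
    """
    dangling = refs - bitmap
    if dangling:
        raise RuntimeError(f"referenced but free: {sorted(dangling)}")
    return bitmap & refs
```

Iterating over crash_states(7) and passing each state to
background_fsck always succeeds, and the result equals the set of
referenced blocks.  Handing it a state the ordering cannot produce, such
as background_fsck(set(), {7}), raises instead of guessing at a repair.
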

There have been bugs that caused system panics when a file system that
is undergoing a background fsck has a lot of write activity before the
fsck operation finishes.  These types of bugs should be tracked down and
fixed, though this can be difficult.

A system panic in this case makes it *easier* to find the bug.  The
sooner the system detects a problem and panics, the closer the panic and
the debug information that it produces are to the actual software bug.
If the file system code just ignored the inconsistencies and tried to
keep running, it is quite possible that the file system would be totally
trashed and all of its data lost.
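
To make the "panic early" argument concrete, here is a small sketch in
Python.  ToyVolume, write_new_block and the fail_fast switch are all
hypothetical names invented for this illustration; none of them
corresponds to a real kernel interface.  The check stands in for the
kind of "should never happen" test that leads to a panic.

```python
class ToyVolume:
    """Hypothetical in-memory volume with a free-block set."""

    def __init__(self, nblocks):
        self.free = set(range(nblocks))
        self.data = {}

    def write_new_block(self, block, payload, fail_fast=True):
        """Allocate `block` and store `payload` in it.

        Allocating a block that is already in use should never happen.
        With fail_fast the problem is reported at the point of
        detection; without it the old contents are silently lost.
        """
        if block not in self.free and fail_fast:
            raise RuntimeError(f"panic: block {block} is already allocated")
        self.free.discard(block)
        self.data[block] = payload
```

With fail_fast left on, a second write_new_block(3, ...) raises and the
first file's data in block 3 is still intact.  With fail_fast=False the
same call quietly overwrites it, and the damage only shows up later, far
from the code that caused it.
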
