From owner-freebsd-fs@FreeBSD.ORG Wed Oct 27 11:48:36 2004
Date: Wed, 27 Oct 2004 15:48:31 +0400
From: Pavel Merdine
Message-ID: <999608774.20041027154831@merdin.com>
To: Don Lewis
In-Reply-To: <200410271056.i9RAuLcT020382@gw.catspoiler.org>
References: <766160464.20041027132419@merdin.com> <200410271056.i9RAuLcT020382@gw.catspoiler.org>
Subject: Re[6]: panic again

Hello,

Wednesday, October 27, 2004, 2:56:21 PM, you wrote:

> On 27 Oct, Pavel Merdine wrote:
>> Hello,
>>
>> Wednesday, October 27, 2004, 12:25:33 AM, you wrote:
>>
>>> On 26 Oct, Pavel Merdine wrote:
>>
>>>> Again, somehow after a panic on ONE file system, other filesystems are
>>>> not fully synced. The system complains that they are dirty after
>>>> restart. So it seems like one panic leads to corruption of other
>>>> systems. Maybe I'm wrong here too. But I don't see any good in fsck-ing
>>>> each time.
>>
>>> When the OS detects these types of problems, then something (we don't
>>> know what) unexpected has happened, so we can no longer trust the state
>>> of the machine. If we can't trust the state of the machine, then it is
>>> dangerous to sync any of the file systems, because doing so could damage
>>> them with corrupt data.
>>
>> I'm right then. Number of panic()s should be minimum. Because
>> currently one error in one partition leads to corruption of other
>> immediately (providing they do writes often). I think that is not
>> acceptable. I just didn't make fsck, don't shoot me!

> The panics only happen when a problem is detected that should never
> happen.

First of all, they do happen. Believe me. I have seen them more than
ten times on non-faulty disks.

> In normal operation, certain operations on a file system may place it
> temporarily in an inconsistent state, but the data on the disk is
> changed in a particular order so that if the system crashes in the
> middle of an operation due to a power failure or system panic, the
> inconsistencies have certain, known properties such that these
> inconsistencies can be anticipated and repaired by fsck and the file
> system can be safely accessed even before the inconsistencies are
> repaired.

That is in theory. I would not have written to the list if there were
no problem at all.

> It is possible for a file system to sustain types of damage that are not
> anticipated in case of a power failure. If the disk does write caching,
> data is likely to be written to the platters in a different order than
> the file system code expects, so a power failure during a sequence of
> writes may result in a partial set of writes that put the file system in
> a corrupt state that it is not possible to automatically repair. It is
> also possible for the disk to corrupt data other than what is being
> written.

Why can't the file system be repaired on the fly? Is it a limitation of
the file system?
Why can NTFS, say, repair itself without a blue screen or a disk check?

> The other file systems will be marked as dirty, but they should not be
> corrupt. If softupdates is in use, the only inconsistency should be
> that some blocks and/or inodes may be marked as allocated when they are
> actually not in use. In this case, the background fsck is able to
> detect the inconsistency and mark these blocks and/or inodes as being
> free so that they can be reused.

You must agree that there is no guarantee that there will be no panic
before fsck has finished. In my opinion, what you describe is theory
that rests on probability. Am I right?
I see what you mean. If write caching is disabled, then softupdates
should work fine. I wonder, then, why write caching is left enabled by
default.
But I still see a problem with fsck. We can hardly afford 40+ minutes
of fsck each time. Background fsck is not a solution either (because
5.x is not stable and there is no guarantee of reliability).

>>>> Background fsck does not work in reality as well, because the system
>>>> can panic thousand times before errors are fixed.
>>
>>> It might be a good idea to force a foreground fsck if the system panics
>>> before a background fsck has marked a dirty filesystem clean.
>>
>> What I mean is there is no point having background fsck which can lead
>> to corruption of all system partitions. Explanation: there is no
>> guarantee that panic will not occur before fsck is done; that panic
>> leads to reboot without other filesystems sync, so it'll lead to
>> their corruption.

> If all file systems except one were initially in a valid and consistent
> state and one file system had some sort of damage that caused a system
> panic, they would all be marked as dirty when the system crashed and
> rebooted. The only file system that could cause another panic would be
> the one that was originally corrupt.
> The only possible inconsistencies
> in all the other file systems would be those that can be repaired by a
> background fsck, and accessing these file systems before they have been
> marked as clean by the background fsck should not result in a panic.

"Should not" is the key phrase.

> There have been bugs that caused system panics when a file system that
> is undergoing a background fsck has a lot of write activity before the
> fsck operation finishes. These types of bugs should be tracked down and
> fixed, though this can be difficult. A system panic in this case makes
> it *easier* to find the bug. The sooner the system detects a problem
> and panics, the closer the panic and the debug information that it
> produces is to the actual software bug. If the file system code just
> ignored the inconsistencies and tried to keep running, it is quite
> possible that the file system would be totally trashed and all of its
> data lost.

I know what panics are. I am only saying that a panic should be avoided
whenever possible. Any panic on a busy server leads to some loss. Even
if a file system is checked after reboot, some files can be lost. I'm
sure it is possible to take the kind of repair actions fsck takes at
the moment an error is found.

BTW, newfs -g 375000 -h 8000 followed by mkdir dir1 on the newly
created partition can cause a panic as well :)

--
/ Pavel Merdine
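
For readers following the write-ordering argument quoted above, here is a toy sketch of it in Python. It is not FreeBSD code and does not model softupdates itself: the two-step "create a file" sequence, the names WRITES, classify and crash_states, and the three state labels are all assumptions made up for illustration. It only shows why a crash after in-order writes can leave at worst a leaked inode (the kind of thing background fsck frees), while a disk cache that reorders writes can also leave a directory entry pointing at an inode that was never written.

```python
from itertools import permutations

# Hypothetical two-step file creation, in the order the file system issues it:
# first the inode is written and marked allocated, then the directory entry.
WRITES = ["inode_allocated", "dir_entry"]

def classify(on_disk):
    """Classify the on-disk state left behind by a crash (toy labels)."""
    has_inode = "inode_allocated" in on_disk
    has_entry = "dir_entry" in on_disk
    if has_entry and not has_inode:
        return "corrupt"      # entry refers to an inode that never reached disk
    if has_inode and not has_entry:
        return "leak"         # allocated but unreferenced; fsck can free it
    return "consistent"       # nothing written yet, or everything written

def crash_states(order):
    """States reachable if power is lost after any prefix of `order` is on disk."""
    return {classify(set(order[:n])) for n in range(len(order) + 1)}

# Writes reach the platters in the order they were issued.
ordered = crash_states(WRITES)

# A write cache may flush the same writes in any order.
reordered = set().union(*(crash_states(p) for p in permutations(WRITES)))

print("ordered writes:  ", sorted(ordered))
print("reordered writes:", sorted(reordered))
```

In this model the in-order case never reaches the "corrupt" state and the reordered case can, which is the distinction drawn in the quoted text between damage fsck anticipates and damage it does not. The model says nothing about kernel bugs, which is the other half of the disagreement in this thread.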