From: Robert Watson
Date: Sat, 12 Jun 2004 15:55:40 -0400 (EDT)
To: Anthony Ginepro
Cc: current@freebsd.org, Kris Kennaway
Subject: Re: bg fsck and fs corruption
In-Reply-To: <20040612154224.GB895@renaissance.homeip.net>
List-Id: Discussions about the use of FreeBSD-current

On Sat, 12 Jun 2004, Anthony Ginepro wrote:

> > If you allow bgfsck to complete, does it eventually clean this up?
>
> I already had similar "corruptions" (as I never lost a file that way it
> isn't as terrible as a really corrupted file).

It's worth noting that the problem I'm describing actually isn't corruption
as defined by soft updates: it's consistent subject to the consistency model
of soft updates.  The problem is that it conflicts with reasonable user
expectation ("No, there really isn't anything in the directory, so don't
tell me it's not empty!").
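
One way to look at the mismatch described in this thread (rm reporting
"Directory not empty" while ls shows nothing) without modifying the file
system is to compare what a directory listing returns with what the
directory's inode metadata says.  The sketch below is only a diagnostic aid:
describe_directory() is a hypothetical helper, and the "2 + number of
subdirectories" link-count convention it reports is an assumption about
UFS-style file systems, not a statement about what went wrong on the
affected machine.

```python
import os
import stat
import tempfile


def describe_directory(path):
    """Report what a listing and the inode metadata each say about `path`.

    Read-only: this lists the directory and calls stat(), nothing else.
    """
    entries = sorted(os.listdir(path))  # listdir never returns "." or ".."
    info = os.stat(path)
    subdirs = [
        name for name in entries
        if stat.S_ISDIR(os.lstat(os.path.join(path, name)).st_mode)
    ]
    return {
        "entries": entries,
        "lists_as_empty": not entries,
        "link_count": info.st_nlink,
        "subdirs_seen": len(subdirs),
        # Assumption: "." and the parent's entry account for two links and
        # each subdirectory's ".." adds one more.  Other file systems may
        # count directory links differently.
        "links_expected_ufs_style": 2 + len(subdirs),
    }


if __name__ == "__main__":
    # Demonstration on a scratch directory, not on a damaged file system.
    with tempfile.TemporaryDirectory() as scratch:
        os.mkdir(os.path.join(scratch, "sub"))
        for key, value in describe_directory(scratch).items():
            print(f"{key}: {value}")
```

A directory that lists as empty but whose link count differs from the
expected value would be a candidate for the kind of disagreement discussed
here; confirming it still requires fsck.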

> A complete bgfsck doesn't clean this up, as it often chokes on this error
> (can't remember the exact error report, something like "SOFTDEP
> INCONSISTENCY").

This is a result of the assumptions of soft updates being violated.  There
are a few reasons this might happen:

(1) Bug in UFS/soft updates, resulting in things being sent to disk without
    correct dependency ordering (or corruption or whatever).

(2) Bug in the storage layer, be it GEOM, device driver, et al, which causes
    ordering requirements to be lost, or acknowledges a write request as
    complete when it's not.

(3) Bug in the hardware, such as acknowledging a write request as complete
    when it's not.

They're all serious issues, especially in the presence of system failure
(e.g., power failure, panic, etc).  Soft updates offers some nice efficiency
gains and fairly reasonable guarantees, but a lot of cheap PC hardware
completely fails to meet its requirements now (drives will lie and indicate
a change was committed to disk when it's really just in cache, for example).
That makes it a bit hard to track these down.  The cases I find most
interesting, though, are the ones where we know the system halted for a
reason that doesn't give the disks an excuse not to have eventually
committed everything to disk.  A panic in the network stack, for example, if
fail-stop, shouldn't result in corruption that can't be recovered from by
bgfsck.  And I've seen cases where bgfsck couldn't recover: since there's no
power off, and there's a long delay before reboot, it's unlikely either that
the disks/controllers are losing state, or that the state was flushed during
the soft reboot.

> However an fsck from single user always cleaned this.

Still highly undesirable, though, as it means the expected consistency on
the disk that soft updates relies on isn't present.  UFS and UFS-like file
systems (ext2fs, etc) aren't laid out in such a way that it's possible to be
highly tolerant of disk corruption.
The UFS implementation tolerates some sorts of flaws, but can panic (cycles
in the name space) or experience additional corruption for other flaws (such
as multiple files owning the same block).  Some of those corruption modes
are more likely than others in the presence of simple failures (power loss,
etc).  I had a conversation with Tom Van Vleck recently on the file system
used in Multics, which was capable of detecting and tolerating a broad range
of corruption, and there are some interesting ideas there I'd love to see in
a modern UNIX...

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research

> > >
> > > twinsun# rm -rf old
> > > rm: old/26422/usr/local/lib: Directory not empty
> > > rm: old/26422/usr/local: Directory not empty
> > > rm: old/26422/usr: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5/5.8.4/mach/auto/threads: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5/5.8.4/mach/auto: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5/5.8.4/mach: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5/5.8.4: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf: Directory not empty
> > > rm: old/26422/var/tmp: Directory not empty
> > > rm: old/26422/var: Directory not empty
> > > rm: old/26422: Directory not empty
> > > rm: old: Directory not empty
> > > twinsun# ls -l old/26422/usr/local/lib
> > > total 0
> > >
> > > bg fsck noticed the usual softdep problems, but did not report or fix
> > > the corruption:
> > >
> > > [...]
> > > Jun 12 07:38:47 twinsun fsck: /dev/da1c: INCORRECT BLOCK COUNT I=4381849 (4 should be 0) (CORRECTED)
> > > Jun 12 07:38:47 twinsun fsck: /dev/da1c: INCORRECT BLOCK COUNT I=4381850 (4 should be 0) (CORRECTED)
> > > Jun 12 07:38:47 twinsun fsck: /dev/da1c: INCORRECT BLOCK COUNT I=4381853 (4 should be 0) (CORRECTED)
> > > Jun 12 07:38:47 twinsun fsck:
> > >
> > > Note the lack of summary line.  I don't know if it was trying to log
> > > the more serious corruption but didn't because of a bug, or if it just
> > > didn't detect it.
> > >
> > > Kris
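
As an aside on the failure modes listed in the reply above (a layer that
acknowledges a write as complete when it is still only in cache), the toy
model below may help show why that breaks an ordering guarantee.  It is a
sketch under stated assumptions, not a description of any real driver or of
the soft updates code: ToyDisk, create_file() and is_consistent() are
hypothetical names, and the single rule modelled is "an inode must be on
disk before a directory entry refers to it".

```python
class ToyDisk:
    """Toy drive with an optional volatile write cache (illustration only)."""

    def __init__(self, honest):
        self.honest = honest  # True: acknowledge only once data is stable
        self.stable = {}      # blocks that survive a crash
        self.cache = []       # acknowledged writes that are NOT yet stable

    def write(self, block, data):
        """Accept a write and return True as the 'acknowledgement'."""
        if self.honest:
            self.stable[block] = data
        else:
            # The lying drive acknowledges while the data is only cached.
            self.cache.append((block, data))
        return True

    def destage_newest(self):
        """Move the most recently cached write to stable storage.

        Models a drive that empties its cache in its own preferred order.
        """
        if self.cache:
            block, data = self.cache.pop()
            self.stable[block] = data

    def crash(self):
        """Power loss: everything still in the volatile cache is gone."""
        self.cache.clear()


def create_file(disk):
    # The file system waits for each acknowledgement before issuing the
    # next write, so from its point of view the ordering is respected.
    disk.write("inode 7", "initialised")
    disk.write("dir entry 'foo'", "-> inode 7")


def is_consistent(stable):
    # The directory entry may only exist if the inode it names is on disk.
    return "dir entry 'foo'" not in stable or "inode 7" in stable


def run(honest):
    disk = ToyDisk(honest)
    create_file(disk)
    disk.destage_newest()  # no effect on the honest drive (empty cache)
    disk.crash()
    return is_consistent(disk.stable)


if __name__ == "__main__":
    print("honest drive consistent after crash:", run(honest=True))
    print("lying drive consistent after crash:", run(honest=False))
```

With the honest drive both blocks are stable before the crash, so the check
passes.  With the lying drive only the directory entry reaches stable
storage before power is lost, which is exactly the kind of state the
ordering rule was meant to rule out.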