Date: Fri, 30 Mar 2001 10:47:46 -0800
From: Kirk McKusick <mckusick@mckusick.com>
To: Terry Lambert <tlambert@primenet.com>
Cc: arch@freebsd.org
Subject: Re: Background Fsck
Message-ID: <200103301847.KAA10189@beastie.mckusick.com>
In-Reply-To: Your message of "Fri, 30 Mar 2001 12:52:29 GMT." <200103301252.FAA06540@usr05.primenet.com>
    From: Terry Lambert <tlambert@primenet.com>
    Subject: Re: Background Fsck
    To: mckusick@mckusick.com (Kirk McKusick)
    Date: Fri, 30 Mar 2001 12:52:29 +0000 (GMT)
    Cc: arch@FreeBSD.ORG

    I have a question about the safety of this approach:

    You don't seem to be able to distinguish between:

    1)  Hardware crash without data corruption - e.g. power failure

    2)  Hardware crash with data corruption - e.g.
        disk/controller/memory failure

    3)  Software crash without data corruption - e.g. resource
        availability failure, or panic as a result of coding error

    4)  Software crash with data corruption - e.g. a panic resulting
        from kernel data becoming corrupt, with an unknown interval
        preceding the crash in which some of these structures might
        have had FS data in them, or such a crash in the FS code
        path itself, where the data corruption was a primary effect
        instead of a side effect

    It seems to me that background checking is only safe in cases 1
    and 3, and (the current California power grid reliability
    notwithstanding), that these cases are not provably the
    statistically most common cases.

    The reason Whistle did not do this work earlier was that we were
    unable to address this concern adequately without non-volatile
    RAM to store the failure reason and the disk write cache status.

    Since panic reasons are mathematically indistinguishable in the
    limit, we were also unable to address differentiating 3 and 4,
    without placing the FS and I/O subsystem into a separate
    protection domain. Even doing this, we would only gain some
    statistical protection against #4, which means the only value
    which we could add was to case #1, were we to invest in the
    additional hardware.

    In other words, it was not speed of fsck which drove Whistle to
    soft updates.

    My question is this: how were you able to address these issues
    in your implementation?

                                        Terry Lambert
                                        terry@lambert.org
    ---
    Any opinions in this posting are my own and not those of my
    present or previous employers.

In general, your observations are correct.
In the current framework it is not possible to guarantee that you can always sort out which of the four cases above you are in and then take the correct action. Whistle needed to make those sorts of guarantees, and consequently could not fall back to something like background fsck. I do not purport to make this sort of guarantee. I say only that I will do the right thing in cases #1 and #3 and that I will do my best to detect that I am in cases #2 and #4 and exit gracefully after logging a message saying that an unexpected inconsistency has arisen and that manual intervention is needed. For systems where this is not good enough, the system administrator has the option of forcing foreground checks or not using soft updates at all.

	Kirk

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message
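
The policy described in the reply above can be sketched roughly as follows. This is only an illustration of the decision being described, not code from fsck or the FreeBSD kernel: the names `check_input`, `check_action`, and `decide_check` are hypothetical, and the sketch assumes that background checking is only attempted on file systems running soft updates and that the background pass can report whether it met an inconsistency it did not expect.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical result of the decision; not fsck's real interface. */
enum check_action {
    CHECK_FOREGROUND,     /* run a full check before the file system is used */
    CHECK_BACKGROUND_OK,  /* background pass saw only what it expected */
    CHECK_NEEDS_MANUAL    /* unexpected inconsistency: stop, ask for help */
};

/* Hypothetical inputs to the decision (all assumptions for illustration). */
struct check_input {
    bool force_foreground;          /* administrator asked for foreground checks */
    bool soft_updates_enabled;      /* file system is using soft updates */
    bool unexpected_inconsistency;  /* background pass found damage it cannot explain */
};

/*
 * Cases #1 and #3 (crash without corruption) end in CHECK_BACKGROUND_OK.
 * Cases #2 and #4 (crash with corruption) cannot be identified up front;
 * they are caught only on a best-effort basis, when the background pass
 * notices an inconsistency it did not expect.
 */
enum check_action
decide_check(const struct check_input *in)
{
    /* The administrator can opt out of background checking entirely. */
    if (in->force_foreground || !in->soft_updates_enabled)
        return CHECK_FOREGROUND;

    if (in->unexpected_inconsistency) {
        /* Exit gracefully after logging; do not attempt a repair here. */
        fprintf(stderr,
            "unexpected inconsistency; manual intervention is needed\n");
        return CHECK_NEEDS_MANUAL;
    }

    return CHECK_BACKGROUND_OK;
}
```

Note that nothing in the sketch classifies the crash itself. That matches the point conceded in the reply: the four cases cannot always be told apart, so detection of #2 and #4 is best effort, and sites that need a stronger guarantee take the first branch.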