Date: Fri, 27 Sep 2002 11:31:08 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Alexander Leidinger <Alexander@Leidinger.net>
Cc: freebsd-current@FreeBSD.ORG
Subject: Re: Journaled filesystem in CURRENT
Message-ID: <3D94A3EC.160D5D71@mindspring.com>
References: <200209251319.g8PDJYoD047918@ib.com.ua> <20020925111232.B3686@Odin.AC.HMC.Edu> <20020926111949.5c0da160.Alexander@Leidinger.net> <20020926090325.A24614@zardoc.esmtp.org> <3D93459B.E4405568@mindspring.com> <20020926210947.5d5fdd45.Alexander@Leidinger.net> <3D9362C1.CFA66F90@mindspring.com> <20020927114503.7c839b9b.Alexander@Leidinger.net>

Alexander Leidinger wrote:
> > > Sorry, I don't get it. Can you please be more verbose?
> >
> > This has been discussed to death before, and Kirk McKusick has
> > already posted the definitive post on the topic to FreeBSD-FS.
>
> Keywords (besides SO and Kirk McKusick)/timeframe/message ID/URL?

"McKusick AND fsck" finds the following in the FreeBSD-arch archives:

<http://www.FreeBSD.org/cgi/getmsg.cgi?fetch=278083+282035+/usr/local/www/db/text/2001/freebsd-arch/20010401.freebsd-arch>

> > Note that recent disk drives (I *will not* call them "modern")
> > will potentially trash sectors, if a power failure occurs during
> > writes.
>
> They don't have a power reservoir large enough to write the entire
> content of their cache to disk? Damn. But I shouldn't wonder, the actual
> economy is the result of letting marketing people make decisions.

No, they do not.  At one time Quantum manufactured a 7200 RPM drive
that could do a write/seek/write, using the rotational energy of the
disk, but they quit manufacturing these.  The main reason was that
"multimedia" disk drives valued the ability to store quickly over
the ability to store correctly (e.g. no thermal recalibration, etc.).

> > One way to handle Scott Dodson's problem (for example) is to add
> > a "softcheck started" flag in the superblock, so that if a crash
> > occurs during the abbreviated check, then the full check is done
>
> I asked Kirk a while ago what happens if we have a power failure while
> we do a bg-fsck. He told me that this isn't harmful, the actual code
> DTRT.

I'm aware of this posting as well.  The issue here is the answer
to the question "What happens when I fail in such a way as to need
a full fsck?".

If your automated default is to do a background fsck, your kernel
will potentially panic as a result of you running on the FS with
only a background fsck in progress (the panic would be caused by
normal FS operations in progress, not by the background fsck
operation itself).

After the panic, if you are in a "background fsck mode", you come
up and you panic again, for the same reason, because the underlying
condition of the FS is not related to relatively harmless
overallocations in the cylinder group bitmap, which is the base
assumption the background fsck makes.

Thus, to "correct" Scott's problem, you need to mark the start and
end of a background fsck cycle, such that if there is a failure in
the middle of a background fsck, when the system is rebooted, the
failure is dealt with via a full non-background fsck, if the disk
is in the state "background fsck started but not yet completed".

To deal with intermittent power outages as the source of failure,
you could, as a tunable parameter, set a count of the number of
times a fatal failure must occur before a background fsck is no
longer an option (e.g. 3 failures during a background fsck, and you
go to a foreground fsck).

I have quoted "correct" here, because it's really a workaround,
not a fix, for the underlying problem.

> > The write occurs, or it does not.  The journal entry timestamp
> > gets updated after the write completes, or it does not.
> >
> > Thus, you can always recover a JFS to a consistent state almost
> > instantaneously, simply by finding the most recent valid journal
> > entry timestamp, and ignoring anything else -- as long as data is
> > journalled, and not just metadata.
>
> I'm with Matthias Schündehütte here. SO writes the data and then it
> writes the metadata. So either the just written blocks get referenced by
> metadata or it does not. So we can recover to a consistent state almost
> instantaneously too.

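
The start/end marker and failure-count tunable described above (before the quoted journal discussion) can be sketched roughly as follows. This is only an illustration of the boot-time decision, not FreeBSD code: the `Superblock` class, its field names, the function names, and `BGFSCK_MAX_FAILURES` are all hypothetical, and a real implementation would have to persist the marker to disk before the check starts.

```python
from dataclasses import dataclass

# Hypothetical tunable: how many crashes during a background fsck are
# tolerated before a foreground fsck is forced (the post's example is 3).
BGFSCK_MAX_FAILURES = 3


@dataclass
class Superblock:
    """Toy stand-in for on-disk state; the field names are illustrative only."""
    bgfsck_in_progress: bool = False  # "background fsck started but not yet completed"
    bgfsck_failures: int = 0          # crashes seen while the marker was set


def choose_fsck_mode(sb, max_failures=BGFSCK_MAX_FAILURES):
    """Decide at boot whether a background fsck is still an option."""
    if sb.bgfsck_in_progress:
        # The previous boot died in the middle of a background check.
        sb.bgfsck_failures += 1
        if sb.bgfsck_failures >= max_failures:
            return "foreground"
    return "background"


def begin_background_fsck(sb):
    """Mark the start of a background fsck cycle."""
    sb.bgfsck_in_progress = True


def complete_fsck(sb):
    """Mark the end of a successful check (background or foreground)."""
    sb.bgfsck_in_progress = False
    sb.bgfsck_failures = 0
```

Setting `max_failures=1` gives the strict variant, where any failure in the middle of a background fsck leads to a full check on the next boot; larger values tolerate intermittent power outages.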
Recovering to a consistent state is uninteresting.

Let me explain: you do not want to recover to a consistent state,
per se, you want to recover to *the* consistent state that the FS
*would have been in*, had the failure not occurred, and the
operations which can be rolled forward *had been successful*, and
the operations which can not be rolled forward *had not been
attempted*.

This is the same argument against recovery of an async mounted FS,
following a crash: the number of outstanding operations minus one is
the number of potential operations that left the disk in its current
end state.  Thus, if operations are not ordered, the number of
potential start states grows exponentially.

For example, say I had N related operations in progress; therefore,
the number of consistent states that could have led to the state
of the disk at the time the recovery is attempted is (2^(N-1)).

For ordered operations, N is always 1, and the result is always
(2^0), or (1) -- therefore it is always possible to recover to
*the* consistent state, rather than *a* consistent state.

Only recovering to *a* consistent state loses implied metadata
(e.g. related updates to record and index files in a relational
database, etc.).  This is unacceptable.

> The only problem is: when you delete some files,
> and the metadata (directory entries) is written, but the free blocks
> information isn't updated yet. Then you have to use (bg-)fsck to correct
> the free block information. But if you need to go online as fast as
> possible with a consistent FS SO doesn't holds you back from this.

It's more about having a *correct* FS, than a *consistent* FS.  An
*empty* FS is consistent, after all, so the fastest recovery possible
would be to newfs the thing, right?  8-) 8-).

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message
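
The (2^(N-1)) count in the message above can be checked by brute force. The sketch below takes the post's counting at face value: it treats each of the N-1 operations as either applied or not applied and enumerates every combination. The function name `candidate_prior_states` is made up for this illustration.

```python
from itertools import product


def candidate_prior_states(n_outstanding):
    """List every applied/not-applied pattern for the N-1 operations the post counts."""
    if n_outstanding < 1:
        raise ValueError("need at least one outstanding operation")
    return list(product((False, True), repeat=n_outstanding - 1))


# Ordered operations correspond to N = 1: exactly one candidate state.
for n in (1, 2, 4, 8):
    print(n, len(candidate_prior_states(n)))
```

For N = 1 there is a single candidate, which is the "ordered operations" case; each additional unordered operation doubles the number of states a recovery tool would have to choose between.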
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3D94A3EC.160D5D71>