From owner-freebsd-fs Tue Mar 18 14:43:57 2003
Message-ID: <3E77A0C0.B1BEA2D7@mindspring.com>
Date: Tue, 18 Mar 2003 14:42:08 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Brad Knowles
Cc: Bakul Shah, Julian Elischer, FreeBSD current users, fs@FreeBSD.ORG
Subject: Re: Anyone working on fsck?
References: <200303180459.XAA15021@goliath.cnchost.com> <3E76C15F.D3C7F9A7@mindspring.com>
Sender: owner-freebsd-fs@FreeBSD.ORG

Brad Knowles wrote:
> At 10:49 PM -0800 2003/03/17, Terry Lambert wrote:
> > Yes, I know. I'm aware. He has a lot of data to transfer in
> > from the disk in order to do the reverse lookup, with that much
> > data on the disk.
>
> I'm confused. In situations where you need to do reverse
> lookups, don't you normally tend to create doubly-linked lists, or
> other structures that you can traverse quickly and efficiently?

Yes and no.

"Yes", if that usage is typical. "No", if it's so extraordinary
that it's to the point of "infinitesimally small". This is a case
of optimizing the success code path at the expense of the failure
code path, and that is not only accepted practice, it's good
practice.

The specific issue here is that the CG bitmaps normally exist to
give layout hints to the block allocator, which would otherwise
need a huge amount of traversal in order to find free blocks on
its own. They exist, in other words, to find free blocks fast.

The problem happens when the data in a bitmap is stale; but it can
only be stale in one particular way: when a block is freed back to
the FS, the last operation which occurs, in the ordered update of
the metadata, is marking it free in the CG bitmap itself. So the
failure case from which recovery proceeds is "a failure to mark a
block which is otherwise unallocated as unallocated in the CG
bitmap".

What this boils down to is looking for entries your table
*shouldn't* contain: after recovery from a catastrophic failure,
the only thing that can be out of date is an entry in the table
that shouldn't be there. Meanwhile, when performing allocations,
you are looking for the 0 bits... the entries which are *not*
there.

Because of this, it's safe to go through and use the FS, and make
new allocations, since the table can only have erroneously set
bits... meaning the worst the allocator can do is skip over a
block that is actually free. You will *never* make the mistake of
allocating a block that's *already* allocated.
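(To belabor the point with a toy example: the following is a minimal
C sketch, with made-up names -- it is not the real FFS allocator --
using the convention that a set bit means "allocated". Since the
allocator only ever considers clear bits, a bit that is wrongly left
set after a crash can at worst hide a free block until fsck clears
it; it can never cause an in-use block to be handed out again.)

    #include <stdint.h>
    #include <stdio.h>

    #define NBLOCKS 1024

    static uint8_t cg_bitmap[NBLOCKS / 8];  /* 1 = allocated, 0 = free */

    static int
    bit_isset(int b)
    {
            return (cg_bitmap[b >> 3] & (1 << (b & 7)));
    }

    static void
    bit_set(int b)
    {
            cg_bitmap[b >> 3] |= (1 << (b & 7));
    }

    /*
     * Hand out the first block whose bit is clear, or -1 if none.
     * A stale "allocated" bit is simply skipped: it costs space
     * until fsck clears it, never correctness.
     */
    static int
    alloc_block(void)
    {
            int b;

            for (b = 0; b < NBLOCKS; b++) {
                    if (!bit_isset(b)) {
                            bit_set(b);
                            return (b);
                    }
            }
            return (-1);
    }

    int
    main(void)
    {
            bit_set(0);     /* pretend block 0's free was lost in a crash */
            printf("allocated block %d\n", alloc_block());  /* 1, not 0 */
            return (0);
    }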
So the only thing you care about on recovery, which can, in theory,
take as much time as necessary to do its job, is clearing the bits
that aren't valid. Meanwhile, new allocations are occurring all the
time, since the FS in question is "live".

So what you do is bound the problem set by establishing a
"snapshot" of the FS, which, itself, adds considerable overhead,
and then you allow any allocations you want to continue in the
live FS. This works because the bits you want to clear will never
be validly set by the live FS: you can clear them at any time, and
it will be valid to clear them in both the snapshot and in the live
FS, without risking the integrity of the live FS.

Another way of bounding the problem would be to "read lock" the
CGs whose bitmaps you are currently examining, and redirect any
new allocations which attempt to fall in those locked regions.
There are a couple of downsides to that approach, too, but it can
be made to work.

In any case, whether you are operating on a live FS or a snapshot,
or performing a foreground fsck on a quiescent/RO/unmounted FS,
the problem is the same: you must select a CG, cache a copy of it
in memory, create a "scratch" copy, and then examine all block
allocations, everywhere; wherever they indicate a block allocated
in the cylinder group in hand, clear the bit in the scratch copy.
The set of bits you are left with is the set of bits that are not
actually allocated.

At this point, you lock the live FS version of the CG (if you are
not already operating on a read-locked CG *in* a live FS, of
course), AND the real CG with the NOT of the scratch copy, and use
that to replace the old bitmap... thus clearing the "phantom
allocations". Then you go on to the next CG bitmap.

Make sense now?

> > He could change the code so that everything that wasn't in
> > use was zeroed. That would help, since then he could just create
> > a second CG map in an empty file by traversing all inodes and
> > indirect blocks for non-zero values, not caring if the indirect
> > blocks were referenced by inodes. Assuming indirect blocks and
> > data blocks were distinguishable.
> >
> > Like I said, there are a lot of FS changes that could help.
>
> I'm not sure I understand how this would be a filesystem change.
> Since this second CG map would be used only during fsck and not
> otherwise present on the system, couldn't you just throw it away when
> you were done, resulting in a filesystem that was structurally
> unchanged from before?

Newfs would have to do a heck of a lot more writing than before.
Freeing would have to write zeroed information. Data blocks that
might be used as indirect blocks would have to be zeroed. Etc..

Basically, you would eat the penalty as a small additional overhead
during normal operation, to reduce the cost of recovery by one
order, in case of failure. For a really, really high failure cost
at a given known probability of failure, this cost might amortize
out as cheaper, in terms of overall operational cost.

> > The *reason* it doesn't matter in the CG bitmap case, is that a
> > bitmap can only tell "allocated" vs. "unallocated"; there is no
> > third state that means "I haven't checked this bit yet". [ ... ]
>
> Seems to me that you could easily solve this with a second CG map
> (as you proposed above), that starts off initially zeroed for all
> blocks, and as you go through the "normal" CG maps, you turn on all
> the corresponding bits in the corresponding second map as you touch
> the blocks that are referenced.

No; it's more tricky than that; see my example above, sketched in
code below.
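(To make the example concrete: here is roughly what one pass over a
single cylinder group looks like. This is a sketch with hypothetical
names and flattened data structures, not the actual fsck_ffs code.)

    #include <stdint.h>
    #include <string.h>

    #define CG_NBLOCKS      8192            /* blocks per CG (example) */
    #define MAPBYTES        (CG_NBLOCKS / 8)

    struct cgmap {
            int64_t cg_start;               /* first block covered by this CG */
            uint8_t cg_bits[MAPBYTES];      /* 1 = allocated, 0 = free */
    };

    /*
     * One recovery pass over a single cylinder group.  next_ref() is a
     * stand-in for the traversal of every inode and indirect block in
     * the FS; it yields each referenced block number and returns 0 when
     * the traversal is exhausted.
     */
    static void
    check_one_cg(struct cgmap *cg, int (*next_ref)(int64_t *blkno))
    {
            uint8_t scratch[MAPBYTES];
            int64_t blkno;
            int i;

            /* The scratch copy starts as a copy of the CG map, not zeros. */
            memcpy(scratch, cg->cg_bits, MAPBYTES);

            /* Each block reference landing in this CG clears its scratch bit. */
            while (next_ref(&blkno)) {
                    int64_t b = blkno - cg->cg_start;

                    if (b >= 0 && b < CG_NBLOCKS)
                            scratch[b >> 3] &= ~(1 << (b & 7));
            }

            /*
             * Whatever is still set in scratch was marked allocated but is
             * referenced by nothing: the phantom allocations.  With the CG
             * locked, clear them by ANDing the live map with the complement
             * of the scratch copy.
             */
            for (i = 0; i < MAPBYTES; i++)
                    cg->cg_bits[i] &= ~scratch[i];
    }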
Your scratch copy has to start with all the bits set that are
already set in the CG bitmap, not with all zeros.

The reason for this is complicated, but what it boils down to is
that you need to be able to make additional allocations, and you
need to be able to tell new allocations -- ones which occurred as a
result of operations in files/areas you have already traversed in
the current pass -- apart from the stale bits you are trying to
clear. Otherwise, you can't just read-lock the CG bitmap, you must
write-lock it as well. Using the snapshot approach is one way of
avoiding this issue.

> This should be a single linear pass, and then you'd be done. If
> you had one I/O process and multiple worker processes actually
> scanning the CG maps and updating the second copy, this should just
> absolutely *fly*.
>
> What am I missing here?

The fact that for a 1TB FS, you don't have enough virtual address
space to hold all the CG bitmaps, all the scratch memory, and the
mapped buffers of the FS data you are traversing in your head at
once? 8-) 8-).

Even if you did, you'd end up paging, and you are actually worse
off doing that than you are doing multiple linear passes, after
breaking the CGs into blocks of CGs to check in linear runs.
That's why I suggested an MADV_WIRE for mapping memory to contain
the CG bitmaps you are interested in perusing per pass.

The reason you are worse off comes down to the fact that, on a
linear traversal of disk data which is so much larger than your
available cache, you will *never* have a case where the block you
are interested in looking at is already in cache.

One of the "sneaky" ways of dealing with this is to use an
"elevator" algorithm (see the sketch below): for one CG set, scan
forward through the FS; for the next, scan backward. This means
you will get cache effects, since the data you examine first will
be the data you just examined, until you run out of LRU, and then
you degrade to having to read all the data in.

An even more "sneaky" way would be to *know* your LRU list length,
and start a *forward* scan from there. That would be 2X as
effective, since you would get a 100% cache hit, at least for the
length of the LRU, until you, once again, end up in the degenerate
case. This is one of the reasons VMS had the concept of being able
to know both your "working set quota" and your "current working
set size".

Another "sneaky" approach, which I've already suggested, is to
reduce the size of the working set you care about, by increasing
the locality between "random" FS data and the CG bitmap, so that
any allocations you care about will be in a smaller portion of the
disk. This is automatic in the case of "make smaller partitions";
it requires a change in disk layout policy otherwise, and that
change could damage overall performance, in the same way "growfs"
does, by increasing regional fragmentation. Even so, there's no
way to keep it from breaking down for very large files which span
a (for lack of better terminology) "cylinder group group" boundary.
Even a "sliding group window" would only save you another 50% of
the time (assuming average locality is pinned to the inode
location).

It's a pretty intractable problem with the existing FFS structure;
you can get minor performance wins by holding multiple CGs in hand
to do the job (maybe all of them, if the FS is small enough, which
would make it one pass), but for a very large FS, you are pretty
much toast, unless you are willing to change the FS layout *some*.
I rather expect that UFS2 will end up changing layout, as people
start using it for very large filesystems.
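(For concreteness, the "elevator" scheduling mentioned above is
nothing more than alternating the direction of each batch's
traversal. A trivial sketch, again with hypothetical names; the
per-CG work would be what was sketched earlier:)

    /*
     * "Elevator" scheduling of the per-CG passes: walk the FS data
     * forward for one batch of CGs, backward for the next, so each new
     * pass starts by re-reading the blocks the previous pass just
     * finished with, and gets some cache hits before falling back to
     * cold reads.
     */
    static void
    check_cg_batch(int first_cg, int ncg, int scan_backward)
    {
            /* ... one linear traversal of the FS data, in the given
               direction, doing the per-CG scratch work for the CGs
               [first_cg, first_cg + ncg) ... */
            (void)first_cg; (void)ncg; (void)scan_backward;
    }

    static void
    check_all_cgs(int total_cgs, int cgs_per_pass)
    {
            int cg, backward = 0;

            for (cg = 0; cg < total_cgs; cg += cgs_per_pass) {
                    int n = total_cgs - cg;

                    if (n > cgs_per_pass)
                            n = cgs_per_pass;
                    check_cg_batch(cg, n, backward);
                    backward = !backward;   /* reverse direction each pass */
            }
    }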
Actually, thinking about it, you could bound it to a single non-CG
bitmap metadata traversal in all recovery situations... if you were
willing to change FS layout, in a probably compatible way.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message