From: Terry Lambert <tlambert2@mindspring.com>
Date: Mon, 17 Mar 2003 19:41:34 -0800
To: Bakul Shah
Cc: Julian Elischer, FreeBSD current users, fs@FreeBSD.ORG
Subject: Re: Anyone working on fsck?

Bakul Shah wrote:
> UFS is the real problem here, not fsck.  Its tradeoffs for
> improving normal access latencies may have been right in the
> past, but not for modern big disks.  The seek time & RPM have
> not improved very much in the past 20 years, while disk
> capacity has increased by a factor of about 20,000 (and GB/$
> even more).  IMHO there is not much you can do at the fsck
> level -- you still have to visit all the cyl groups and what
> not.  Even a factor of 10 improvement in fsck means 36
> minutes, which is far too long.  Keeping track of 67M to 132M
> blocks and (assuming an avg file size of 8k to 16k) something
> like 60M to 80M files takes quite a bit of time when you are
> also seeking all over the disk.

Sorry, but the track-to-track seek latency optimizations you are
referring to are turned off, given the newfs defaults, and have been
for a very long time.  Please see revision 1.7 of
/usr/src/sbin/newfs/newfs.c:

	revision 1.7
	date: 1995/02/05 08:42:31;  author: phk;  state: Exp;  lines: +13 -2

Basically, the problem is that for a BG fsck, it's not possible to
lock access on a cylinder group basis, and then there is a hell of a
lot of data on that drive.

What Julian needs is a journalling or log-structured FS.  Even that
will not save him in some failure cases, unless the hardware has a
CMOS data area that can be written with the failure cause, even if
it's a double panic (you must distinguish a power failure from a
kernel panic, since kernel panics result from corrupt kernel data,
which means you must check all files).

> When you have about 67M (2^26) files, ideally you want to
> *avoid* checking as many as you can.  Given access times, you
> are only going to be able to do a few hundred disk accesses
> at most in a minute.  So you are going to have only a few
> files/dirs that may be inconsistent in case of a crash.  Why
> not keep track of that somehow?

The BG fsck code specifically *only* checks the CG bitmap
allocations.  Basically, this means that anything that might be using
the CG has to be checked to see if it references blocks in the
bitmaps.
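To make that cost concrete, here is a rough sketch (invented,
simplified types -- this is NOT the real fsck_ffs code or the real
FFS structures, and the read_inode()/read_indirect() helpers are
assumed) of what verifying even one cylinder group's free-block
bitmap implies: there is no reverse map from a block to the inode
that claims it, so every inode and every indirect block on the
filesystem has to be walked before the claimed set can be compared
against the on-disk bitmap.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define NDADDR	12		/* direct block pointers per inode */
#define NIADDR	3		/* indirect levels */

struct dinode {			/* hypothetical, trimmed-down inode */
	uint32_t di_db[NDADDR];	/* direct blocks */
	uint32_t di_ib[NIADDR];	/* single/double/triple indirect */
};

struct cg_map {			/* hypothetical cylinder group summary */
	uint32_t cg_firstblk;	/* first block covered by this CG */
	uint32_t cg_nblks;	/* number of blocks covered by this CG */
	uint8_t *cg_freemap;	/* on-disk free-block bitmap */
};

/* Raw metadata readers, assumed to exist elsewhere (not shown). */
struct dinode *read_inode(uint32_t ino);
void read_indirect(uint32_t blk, uint32_t *ptrs, size_t nptrs);

static bool
blk_in_cg(const struct cg_map *cg, uint32_t blk)
{
	return (blk >= cg->cg_firstblk &&
	    blk < cg->cg_firstblk + cg->cg_nblks);
}

/* Mark one claimed block in an in-core shadow bitmap for this CG. */
static void
claim(const struct cg_map *cg, uint8_t *shadow, uint32_t blk)
{
	uint32_t off;

	if (blk != 0 && blk_in_cg(cg, blk)) {
		off = blk - cg->cg_firstblk;
		shadow[off / 8] |= 1 << (off % 8);
	}
}

/* Recursively walk an indirect chain, claiming the blocks it holds. */
static void
walk_indirect(const struct cg_map *cg, uint8_t *shadow,
    uint32_t blk, int level, uint32_t nindir)
{
	if (blk == 0)
		return;
	claim(cg, shadow, blk);		/* the indirect block itself */

	uint32_t ptrs[nindir];
	read_indirect(blk, ptrs, nindir);
	for (uint32_t i = 0; i < nindir; i++) {
		if (level > 1)
			walk_indirect(cg, shadow, ptrs[i], level - 1, nindir);
		else
			claim(cg, shadow, ptrs[i]);
	}
}

/*
 * The expensive part: to check even ONE cylinder group, every inode
 * in the filesystem must be visited, because any of them may hold
 * blocks that land inside this CG.
 */
void
check_one_cg(const struct cg_map *cg, uint8_t *shadow,
    uint32_t ninodes, uint32_t nindir)
{
	for (uint32_t ino = 0; ino < ninodes; ino++) {
		struct dinode *dp = read_inode(ino);

		if (dp == NULL)
			continue;
		for (int i = 0; i < NDADDR; i++)
			claim(cg, shadow, dp->di_db[i]);
		for (int i = 0; i < NIADDR; i++)
			walk_indirect(cg, shadow, dp->di_ib[i], i + 1, nindir);
	}
	/* ...then compare 'shadow' against cg->cg_freemap for mismatches. */
}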
The easiest way to do this is to provide some sort of domain control,
so that you limit the files you have to check to a smaller set of
files per CG, so that you don't have to check all the files to check
a particular CG -- and then preload the CG's for the sets of files
you *do* have to check.

Another way of saying this is... don't put all your space in a single
FS.  8-).

The problem with having to (effectively) read every inode and direct
block on the disk is really insurmountable, I think.

There are some FS design changes that could be made to "fix" the
issue, by partitioning CG's by locality, then providing locality
forward references as a separate bitmap, and forcing files to stick
to their locality until they can't (basically, the AIX GFS PP and LP
volume management trick), but that would require a layout change and
an additional dependency insertion.

Too bad the soft updates implementation isn't a generic graph
dependency mechanism, with node/node dependency resolvers that get
registered for edges, or you could do this easily (as well as
implying edges between stacking layers, and exporting a transaction
interface to user space).

> Typically only a few cyl grps may be inconsistent in case of
> a crash.  Maybe some info about which cyl groups need to be
> checked can be stored so that brute force checking of all
> grps can be avoided.

This would work well... under laboratory test conditions.

In the field, if the machine has a more general workload (or even a
heterogeneous workload, but with a hell of a lot of files, like a
big mail server), this falls down, as the number of bits marking
"unupdated cylinder groups" becomes large.

...AND...

The problem is still that you must scan (practically) everything on
the disk to identify the inode or indirect block that references the
block in the cylinder group in question, and THAT's the problem.  If
you knew a small set of CG's that needed checking, vs. "all of them",
it would *still* bust the cache, which is what takes all the time.

Assume, on average, each file goes into indirect blocks.  That's
basically an N^3 algorithm to find a given block for a given CG; and
the problem is the cache busting.  Reducing this to N^2*M, where
M < N, or even M << N, will not significantly speed up the process;
e.g.:

	,-------.*****
	|+++++++|*****		* Where you miss
	|+++,--------.		+ Where you have to repeat
	`---|        |
	****|        |
	****|        |
	****`--------'

...So even if you pick your repetition area right (you could have
accidentally picked the big one instead of the small one, if the
files were very large, or there was a very large number of small
ones), you don't save a lot.

> Typically a file will be stored in one or a small number of
> cyl groups.  If that info is stored somewhere it can speed
> things up.

The problem is reverse mapping a bit in a CG bitmap to the file that
references it... 8^p.

> Extent based allocation will reduce the number of indirect
> blocks.  But maybe this is not such a big issue if most of
> your files fit in a few blocks.

Extents would help considerably, since you could target specific
cylinder group bitmaps, cached in RAM, by section, for processing,
and the number of references you had to make to know whether a given
inode had an extent covering a region controlled by a given bitmap
would be reduced.  However, this would only be a win if you knew, a
priori, that you were storing mostly very big files.  Even so, you
are still scanning every inode on the system to get the extent list
in order to get the CG intersection to verify a given bit.
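For illustration only, that extent/CG intersection test might look
something like the following (hypothetical types throughout -- UFS
does not actually store extents, so everything here is invented):
the per-inode cost drops to a few range comparisons per extent
instead of a walk of every block pointer, but the scan over every
inode remains.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

struct extent {			/* hypothetical: contiguous run of blocks */
	uint32_t e_start;	/* first filesystem block of the run */
	uint32_t e_len;		/* number of blocks in the run */
};

struct cg_range {		/* block range controlled by one CG bitmap */
	uint32_t cg_firstblk;
	uint32_t cg_nblks;
};

/* Does this extent overlap the block range owned by the CG? */
static bool
extent_hits_cg(const struct extent *e, const struct cg_range *cg)
{
	uint32_t e_end = e->e_start + e->e_len;			/* exclusive */
	uint32_t cg_end = cg->cg_firstblk + cg->cg_nblks;	/* exclusive */

	return (e->e_start < cg_end && cg->cg_firstblk < e_end);
}

/*
 * True if any of the inode's extents fall inside the CG.  The
 * per-inode cost is O(number of extents), not O(number of blocks),
 * which is the saving extents would buy; the full inode scan stays.
 */
bool
inode_touches_cg(const struct extent *exts, size_t nexts,
    const struct cg_range *cg)
{
	for (size_t i = 0; i < nexts; i++)
		if (extent_hits_cg(&exts[i], cg))
			return (true);
	return (false);
}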
> Anyway, support for all of these has to be done in the
> filesystem first before fsck can benefit.
>
> If instead you spend time "optimizing" just fsck, you will
> likely make it far more complex (and potentially harder to
> get right).

Multithreading fsck would be an incredibly bad idea.  Being able to
control the locality of the search process is your primary means of
controlling the amount of cache busting that can happen (see the
"squares" diagram, above).  The win you get, if any, has to come
from preloading CG bitmaps into RAM, effectively, so that the
process doesn't end up swapping.  Multithreading this destroys your
locality of reference.

The 10 second delay in starting fsck's on different FS's right now
is probably well past time to be removed.  The "-l" option would
have been useful here, but it's been removed.  If the disks are on
the same spindle, you still destroy locality by seeking between
regions while it's going.

BTW: someone needs to update the man page, particularly the
"SYNOPSIS".

If this is a BG fsck on an FS, you might want to consider something
like an "MADV_WIRE" argument to madvise(2), for the CG bitmaps in
core, so that the pages don't end up being forced in and out of swap
through normal cache effects.  For this to work well, you would want
to make it a root-only option, and to restrict its size to a
fraction of physical memory.

Personally, I recommend using a different FS, if you are going to
create a honking big FS as a single partition.  8-(.

-- Terry