From: Terry Lambert <tlambert2@mindspring.com>
Date: Mon, 17 Mar 2003 19:41:34 -0800
To: Bakul Shah
Cc: Julian Elischer, FreeBSD current users, fs@FreeBSD.ORG
Subject: Re: Anyone working on fsck?

Bakul Shah wrote:
> UFS is the real problem here, not fsck.  Its tradeoffs for
> improving normal access latencies may have been right in the
> past, but not for modern big disks.  The seek time & RPM have
> not improved very much in the past 20 years, while disk
> capacity has increased by a factor of about 20,000 (and GB/$
> even more).  IMHO there is not much you can do at the fsck
> level -- you still have to visit all the cyl groups and what
> not.  Even a factor of 10 improvement in fsck means 36
> minutes, which is far too long.  Keeping track of 67M to 132M
> blocks and (assuming an avg file size of 8k to 16k) something
> like 60M to 80M files takes quite a bit of time when you are
> also seeking all over the disk.

Sorry, but the track-to-track seek latency optimizations you are
referring to are turned off, given the newfs defaults, and have been
for a very long time.  Please see revision 1.7 of
/usr/src/sbin/newfs/newfs.c:

	revision 1.7
	date: 1995/02/05 08:42:31;  author: phk;  state: Exp;  lines: +13 -2

Basically, the problem is that for a BG fsck, it's not possible to
lock access on a cylinder group basis, and then there is a hell of a
lot of data on that drive.

What Julian needs is a journalling or log-structured FS.  Even that
will not save him in some failure cases, unless the hardware has a
CMOS data area that can be written with the failure cause, even if
it's a double panic (you must distinguish a power failure from a
kernel panic, since kernel panics result from corrupt kernel data,
which means you must check all files).

> When you have about 67M (2^26) files, ideally you want to
> *avoid* checking as many as you can.  Given access times, you
> are only going to be able to do a few hundred disk accesses
> at most in a minute.  So you are going to have only a few
> files/dirs that may be inconsistent in case of a crash.  Why
> not keep track of that somehow?

The BG fsck code specifically *only* checks the CG bitmap
allocations.  Basically, this means that anything that might be using
the CG has to be checked to see if it references blocks in the
bitmaps.
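To make that cost concrete, here is a rough sketch (invented,
simplified types -- this is NOT the real fsck_ffs code or the real
FFS structures, and the read_inode()/read_indirect() helpers are
assumed) of what verifying even one cylinder group's free-block
bitmap implies: there is no reverse map from a block to the inode
that claims it, so every inode and every indirect block on the
filesystem has to be walked before the claimed set can be compared
against the on-disk bitmap.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define NDADDR	12		/* direct block pointers per inode */
#define NIADDR	3		/* indirect levels */

struct dinode {			/* hypothetical, trimmed-down inode */
	uint32_t di_db[NDADDR];	/* direct blocks */
	uint32_t di_ib[NIADDR];	/* single/double/triple indirect */
};

struct cg_map {			/* hypothetical cylinder group summary */
	uint32_t cg_firstblk;	/* first block covered by this CG */
	uint32_t cg_nblks;	/* number of blocks covered by this CG */
	uint8_t *cg_freemap;	/* on-disk free-block bitmap */
};

/* Raw metadata readers, assumed to exist elsewhere (not shown). */
struct dinode *read_inode(uint32_t ino);
void read_indirect(uint32_t blk, uint32_t *ptrs, size_t nptrs);

static bool
blk_in_cg(const struct cg_map *cg, uint32_t blk)
{
	return (blk >= cg->cg_firstblk &&
	    blk < cg->cg_firstblk + cg->cg_nblks);
}

/* Mark one claimed block in an in-core shadow bitmap for this CG. */
static void
claim(const struct cg_map *cg, uint8_t *shadow, uint32_t blk)
{
	uint32_t off;

	if (blk != 0 && blk_in_cg(cg, blk)) {
		off = blk - cg->cg_firstblk;
		shadow[off / 8] |= 1 << (off % 8);
	}
}

/* Recursively walk an indirect chain, claiming the blocks it holds. */
static void
walk_indirect(const struct cg_map *cg, uint8_t *shadow,
    uint32_t blk, int level, uint32_t nindir)
{
	if (blk == 0)
		return;
	claim(cg, shadow, blk);		/* the indirect block itself */

	uint32_t ptrs[nindir];
	read_indirect(blk, ptrs, nindir);
	for (uint32_t i = 0; i < nindir; i++) {
		if (level > 1)
			walk_indirect(cg, shadow, ptrs[i], level - 1, nindir);
		else
			claim(cg, shadow, ptrs[i]);
	}
}

/*
 * The expensive part: to check even ONE cylinder group, every inode
 * in the filesystem must be visited, because any of them may hold
 * blocks that land inside this CG.
 */
void
check_one_cg(const struct cg_map *cg, uint8_t *shadow,
    uint32_t ninodes, uint32_t nindir)
{
	for (uint32_t ino = 0; ino < ninodes; ino++) {
		struct dinode *dp = read_inode(ino);

		if (dp == NULL)
			continue;
		for (int i = 0; i < NDADDR; i++)
			claim(cg, shadow, dp->di_db[i]);
		for (int i = 0; i < NIADDR; i++)
			walk_indirect(cg, shadow, dp->di_ib[i], i + 1, nindir);
	}
	/* ...then compare 'shadow' against cg->cg_freemap for mismatches. */
}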
The easiest way to do this is to provide some sort of domain control,
so that you limit the files you have to check to a smaller set of
files per CG, so that you don't have to check all the files to check
a particular CG -- and then preload the CG's for the sets of files
you *do* have to check.

Another way of saying this is... don't put all your space in a single
FS.  8-).

The problem with having to (effectively) read every inode and direct
block on the disk is really insurmountable, I think.

There are some FS design changes that could be made to "fix" the
issue, by partitioning CG's by locality, then providing locality
forward references as a separate bitmap, and forcing files to stick
to their locality until they can't (basically, the AIX GFS PP and LP
volume management trick), but that would require a layout change and
an additional dependency insertion.

Too bad the soft updates implementation isn't a generic graph
dependency mechanism, with node/node dependency resolvers that get
registered for edges, or you could do this easily (as well as
implying edges between stacking layers, and exporting a transaction
interface to user space).

> Typically only a few cyl grps may be inconsistent in case of
> a crash.  Maybe some info about which cyl groups need to be
> checked can be stored so that brute force checking of all
> grps can be avoided.

This would work well... under laboratory test conditions.

In the field, if the machine has a more general workload (or even a
heterogeneous workload, but with a hell of a lot of files, like a
big mail server), this falls down, as the number of bits marking
"unupdated cylinder groups" becomes large.

...AND...

The problem is still that you must scan (practically) everything on
the disk to identify the inode or indirect block that references the
block in the cylinder group in question, and THAT's the problem.  If
you knew a small set of CG's that needed checking, vs. "all of them",
it would *still* bust the cache, which is what takes all the time.

Assume, on average, each file goes into indirect blocks.  That's
basically an N^3 algorithm to find a given block for a given CG; and
the problem is the cache busting.  Reducing this to N^2*M, where
M < N, or even M << N, will not significantly speed up the process;
e.g.:

	,-------.*****
	|+++++++|*****		* Where you miss
	|+++,--------.		+ Where you have to repeat
	`---|        |
	****|        |
	****|        |
	****`--------'

...So even if you pick your repetition area right (you could have
accidentally picked the big one instead of the small one, if the
files were very large, or there was a very large number of small
ones), you don't save a lot.

> Typically a file will be stored in one or a small number of
> cyl groups.  If that info is stored somewhere it can speed
> things up.

The problem is reverse mapping a bit in a CG bitmap to the file that
references it... 8^p.

> Extent based allocation will reduce the number of indirect
> blocks.  But maybe this is not such a big issue if most of
> your files fit in a few blocks.

Extents would help considerably, since you could target specific
cylinder group bitmaps, cached in RAM, by section, for processing,
and the number of references you had to make to know whether a given
inode had an extent covering a region controlled by a given bitmap
would be reduced.  However, this would only be a win if you knew, a
priori, that you were storing mostly very big files.  Even so, you
are still scanning every inode on the system to get the extent list
in order to get the CG intersection to verify a given bit.
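For illustration only, that extent/CG intersection test might look
something like the following (hypothetical types throughout -- UFS
does not actually store extents, so everything here is invented):
the per-inode cost drops to a few range comparisons per extent
instead of a walk of every block pointer, but the scan over every
inode remains.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

struct extent {			/* hypothetical: contiguous run of blocks */
	uint32_t e_start;	/* first filesystem block of the run */
	uint32_t e_len;		/* number of blocks in the run */
};

struct cg_range {		/* block range controlled by one CG bitmap */
	uint32_t cg_firstblk;
	uint32_t cg_nblks;
};

/* Does this extent overlap the block range owned by the CG? */
static bool
extent_hits_cg(const struct extent *e, const struct cg_range *cg)
{
	uint32_t e_end = e->e_start + e->e_len;			/* exclusive */
	uint32_t cg_end = cg->cg_firstblk + cg->cg_nblks;	/* exclusive */

	return (e->e_start < cg_end && cg->cg_firstblk < e_end);
}

/*
 * True if any of the inode's extents fall inside the CG.  The
 * per-inode cost is O(number of extents), not O(number of blocks),
 * which is the saving extents would buy; the full inode scan stays.
 */
bool
inode_touches_cg(const struct extent *exts, size_t nexts,
    const struct cg_range *cg)
{
	for (size_t i = 0; i < nexts; i++)
		if (extent_hits_cg(&exts[i], cg))
			return (true);
	return (false);
}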
> Anyway, support for all of these has to be done in the
> filesystem first before fsck can benefit.
>
> If instead you spend time "optimizing" just fsck, you will
> likely make it far more complex (and potentially harder to
> get right).

Multithreading fsck would be an incredibly bad idea.  Being able to
control the locality of the search process is your primary means of
controlling the amount of cache busting that can happen (see the
"squares" diagram, above).  The win you get, if any, has to come
from preloading CG bitmaps into RAM, effectively, so that the
process doesn't end up swapping.  Multithreading this destroys your
locality of reference.

The 10 second delay in starting fsck's on different FS's right now
is probably well past time to be removed.  The "-l" option would
have been useful here, but it's been removed.  If the disks are on
the same spindle, you still destroy locality by seeking between
regions while it's going.

BTW: someone needs to update the man page, particularly the
"SYNOPSIS".

If this is a BG fsck on an FS, you might want to consider something
like an "MADV_WIRE" argument to madvise(2), for the CG bitmaps in
core, so that the pages don't end up being forced in and out of swap
through normal cache effects.  For this to work well, you would want
to make it a root-only option, and to restrict its size to a
fraction of physical memory.

Personally, I recommend using a different FS, if you are going to
create a honking big FS as a single partition.  8-(.

-- Terry