From owner-freebsd-arch  Fri Nov  3 10:49:45 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132])
	by hub.freebsd.org (Postfix) with ESMTP id 4C73237B4CF
	for <arch@FreeBSD.ORG>; Fri,  3 Nov 2000 10:49:41 -0800 (PST)
Received: (from daemon@localhost)
	by smtp02.primenet.com (8.9.3/8.9.3) id LAA08446;
	Fri, 3 Nov 2000 11:45:39 -0700 (MST)
Received: from usr07.primenet.com(206.165.6.207)
 via SMTP by smtp02.primenet.com, id smtpdAAAvfaOCq; Fri Nov  3 11:45:31 2000
Received: (from tlambert@localhost)
	by usr07.primenet.com (8.8.5/8.8.5) id LAA20781;
	Fri, 3 Nov 2000 11:49:15 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <200011031849.LAA20781@usr07.primenet.com>
Subject: Re: Like to commit my diskprep
To: dhh@androcles.com (Duane H. Hesser)
Date: Fri, 3 Nov 2000 18:49:14 +0000 (GMT)
Cc: des@ofug.org (Dag-Erling Smorgrav), arch@FreeBSD.ORG,
	rjesup@wgate.com (Randell Jesup),
	mbendiks@eunet.no (Marius Bendiksen),
	dillon@earth.backplane.com (Matt Dillon),
	Cy.Schubert@uumail.gov.bc.ca (Cy Schubert - ITSD Open Systems Group)
In-Reply-To: <XFMail.001103085038.dhh@androcles.com> from "Duane H. Hesser" at Nov 03, 2000 08:50:38 AM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> You are too optimistic, when you say "nearly ten years".  McKusick,
> et al's paper "A Fast Filesystem for Unix", which describes the
> design of the 4.2BSD FFS, and some of the testing upon which it
> was based, is marked as "Revised July 27, 1983" in my copy of the
> 4.2BSD manuals printed by Usenix for 4.2BSD.  The copy in
> /usr/share/doc/smm/05.fastfs/ is "Revised February 18, 1984".

[ ... ]

> Perhaps it *is* time to rethink defaults.  Probably should be done
> at least once every millennium.

The defaults were rethought once.  The fictional geometry that
FreeBSD uses today ignores the track-to-track seek times.  This
was changed in the mid-1990s to account for disks that lied
about their geometry.

Using the fictional geometry, all of the optimizations related
to seek reduction, one of the primary foci of the FFS paper,
are disabled.

The block/cluster issue is one of fragmentation, not really of
optimization.  The ability to do clusters effectively prevents
fragmentation, taking the waste down to 50% of a frag size, on
average; for a 4k block size FS, the frag size is 512b, and for
an 8k FS, it's 1k, yielding unused frag averages of 256b and
512b, respectively.
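
To make the arithmetic above concrete, here is a minimal sketch.
It assumes 8 frags per block, which is the ratio implied by the
4k -> 512b and 8k -> 1k figures; the function names are made up
for illustration and are not taken from the FFS code.

```python
# Sketch of the frag arithmetic: frag size and average unused
# space in the last frag of a file.
FRAGS_PER_BLOCK = 8  # assumption implied by 4k -> 512b, 8k -> 1k


def frag_size(block_size):
    """Size of one frag, in bytes, for a given block size."""
    return block_size // FRAGS_PER_BLOCK


def avg_unused(block_size):
    """Average unused bytes: half of the final frag, on average."""
    return frag_size(block_size) // 2


for bs in (4096, 8192):
    print(bs, frag_size(bs), avg_unused(bs))
```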

The clustering code is meant to ensure relative locality of much
data within a single cylinder, while not penalizing the multiple
process locality case with too much seeking or rotational
latency.

Some of the assumptions there have changed, such as the number
of sectors in a track, and the use of inverted track recording
order to ensure sequential reads are in the cache (basically,
the data is prefetched by starting to read wherever you seek to,
and returning data once the sector you had asked for has been
read).

One potential performance benefit for large files would be to
increase the number of sectors in a cylinder group.  This is
not necessarily as big a win as you might think, since most DB
access is random.  The only thing that would change this is if
the average data object was larger than one cluster in size;
even then, the actual optimal cluster size would really depend;
for fixed size records, it would be "exactly one record".  If
the records weren't stored on at least 512b boundaries, this
would turn into a loss, since, given random I/O (poor locality),
you will still span a cluster at the start and end, with an
average probability of:

	oddness = (cluster_size%record_size) ? 1 : 0
	r_per_c = (cluster_size/record_size)
	P = r_per_c : oddness + .5

... the probability of a record spanning any given boundary.
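
As a rough cross-check of the idea (not an implementation of the
expression above), the sketch below simply counts how many
fixed-size records cross a cluster boundary.  It assumes records
are packed back to back starting at offset 0 and that every
record is equally likely to be accessed; spanning_fraction is a
name made up for this example.

```python
from math import gcd


def spanning_fraction(record_size, cluster_size):
    """Fraction of packed records that cross a cluster boundary.

    Counts over one full period of the layout (the least common
    multiple of the two sizes), after which the pattern repeats.
    """
    period = record_size * cluster_size // gcd(record_size, cluster_size)
    n_records = period // record_size
    spanning = 0
    for i in range(n_records):
        first = i * record_size          # first byte of record i
        last = first + record_size - 1   # last byte of record i
        if first // cluster_size != last // cluster_size:
            spanning += 1
    return spanning / n_records


# Records that divide the cluster evenly never span a boundary;
# odd sizes do.
print(spanning_fraction(512, 36864))
print(spanning_fraction(1000, 36864))
```

Under these assumptions, a record size that divides the cluster
size evenly gives zero, which matches the "exactly one record"
case.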

Generally, a DB that valued speed would frag storage by never
spanning a physical media blocksize boundary (though for small
records, it would probably put more than one per block, if it
had a reasonable confidence that it wouldn't have to move them
during a record expansion later).

If you guys want to experiment with log- and block-structured
FSs, by all means, do so, but I don't think that you'll end up
optimizing things unless your average object size is larger
than 1/2 of a cluster in size, which, to my mind, is a very
large object indeed (18k/36k for 4k/8k block size @ 9 blocks
per cluster).
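
The 18k/36k thresholds follow directly from the cluster size; a
minimal sketch of that arithmetic, using the 9 blocks per cluster
quoted above (half_cluster is an illustrative name only):

```python
BLOCKS_PER_CLUSTER = 9  # figure quoted in the text


def half_cluster(block_size):
    """Half of one cluster, in bytes, for a given block size."""
    return BLOCKS_PER_CLUSTER * block_size // 2


for bs in (4096, 8192):
    # Print block size and threshold, both in kilobytes.
    print(bs // 1024, half_cluster(bs) // 1024)
```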


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

