From: Scott Long <scottl@samsco.org>
Date: Wed, 11 Oct 2006 10:55:18 -0600
To: Mike Tancsa
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently

Mike Tancsa wrote:
> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
> wrote:
>
>> Mike Tancsa wrote:
>>
>>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>>> wrote:
>>>
>>>> this is only a crude hack.  I get around this right now by not using a
>>>> disklabel or fdisk table on arrays where I value speed.  For those, I
>>>> just put a filesystem directly on the array, and boot off of a small
>>>> system disk.
>>>
>>> How is that done?  Just newfs -O2 -U /dev/da0 ?
>>
>> Yup.
>
> Hi,
>      Is this going to work in most/all cases?  In other words, how
> do I make sure the file system I lay down is indeed properly /
> optimally aligned with the underlying structure?
>
> ---Mike

UFS1 skips the first 8k of its space to allow for bootstrapping/partitioning
data.  UFS2 skips the first 64k.  Blocks are then aligned to that skip.  64k
is a good alignment for most RAID cases, but understanding exactly how RAID-5
works will help you make appropriate choices.

(Note that in the following write-up I'm actually describing RAID-4.  The
only difference between RAID-4 and RAID-5 is that the parity data is spread
across all of the disks instead of being kept on a single dedicated disk.
However, this is just a performance detail, and it's easier to describe how
things work if you ignore it.)

As you might know, RAID-4/5 takes N disks and writes data to N-1 of them
while computing and writing a parity calculation to the Nth disk.  That
parity calculation is a logical XOR of the data disks.  One of the neat
properties of XOR is that it's reversible: you can take the final answer and
re-run the XOR using all but one of the original components, and get back the
data of the missing component.
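To make that XOR property concrete, here's a tiny C sketch (purely
illustrative; no real driver works on 8-byte blocks).  It computes a parity
block from three data blocks, then rebuilds one of them from the parity plus
the two survivors:

    #include <stdio.h>
    #include <string.h>

    #define BLOCK 8   /* tiny "stripe unit", just for illustration */

    /* XOR src into dst, byte by byte */
    static void xor_block(unsigned char *dst, const unsigned char *src)
    {
        for (int i = 0; i < BLOCK; i++)
            dst[i] ^= src[i];
    }

    int main(void)
    {
        unsigned char d0[BLOCK] = "disk 0";
        unsigned char d1[BLOCK] = "disk 1";
        unsigned char d2[BLOCK] = "disk 2";
        unsigned char parity[BLOCK] = { 0 };
        unsigned char rebuilt[BLOCK] = { 0 };

        /* parity = d0 ^ d1 ^ d2 */
        xor_block(parity, d0);
        xor_block(parity, d1);
        xor_block(parity, d2);

        /* "lose" d1, then recover it: d1 = parity ^ d0 ^ d2 */
        xor_block(rebuilt, parity);
        xor_block(rebuilt, d0);
        xor_block(rebuilt, d2);

        printf("rebuilt \"%s\": %s\n", (char *)rebuilt,
            memcmp(rebuilt, d1, BLOCK) == 0 ? "matches d1" : "mismatch!");
        return 0;
    }

A real array does exactly this, just at stripe-unit granularity and in the
driver or firmware.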
The array is divided into 'stripes', each stripe containing an equal
subsection of each data disk plus the parity disk.  When we talk about
'stripe size', what we are referring to is the size of one of those
subsections.  A 64k stripe size means that each disk is divided into 64k
subsections.  The total amount of data in a stripe is then a function of the
stripe size and the number of disks in the array.  If you have 5 disks in
your array and have set a stripe size of 64k, each stripe will hold a total
of 256k of data (4 data disks and 1 parity disk).

Every time you write to a RAID-5 array, parity needs to be updated.  As
everything operates in terms of stripes, the most straightforward way to do
this is to read all of the data from the stripe, replace the portion that is
being written, recompute the parity, and then write out the updates.  This is
also the slowest way to do it.

An easy optimization is to buffer the writes and look for situations where
all of the data in a stripe is being written sequentially.  If all of the
data in the stripe is being replaced, there is no need to read any of the old
data.  Just collect all of the writes together, compute the parity, and write
everything out all at once.

Another optimization is to recognize when only one member of the stripe is
being updated.  For that, you read the parity, read the old data, and then
XOR out the old data and XOR in the new data.  You still have the latency of
waiting for a read, but on a busy system you reduce head movement on all of
the disks, which is a big win.

Both of these optimizations rely on the writes having a certain amount of
alignment.  If your stripe size is 64k and your writes are 64k, but they all
start at an 8k offset into the stripe, you lose.  Each 64k write will have to
touch 56k of one disk and 8k of the next disk.  But an 8k offset can be made
to work if you reduce your stripe size to 8k.  It then becomes an exercise in
balancing the parameters of FS block size and array stripe size to give you
the best performance for your needs.  The 64k offset in UFS2 gives you more
room to work with here, which is why I said at the beginning that it's a good
value.  In any case, you want to choose parameters that result in each block
write covering either a single disk or a whole stripe.

Where things really go bad for BSD is when a _63_ sector offset gets
introduced for the MBR.  Now everything is offset to an odd, non-power-of-2
value, and there isn't anything that you can tweak in the filesystem or array
to compensate.  The best you can do is to manually calculate a compensating
offset in the disklabel for each partition.  But at that point, it often
becomes easier to just ditch all of that and put the filesystem directly on
the disk.

Scott
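P.S.  To make the alignment math concrete, here's a toy C calculation (my
illustration only, reusing the 5-disk, 64k-stripe example from above; real
RAID drivers do this mapping internally).  It shows that a 64k write lands on
a single disk when the filesystem sits at offset 0 on the array, but
straddles two disks once the 63-sector MBR offset (63 * 512 = 32256 bytes)
gets in front of it:

    #include <stdio.h>

    /*
     * Toy model of the 5-disk, 64k-stripe array above: 4 data disks
     * plus parity, so each stripe holds 256k of data.  All offsets
     * are in bytes.
     */
    #define SUNIT  (64 * 1024)       /* stripe unit on each disk */
    #define NDATA  4                 /* data disks */
    #define SWIDTH (SUNIT * NDATA)   /* data per stripe: 256k */

    static void where(long long part_off, long long fs_off, long long len)
    {
        long long start = part_off + fs_off;
        long long end = start + len - 1;
        /* which data disk holds the first and last byte of the write */
        int first = (int)(start % SWIDTH / SUNIT);
        int last = (int)(end % SWIDTH / SUNIT);

        printf("%lldk write at array offset %lld: disk %d..%d (%s)\n",
            len / 1024, start, first, last,
            first == last ? "one disk" : "split, read-modify-write");
    }

    int main(void)
    {
        /* filesystem directly on the array: 64k-aligned writes */
        where(0, 64 * 1024, 64 * 1024);
        /* the same write behind a 63-sector MBR offset */
        where(63LL * 512, 64 * 1024, 64 * 1024);
        return 0;
    }

Run it and you'll see the first write stays on one disk, while the second
splits across two and forces the read-modify-write path described above.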