From: Scott Long <scottl@samsco.org>
Date: Wed, 11 Oct 2006 10:55:18 -0600
To: Mike Tancsa
Cc: freebsd-fs@freebsd.org
Subject: Re: 2 bonnies can stop disk activity permanently

Mike Tancsa wrote:
> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
> wrote:
>
>> Mike Tancsa wrote:
>>
>>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>>> wrote:
>>>
>>>> this is only a crude hack.  I get around this right now by not using a
>>>> disklabel or fdisk table on arrays where I value speed.  For those, I
>>>> just put a filesystem directly on the array, and boot off of a small
>>>> system disk.
>>>
>>> How is that done?  Just newfs -O2 -U /dev/da0 ?
>>
>> Yup.
>
> Hi,
>      Is this going to work in most/all cases?  In other words, how
> do I make sure the file system I lay down is indeed properly /
> optimally aligned with the underlying structure?
>
> ---Mike

UFS1 skips the first 8k of its space to allow for bootstrapping/partitioning
data.  UFS2 skips the first 64k.  Blocks are then aligned to that skip.  64k
is a good alignment for most RAID cases, but understanding exactly how RAID-5
works will help you make appropriate choices.

(Note that in the following write-up I'm actually describing RAID-4.  The
only difference between RAID-4 and RAID-5 is that the parity data is spread
across all of the disks instead of being kept on a single dedicated disk.
However, this is just a performance detail, and it's easier to describe how
things work if you ignore it.)

As you might know, RAID-4/5 takes N disks and writes data to N-1 of them
while computing and writing a parity calculation to the Nth disk.  That
parity calculation is a logical XOR of the data disks.  One of the neat
properties of XOR is that it's reversible: you can take the final answer and
re-run the XOR using all but one of the original components, and get back the
data of the missing component.
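To make that XOR property concrete, here's a tiny C sketch (purely
illustrative; no real driver works on 8-byte blocks).  It computes a parity
block from three data blocks, then rebuilds one of them from the parity plus
the two survivors:

    #include <stdio.h>
    #include <string.h>

    #define BLOCK 8   /* tiny "stripe unit", just for illustration */

    /* XOR src into dst, byte by byte */
    static void xor_block(unsigned char *dst, const unsigned char *src)
    {
        for (int i = 0; i < BLOCK; i++)
            dst[i] ^= src[i];
    }

    int main(void)
    {
        unsigned char d0[BLOCK] = "disk 0";
        unsigned char d1[BLOCK] = "disk 1";
        unsigned char d2[BLOCK] = "disk 2";
        unsigned char parity[BLOCK] = { 0 };
        unsigned char rebuilt[BLOCK] = { 0 };

        /* parity = d0 ^ d1 ^ d2 */
        xor_block(parity, d0);
        xor_block(parity, d1);
        xor_block(parity, d2);

        /* "lose" d1, then recover it: d1 = parity ^ d0 ^ d2 */
        xor_block(rebuilt, parity);
        xor_block(rebuilt, d0);
        xor_block(rebuilt, d2);

        printf("rebuilt \"%s\": %s\n", (char *)rebuilt,
            memcmp(rebuilt, d1, BLOCK) == 0 ? "matches d1" : "mismatch!");
        return 0;
    }

A real array does exactly this, just at stripe-unit granularity and in the
driver or firmware.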
The array is divided into 'stripes', each stripe containing an equal
subsection of each data disk plus the parity disk.  When we talk about
'stripe size', what we are referring to is the size of one of those
subsections.  A 64k stripe size means that each disk is divided into 64k
subsections.  The total amount of data in a stripe is then a function of the
stripe size and the number of disks in the array.  If you have 5 disks in
your array and have set a stripe size of 64k, each stripe will hold a total
of 256k of data (4 data disks and 1 parity disk).

Every time you write to a RAID-5 array, parity needs to be updated.  As
everything operates in terms of stripes, the most straightforward way to do
this is to read all of the data from the stripe, replace the portion that is
being written, recompute the parity, and then write out the updates.  This is
also the slowest way to do it.

An easy optimization is to buffer the writes and look for situations where
all of the data in a stripe is being written sequentially.  If all of the
data in the stripe is being replaced, there is no need to read any of the old
data.  Just collect all of the writes together, compute the parity, and write
everything out all at once.

Another optimization is to recognize when only one member of the stripe is
being updated.  For that, you read the parity, read the old data, and then
XOR out the old data and XOR in the new data.  You still have the latency of
waiting for a read, but on a busy system you reduce head movement on all of
the disks, which is a big win.

Both of these optimizations rely on the writes having a certain amount of
alignment.  If your stripe size is 64k and your writes are 64k, but they all
start at an 8k offset into the stripe, you lose.  Each 64k write will have to
touch 56k of one disk and 8k of the next disk.  But an 8k offset can be made
to work if you reduce your stripe size to 8k.  It then becomes an exercise in
balancing the parameters of FS block size and array stripe size to give you
the best performance for your needs.  The 64k offset in UFS2 gives you more
room to work with here, which is why I said at the beginning that it's a good
value.  In any case, you want to choose parameters that result in each block
write covering either a single disk or a whole stripe.

Where things really go bad for BSD is when a _63_ sector offset gets
introduced for the MBR.  Now everything is offset to an odd, non-power-of-2
value, and there isn't anything that you can tweak in the filesystem or array
to compensate.  The best you can do is to manually calculate a compensating
offset in the disklabel for each partition.  But at that point, it often
becomes easier to just ditch all of that and put the filesystem directly on
the disk.

Scott
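P.S.  To make the alignment math concrete, here's a toy C calculation (my
illustration only, reusing the 5-disk, 64k-stripe example from above; real
RAID drivers do this mapping internally).  It shows that a 64k write lands on
a single disk when the filesystem sits at offset 0 on the array, but
straddles two disks once the 63-sector MBR offset (63 * 512 = 32256 bytes)
gets in front of it:

    #include <stdio.h>

    /*
     * Toy model of the 5-disk, 64k-stripe array above: 4 data disks
     * plus parity, so each stripe holds 256k of data.  All offsets
     * are in bytes.
     */
    #define SUNIT  (64 * 1024)       /* stripe unit on each disk */
    #define NDATA  4                 /* data disks */
    #define SWIDTH (SUNIT * NDATA)   /* data per stripe: 256k */

    static void where(long long part_off, long long fs_off, long long len)
    {
        long long start = part_off + fs_off;
        long long end = start + len - 1;
        /* which data disk holds the first and last byte of the write */
        int first = (int)(start % SWIDTH / SUNIT);
        int last = (int)(end % SWIDTH / SUNIT);

        printf("%lldk write at array offset %lld: disk %d..%d (%s)\n",
            len / 1024, start, first, last,
            first == last ? "one disk" : "split, read-modify-write");
    }

    int main(void)
    {
        /* filesystem directly on the array: 64k-aligned writes */
        where(0, 64 * 1024, 64 * 1024);
        /* the same write behind a 63-sector MBR offset */
        where(63LL * 512, 64 * 1024, 64 * 1024);
        return 0;
    }

Run it and you'll see the first write stays on one disk, while the second
splits across two and forces the read-modify-write path described above.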