Date:        Wed, 11 Oct 2006 14:41:19 -0500
From:        Eric Anderson <anderson@centtech.com>
To:          Scott Long <scottl@samsco.org>
Cc:          freebsd-fs@freebsd.org
Subject:     Re: 2 bonnies can stop disk activity permanently
Message-ID:  <452D48DF.5010502@centtech.com>
In-Reply-To: <452D21F6.20601@samsco.org>
References:  <45297DA2.4000509@fluffles.net> <20061010051216.G814@epsplex.bde.org>
             <452AB55D.9090607@samsco.org> <ltgli2hqprmfqq3ttahrtbppj4l3dkt4b7@4ax.com>
             <452AC4EB.8000006@samsco.org> <dl4qi2t4ms9gp44crfa0marascvoo93grf@4ax.com>
             <452D21F6.20601@samsco.org>
On 10/11/06 11:55, Scott Long wrote:
> Mike Tancsa wrote:
>> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
>> wrote:
>>
>>> Mike Tancsa wrote:
>>>
>>>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>>>> wrote:
>>>>
>>>>> this is only a crude hack.  I get around this right now by not
>>>>> using a disklabel or fdisk table on arrays where I value speed.
>>>>> For those, I just put a filesystem directly on the array, and
>>>>> boot off of a small system disk.
>>>>
>>>> How is that done?  Just newfs -O2 -U /dev/da0 ?
>>>
>>> Yup.
>>
>> Hi,
>>     Is this going to work in most/all cases?  In other words, how do
>> I make sure the file system I lay down is indeed properly/optimally
>> aligned with the underlying structure?
>>
>> ---Mike
>
> UFS1 skips the first 8k of its space to allow for
> bootstrapping/partitioning data.  UFS2 skips the first 64k.  Blocks
> are then aligned to that skip.  64K is a good alignment for most RAID
> cases.  But understanding exactly how RAID-5 works will help you make
> appropriate choices.
>
> (Note that in the following write-up I'm actually describing RAID-4.
> The only difference between RAID-4 and RAID-5 is that in RAID-5 the
> parity data is spread across all of the disks instead of being kept
> on a single disk.  However, this is just a performance detail, and
> it's easier to describe how things work if you ignore it.)
>
> As you might know, RAID-4/5 takes N disks and writes data to N-1 of
> them while computing and writing a parity calculation to the Nth
> disk.  That parity calculation is a logical XOR of the data disks.
> One of the neat properties of XOR is that it's reversible; you can
> take the final answer and re-run the XOR using all but one of the
> original components, and the result is the data of the missing
> component.
>
> The array is divided into 'stripes', each stripe containing an equal
> subsection of each data disk plus the parity disk.  When we talk
> about 'stripe size', what we are referring to is the size of one of
> those subsections.  A 64K stripe size means that each disk is divided
> into subsections of 64K apiece.  The total amount of data in a stripe
> is then a function of the stripe size and the number of disks in the
> array.  If you have 5 disks in your array and have set a stripe size
> of 64K, each stripe will hold a total of 256K of data (4 data disks
> and 1 parity disk).
>
> Every time you write to a RAID-5 array, parity needs to be updated.
> As everything operates in terms of stripes, the most straightforward
> way to do this is to read all of the data from the stripe, replace
> the portion that is being written, recompute the parity, and then
> write out the updates.  This is also the slowest way to do it.
>
> An easy optimization is to buffer the writes and look for situations
> where all of the data in a stripe is being written sequentially.  If
> all of the data in the stripe is being replaced, there is no need to
> read any of the old data.  Just collect all of the writes together,
> compute the parity, and write everything out all at once.
>
> Another optimization is to recognize when only one member of the
> stripe is being updated.  For that, you read the parity, read the old
> data, and then XOR out the old data and XOR in the new data.  You
> still have the latency of waiting for a read, but on a busy system
> you reduce head movement on all of the disks, which is a big win.
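The XOR bookkeeping behind those two write paths is small enough to
sketch in a few lines of C.  This is only an illustration of the
arithmetic described above: the function names, the byte-wise loops,
and the buffer layout are made up for the example, and a real array
controller works on whole sectors and spends its effort on I/O
scheduling rather than this part.

#include <stddef.h>
#include <string.h>

/* Full-stripe path: parity is the XOR of every data member. */
static void
parity_full_stripe(unsigned char *parity, unsigned char *const data[],
    int ndata, size_t len)
{
    memset(parity, 0, len);
    for (int d = 0; d < ndata; d++)
        for (size_t i = 0; i < len; i++)
            parity[i] ^= data[d][i];
}

/*
 * Small-write path: only one member changes, so read the parity and
 * the old data, XOR the old contents out and the new contents in.
 * The other data members never have to be touched.
 */
static void
parity_small_write(unsigned char *parity, const unsigned char *olddata,
    const unsigned char *newdata, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= olddata[i] ^ newdata[i];
}

The same reversibility is what rebuilds a failed member: XOR the
parity block with the surviving data blocks and the result is the
missing block.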
>
> Both of these optimizations rely on the writes having a certain
> amount of alignment.  If your stripe size is 64k and your writes are
> 64k, but they all start at an 8k offset into the stripe, you lose.
> Each 64K write will have to touch 56k of one disk and 8k of the next
> disk.  But an 8k offset can be made to work if you reduce your stripe
> size to 8k.  It then becomes an exercise in balancing the parameters
> of FS block size and array stripe size to give you the best
> performance for your needs.  The 64k offset in UFS2 gives you more
> room to work with here, which is why I said at the beginning that
> it's a good value.  In any case, you want to choose parameters that
> result in each block write covering either a single disk or a whole
> stripe.
>
> Where things really go bad for BSD is when a _63_ sector offset gets
> introduced for the MBR.  Now everything is offset to an odd,
> non-power-of-2 value, and there isn't anything you can tweak in the
> filesystem or the array to compensate.  The best you can do is to
> manually calculate a compensating offset in the disklabel for each
> partition.  But at that point, it often becomes easier to just ditch
> all of that and put the filesystem directly on the disk.
>
> Scott

Scott,

Just wanted to say thanks for such a well-put explanation on this,
with all the right details.

Eric

-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------
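For anyone who wants to plug in their own numbers, here is a rough
sketch of the alignment arithmetic from Scott's write-up: given a
slice offset, the filesystem's leading skip, a block size, and a
stripe unit, it reports whether a block write covers whole stripe
units or gets split across two disks.  The 63-sector, 8K, and 64K
figures are the ones from the discussion; everything else is
illustrative and nothing here comes from an actual driver.

#include <stdio.h>

/*
 * Rough check: does a filesystem block write stay on one disk or
 * cover whole stripe units, or does it get split across a stripe
 * unit boundary, forcing the array to touch two disks?
 */
static void
check_alignment(long long part_off, long long fs_skip, long long blksize,
    long long stripe_unit)
{
    long long start = part_off + fs_skip;   /* absolute byte offset */
    long long lead = start % stripe_unit;   /* misalignment at the front */

    printf("offset %7lld, %3lldK block, %2lldK stripe unit: ",
        start, blksize / 1024, stripe_unit / 1024);
    if (start / stripe_unit == (start + blksize - 1) / stripe_unit)
        printf("stays on one disk\n");
    else if (lead == 0 && blksize % stripe_unit == 0)
        printf("covers whole stripe units\n");
    else
        printf("split (%lld bytes at the front, %lld at the back)\n",
            stripe_unit - lead, (start + blksize) % stripe_unit);
}

int
main(void)
{
    /* newfs directly on the array: UFS2's 64K skip stays aligned. */
    check_alignment(0, 64 * 1024, 64 * 1024, 64 * 1024);

    /* The same filesystem behind a 63-sector MBR slice (63 * 512). */
    check_alignment(63LL * 512, 64 * 1024, 64 * 1024, 64 * 1024);

    /* An 8K offset loses against a 64K stripe unit... */
    check_alignment(8 * 1024, 64 * 1024, 64 * 1024, 64 * 1024);

    /* ...but is fine if the stripe unit is reduced to 8K. */
    check_alignment(8 * 1024, 64 * 1024, 64 * 1024, 8 * 1024);

    return (0);
}

Run against those cases, it shows the 63-sector MBR offset splitting
every 64K block, while newfs directly on the array, or shrinking the
stripe unit to match the offset, keeps the writes aligned.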