Date:      Thu, 6 Sep 2001 15:23:01 +0930
From:      Greg Lehey <grog@FreeBSD.org>
To:        Doug Hardie <bc979@lafn.org>
Cc:        David Gilbert <dgilbert@velocet.ca>, Lawrence Farr <l.farr@epcdirect.co.uk>, 'Lawrence Farr' <lawrence@epcdirect.co.uk>, 'Chris BeHanna' <behanna@zbzoom.net>, 'FreeBSD-Stable' <stable@FreeBSD.ORG>
Subject:   Re: [stable] Re: RAID5
Message-ID:  <20010906152301.J24413@wantadilla.lemis.com>
In-Reply-To: <f0433010ab7bc414d46c7@[10.0.1.100]>; from bc979@lafn.org on Wed, Sep 05, 2001 at 01:57:50PM -0700
References:  <002c01c135e4$69c924d0$c80aa8c0@lfarr> <f04330116b7bbe195e610@[10.0.1.100]> <15254.16593.350305.548246@trooper.velocet.net> <f0433010ab7bc414d46c7@[10.0.1.100]>

On Wednesday,  5 September 2001 at 13:57:50 -0700, Doug Hardie wrote:
> At 11:12 -0400 9/5/01, David Gilbert wrote:
>> Well... FreeBSD doesn't use a 'fast write' disk (although this is an
>> interesting idea), but writing a single block of RAID-5 data requires
>> a read of the previous data, a read of the parity block then a write
>> of the data and a write of the parity block --- 4 I/O operations.
>
> It is the distributing of the data among the disks that is required
> for write that makes it slower than read.

No, on RAID-5 the issue is more complicated.  You first need to
calculate parity.  There are two basic approaches:

1.  Aim for whole-stripe writes.  That way, you can calculate the
    parity from the data you have.

2.  First read (or cache) the old contents of the data and parity
    blocks.  Use them to calculate the new parity, then write both
    back.
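
To make the two approaches concrete, here is a minimal sketch in
Python, assuming plain XOR parity and equal-sized blocks.  The function
names are made up for illustration; this is not how any particular
RAID-5 implementation is coded.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length blocks, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(data_blocks):
    """Approach (1): parity computed purely from the data being written."""
    return xor_blocks(*data_blocks)


def updated_parity(old_parity, old_data, new_data):
    """Approach (2): parity derived from the old parity and old data."""
    return xor_blocks(old_parity, old_data, new_data)


# Four data blocks of a hypothetical 5-disk plex (the fifth holds parity).
stripe = [bytes([i] * 4) for i in (1, 2, 3, 4)]
parity = full_stripe_parity(stripe)

# Change one block via approach (2) and check it matches a full recompute.
new_block = bytes([9] * 4)
rmw = updated_parity(parity, stripe[2], new_block)
stripe[2] = new_block
assert rmw == full_stripe_parity(stripe)
```

Approach (2) touches only the changed block and the parity block, but
it has to read both before it can write them.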

Consider the two alternatives with, say, a 5-disk plex (set).  The
only way to get situation (1) is to use small blocks.  UFS transfers
tend to be on the order of 6 kB, though they can be as high as 60 kB
(and yes, they have nothing much to do with the file system block
size).  So you go for small stripes, say 1.5 kB (because it fits in
with this example).  You perform five writes (four data, one parity),
and that's all.  Because the writes go to different disks, they can
go in parallel.  Total time is about the same as for a normal write.
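
A rough way to see how many per-disk blocks a transfer covers is the
sketch below.  It works in 512-byte sectors, so 1.5 kB is 3 sectors and
6 kB is 12.  The function name is hypothetical, and the sketch ignores
parity placement and rotation altogether.

```python
def stripe_units_touched(offset: int, length: int, stripe_size: int) -> int:
    """Count the per-disk stripe units covered by a transfer.

    All arguments are in 512-byte sectors.
    """
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return last - first + 1


# Aligned 6 kB transfer on 1.5 kB stripes: exactly the four data blocks
# of one stripe on a 5-disk plex, so parity comes from the new data alone.
aligned = stripe_units_touched(0, 12, 3)

# The same transfer starting one sector later spills into a fifth unit,
# so it no longer covers whole stripes only.
unaligned = stripe_units_touched(1, 12, 3)
```

The unaligned case is why you can't rely on always getting whole-stripe
writes.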

There's obviously the problem here that you can't rely on having
ideally sized blocks.  That's OK, though; you'll get enough for this
approach to look attractive.

In the case of (2), by contrast, you need to read the old contents of
the data and parity you're changing, and then write them out again.
Twice the number of transfers.  Half the speed?

If you use the same block sizes, yes, that's half the speed.  But the
whole argument is flawed.  You can't look at the elapsed time for a
single transfer.  Look at the time you're keeping the disks busy,
times the number of disks.  In version (1) you're positioning (5.9 ms)
and transferring (0.1 ms) on five disks.  Total time 30 ms.  In version
(2) it would take 60 ms--*if* you use this stripe size.  Now increase
the stripe size to 512 kB.  Presto! In all probability, your 6 kB
transfer will go to 1 disk only.  You still need 4 transfers (read
and write of both data and parity), but that's only 24 ms, while
30 ms is the theoretical minimum for version (1).
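
The arithmetic can be sketched as follows, assuming (as the figures
above do) that every transfer costs the same 5.9 ms positioning plus
0.1 ms transfer, whatever its size.  The names are illustrative only.

```python
POSITION_MS = 5.9   # positioning time per transfer, from the figures above
TRANSFER_MS = 0.1   # data transfer time per transfer, from the figures above


def disk_busy_ms(transfers: int) -> float:
    """Total time disks are kept busy, summed over all transfers."""
    return transfers * (POSITION_MS + TRANSFER_MS)


# Version (1): one write to each of the five disks.
full_stripe = disk_busy_ms(5)

# Version (2) with 1.5 kB stripes: read and write on each of five disks.
rmw_small_stripe = disk_busy_ms(2 * 5)

# Version (2) with 512 kB stripes: read and write of one data block
# and one parity block.
rmw_large_stripe = disk_busy_ms(2 * 2)

for name, ms in [("full stripe", full_stripe),
                 ("r/m/w, small stripe", rmw_small_stripe),
                 ("r/m/w, large stripe", rmw_large_stripe)]:
    print(f"{name}: {ms:.0f} ms")
```

What counts is total disk busy time, not the elapsed time of one
request: the less time each request ties up the disks, the more
requests the plex can handle.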

This shows the real issue: far too many people measure performance in
a completely different environment from practice.  The result is
frequently meaningless.

Greg
--
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message