Date:      Thu, 3 Feb 2000 12:20:27 +1030
From:      Greg Lehey <grog@lemis.com>
To:        "Justin T. Gibbs" <gibbs@FreeBSD.org>
Cc:        Gary Palmer <gjp@in-addr.com>, scsi@FreeBSD.org, up@3.am, Wilko Bulte <wilko@yedi.iaf.nl>
Subject:   Re: Definitions of RAID levels (was: hardware vs software stripping)
Message-ID:  <20000203122027.O55303@freebie.lemis.com>
In-Reply-To: <200002021531.IAA00607@caspian.plutotech.com>
References:  <20000202123317.P55303@freebie.lemis.com> <200002021531.IAA00607@caspian.plutotech.com>

On Wednesday,  2 February 2000 at  8:31:21 -0700, Justin T. Gibbs wrote:
>>>> My understanding is that RAID-3, effectively striping at a sub-sector
>>>> level, can give much higher data rates without buffering, and that's
>>>> its raison d'être.
>>>
>>> If you stripe at the sub-sector level, you must perform RMW.  This makes
>>> absolutely no sense.
>>
>> I think you're misunderstanding my use of the term "stripe".  I'm not
>> talking about "transactions" here, I'm talking about layout.  If I
>> have a 9 disk RAID-[345] set with a stripe size of 64 bytes, I can
>> read one sector from each of the 8 data disks and have a total of 8
>> sectors.  I can do the same thing if each disk contains an individual
>> bit of a byte.  Older disk and drum technology used a very similar
>> method (multiple heads) to speed up transfer times.  With relatively
>> simple hardware support, this would make a lot of sense, and if RAID-3
>> is really what you say, it makes me wonder why people haven't thought
>> of this alternative.
>
> Take a look at this diagram:
>
> http://sunsite.berkeley.edu/Dienst/UI/2.0/Page/ncstrl.ucb/CSD-87-391/16

Nice.  Thanks for the pointer.

> They don't use "minimum transaction size", they use "transfer units".
> It's the same thing.  In this example, the "transfer unit" is a
> sector.

I think this is a very important part of the example.

> In Pluto's system, the effective sector size is 64K (it's too
> inefficient to perform sector I/O) and the "transfer unit" is a
> block of video frames.

OK, but the RAID-3 example shows that individual sectors contain data
for all four of the transfer units; effectively it has mapped data in
units of 128 bytes, a sub-sector division, which is what I have been
saying all the time.
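
A minimal sketch of the mapping described above, assuming 512-byte
sectors and four data disks (so each strip is 128 bytes).  The parity
disk is left out, and the function names are invented for illustration:

```python
SECTOR = 512                    # bytes per sector (assumed)
DATA_DISKS = 4                  # data disks in the example; parity disk omitted
STRIP = SECTOR // DATA_DISKS    # 128 bytes: the sub-sector mapping unit


def locate(offset):
    """Map a logical byte offset to (data disk, byte offset on that disk)."""
    strip_no, within = divmod(offset, STRIP)
    row, disk = divmod(strip_no, DATA_DISKS)
    return disk, row * STRIP + within


def disks_for_transfer(offset, length):
    """Return the sorted list of data disks that a transfer touches."""
    first = offset // STRIP
    last = (offset + length - 1) // STRIP
    return sorted({s % DATA_DISKS for s in range(first, last + 1)})


if __name__ == "__main__":
    # One logical sector is spread over all four data disks.
    print(disks_for_transfer(0, SECTOR))
    # The second logical sector starts 128 bytes into disk 0.
    print(locate(SECTOR))
```

With this layout even a single-sector read touches every data disk,
which is the sense in which the member disks are not independently
accessible.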

> If we are recording uncompressed video, a video frame is ~500K.
> This means you must read more than one of the drives in a stripe in
> order to get the entire frame, but it may be possible to not read
> them all to get just a single frame.  This is what I meant by our
> system allowing independent access, but my assertion that it didn't
> buy us anything.  Putting all of a particular frame's data on a
> single drive would yield too much latency for random frame fetches,
> so we don't use that layout.

Sure.  I have no issue with the way you're doing things; it makes a
lot of sense.

> The main point I've been trying to make in all of this is that the
> data need not be bit or byte striped.  In the example in the
> Berkeley paper, the disk strip size is 1/4th of a sector.  The
> distinction is all based on what your "record" size is and whether
> you can store records without crossing disk boundaries so it makes
> sense to allow independent access.

Now that's the point I've been trying to make.  Unfortunately, the
example doesn't make it clear how the striping would be if we were
transferring, say, 4 sectors.  Based on the fact that we can't
generally determine the size of the transfer in advance, I'd claim
that this mapping would remain if you increase the transfer size to,
say, 1 MB.  As I said earlier, with appropriate hardware support, this
can be a very efficient way of handling large transfers.  Without this
support, such as in a FreeBSD environment, it doesn't make any sense.

>>> If your transaction is larger, perhaps you satisfy it by modifying 1 or
>>> more full stripes and only partially modifying the border stripes.
>>> The point is still the same.
>>
>> Well, I can't see that.  You're saying that RAID-4 stripes should be a
>> multiple of the transaction size, and I'm saying the "transaction"
>> size is variable.  The "point" seems to be that this is the main
>> difference in your definitions of RAID-3 and RAID-4.
>
> If a user requests to read 64K of a file on an 8K file system, can you
> not see that on a system where each block is on an independent spindle
> and those 64K happen to be contiguous that you are not forced to make
> more than 10 read transactions even if your stripe covered more disks
> than that? 

Indeed I can.  That's why I recommend large stripe sizes.
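
The arithmetic behind that recommendation can be sketched as follows.
The helper name and the 256 kB comparison value are assumptions chosen
for illustration; the 8 kB and 64 kB figures come from the example
above.  The sketch counts one request per stripe unit touched:

```python
KB = 1024


def requests_needed(offset, length, stripe_size):
    """Count the stripe units (and so per-disk requests) a transfer touches."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return last - first + 1


if __name__ == "__main__":
    # 64 kB transfer, 8 kB stripes, aligned and unaligned start
    print(requests_needed(0, 64 * KB, 8 * KB))
    print(requests_needed(4 * KB, 64 * KB, 8 * KB))
    # Same transfer with a (hypothetical) 256 kB stripe size
    print(requests_needed(0, 64 * KB, 256 * KB))
```

Under these assumptions the 8 kB stripes need eight or nine requests
for the 64 kB transfer, while a stripe larger than the transfer usually
needs one and never more than two.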

> If, on the other hand, you striped each 8k block across all drives,
> you'd not have that luxury.  That is the difference.

I think we've got hung up on this definition of "transaction".  I'd
prefer to leave the transfer size out of it and look at the mapping to
disk.

>> OK, and you could do this without changing the physical layout?  In
>> that case, I'd suggest this is RAID-4, not RAID-3.  Note that the text
>> you quote states:
>
> It is only RAID-4 if you can access a single disk and get all of the
> required data to do something useful.  This is not the case in our
> system.

If I have a RAID-5 plex with 8 kB stripes, and I make a 32 kB transfer
(neither is exceptional: though I disapprove of such small stripes,
it seems some vendors recommend them), I can't get all the required
data from one drive.  In fact, I may not be able to get it from one
stripe.  I don't think this makes it any less RAID-5.
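
A rough sketch of this case.  The number of disks (five) and the parity
rotation used here are assumptions for illustration only, not the
layout of any particular implementation, and the function name is
invented:

```python
STRIPE = 8 * 1024   # per-disk stripe size from the example above


def raid5_pieces(offset, length, ndisks):
    """Split a transfer into (row, disk, nbytes) pieces.

    Assumed layout: each row holds ndisks - 1 data stripes plus one
    parity stripe, and the parity stripe of row r is on disk r % ndisks.
    """
    data_per_row = ndisks - 1
    pieces = []
    end = offset + length
    while offset < end:
        unit, within = divmod(offset, STRIPE)
        row, col = divmod(unit, data_per_row)
        parity = row % ndisks
        disk = col if col < parity else col + 1   # skip the parity disk
        nbytes = min(STRIPE - within, end - offset)
        pieces.append((row, disk, nbytes))
        offset += nbytes
    return pieces


if __name__ == "__main__":
    # Aligned 32 kB transfer: four drives, one row
    print(raid5_pieces(0, 32 * 1024, 5))
    # The same transfer starting 8 kB later spills into the next row
    print(raid5_pieces(8 * 1024, 32 * 1024, 5))
```

In both cases the transfer needs several drives, and in the second it
needs two rows, yet the layout is plainly still RAID-5.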

>>   Unlike RAID Level 3, however, a RAID Level 4 array's member disks
>>   are independently accessible.
>>
>> This still suggests to me that there is something about RAID-3 layout,
>> not the software implementation, which makes it impossible to access
>> drives individually.
>
> I've already covered why this is the case.  There was also the
> assumption in the past that the spindles would be synchronized.
> This is no longer the typical case.

Agreed.  But there's a difference between being able to read a sector
from a disk, and having to read the whole stripe even if you only want
a sector.

> Anyway, that's all I have to say about RAID levels.  I'm sorry I
> ever brought it up.

Well, sorry if I have to disagree with you, but I do believe it's
important to get our definitions correct.  I'm not criticizing the
Pluto implementation, which makes sense to me, but I still haven't
seen any evidence that it's RAID-3.  I'd call it RAID-4.  As I said
before, I don't believe that RAID-3 is much use with modern hardware.

The URL you sent is obviously part of something larger.  I'll check it
out.  Thanks again for the pointer.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers

