Date: Thu, 25 Feb 1999 12:05:57 -0500
From: Doug Ledford <dledford@redhat.com>
To: Maxwell Spangler <maxwell@clark.net>
Cc: AIC7xxx@FreeBSD.ORG
Subject: Re: Adaptec 7890 and RAID portIII RAID controller Linux Support
Message-ID: <36D582F5.AC7405E3@redhat.com>
References: <Pine.LNX.4.04.9902241451370.23862-100000@maxwax.doghouse.com>
Maxwell Spangler wrote:
>
> On Mon, 11 Jan 1999, Doug Ledford wrote:
>
> > To sum up my impressions, hardware RAID is a waste of money.  It
> > doesn't buy speed any more (it used to when a hot server was a 486/33
> > and you had an i960 chip on the RAID controller).  The newest RAID5
> > and RAID1 code from Ingo Molnar is *quite* reliable and pretty much
> > on par with what you would get in a hardware raid array.  The real
> > reason for raid used to be reliability in the face of failure.  Any
> > more, with as reliable as the software has gotten, I consider the
> > hardware raid arrays simply another possible point of failure.  I
> > would go software raid if I were you.
>
> But isn't offloading processing of any sort to a specialised chip or
> device a good thing?  (Considering modern day hardware, not older stuff)
>
> For example: (Completely fictional comparison example)
>
> A PII-233 performing software OpenGL can produce 500 3D video
> operations in one second, but it takes 30% of the CPU's processing time
> to do so.
>
> A second PII-233 performing the same task with the assistance of a
> hardware 3D device can perform the same number of operations, but
> reduce the amount of CPU processing time to 5%.

If that were the case, then I would agree.  But in the case of software
RAID, my experience has been that this isn't true.  It's more like this:
a PII-233 with a hardware RAID controller might do 150 I/O ops per second
and 15MByte/s of data transfer while only using 30% of the CPU.  Another
PII-233 with software RAID might do 400 I/O ops per second and 30MByte/s
of data transfer, but use 75% of the CPU.  Keep in mind, though, that the
increased throughput adds dramatically to the CPU usage all by itself,
before you even count the RAID5 parity work being done in software.
So, the end result is that it does cost some CPU processing power, but
the average CPU running Ingo's software RAID is *SOOOOO* much faster than
the typical hardware RAID that you come out way ahead in the long run
(what good is an idle CPU if the reason it's idle is that your RAID
controller is slow?).

> Case #2 would be better, and for years a lot of us have used SCSI
> instead of IDE (PIO IDE, not udma EIDE) because this was better.  As
> you pointed out, in the days of 386/486 CPUs, offloading to SCSI cards,
> network cards, video cards, was not only a good thing but required.
> Wouldn't that concept scale to modern systems but just allow us to go
> even faster?

No, because while main CPUs have gotten faster, most hardware RAID
controllers have not kept pace and are now slowing overall system
performance down.  They save CPU cycles by offloading work *AND* by
slowing the whole system down with higher latencies and lower
throughputs.

> I wonder if you are saying that:
>
> * Modern CPUs have CPU cycles to spare for most users and Ingo's SW
> RAID code is efficient and can utilize those cycles without much
> overall impact?

It does have an overall impact on CPU cycles, so I'm not saying that the
CPU impact of RAID5 is insignificant.  However, I am saying that it's
still faster than the hardware stuff, simply because the main CPU doesn't
have to wait as long to get the data.

> * hardware raid controllers don't have the same ratio of power compared
> to the host CPU as they used to?  Can a PII-450 running SW RAID
> outperform a hardware RAID card, for example?

Hands down.  Even while compiling a kernel, rendering a pic in povray,
and playing quake, software RAID on a PII-450 will stomp the typical
hardware RAID solution.

> I wonder if you'd guess as to what impact the SW RAID might have in a
> typical workstation or file server environment.  If I had a nicely
> configured system that was running along fine and I added SW RAID,
> would it be a noticeable drain on the CPU?
It doesn't present a noticeable drain on the CPU; it's just that software
RAID5 is somewhat slower than software RAID0.

> How about a dedicated fileserver with 1 root/boot disk, and 6 (3x3)?
> Would removing a "typical" or average quality/speed HW RAID solution
> and replacing it with software have much of an impact?

Most likely, it would have a significant impact.

> I think your comment about recommending SW RAID over hardware just
> sounded too good to be true for me based on past years' experiences.
> But then, this wouldn't be the first time Linux has broken commonsense
> ways of computing for something better :)

I won't guarantee that all configurations are a dream, and there are some
guidelines to getting good performance out of a software RAID, but in
general I've been able to get excellent results through some careful
design and testing during the build phase.  Here are some design issues
that I keep in mind (some of these seem obvious, but I'm listing them
anyway because I've seen them done wrong, and I'm going to save this for
future reference as a start to a SOFTWARE-RAID-TIPS document that someone
else can expand upon and fill out better later).

1.  Even though the software RAID can use partitions of different sizes,
having different-sized zones creates different performance in different
portions of the array and actually increases CPU overhead and memory
consumption.  Therefore, I try to keep all of my RAID partitions
*exactly* equal in size.  Sometimes this means wasting a few hundred MB
on a drive (such as when you have a 4.3GB and a 4.5GB drive you wish to
RAID together).  That's fine.  You can either ignore those few hundred
MB, or you can set them up as swap space, or a /boot partition, or
whatever.  What I don't recommend is placing any sort of highly active
filesystem on there.  Which leads me to my second point.

2.  Only put one RAID partition on any drive, and don't put any other
commonly used filesystems on there with it.
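The arithmetic behind point 1 can be sketched in a few lines of shell.
The sizes below are the hypothetical 4.3GB/4.5GB pair from above; the
script just sizes every RAID partition to the smallest drive and reports
what is left over on the bigger ones:

```shell
#!/bin/sh
# Point 1 as arithmetic: size every RAID partition to match the
# smallest drive.  Drive sizes (in MB) are the 4.3GB/4.5GB example;
# substitute your own.
sizes="4300 4500"

# Find the smallest drive; every RAID partition gets exactly this size.
min=""
for s in $sizes; do
  if [ -z "$min" ] || [ "$s" -lt "$min" ]; then min=$s; fi
done
echo "RAID partition size on every drive: ${min}MB"

# Whatever is left over can be ignored, or used as swap or /boot --
# just not as a highly active filesystem.
for s in $sizes; do
  left=$((s - min))
  echo "drive of ${s}MB: ${left}MB left over"
done
```

Run as-is, it reports a 4300MB RAID partition per drive and 200MB left
over on the larger drive.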
When you use a drive for RAID, make sure it's dedicated to that purpose.
One of the key benefits of software RAID striped across 6 disks is the
increased number of seeks per second that you can get: with 6 disks, you
can get 6 times as many seeks under good conditions.  When you make two
raid arrays, one using the front half of all six disks and the second
using the back half of all six disks, you hurt performance *more* than
you would by making one big array and just using it.  Part of the reason
is that with one big array, all the files on that array may reside in the
first portion of the disks, so seeking from any one file to another can
be a relatively short seek.  When you split the filesystems, then a read
of a file on each filesystem results in a half-disk seek to get from one
filesystem to the other.  These half-disk seeks will *destroy* your
overall effective performance.

3.  Before you start to create your RAID arrays, make sure that the
/boot directory is either on a non-raid / partition or on its own
partition.  LILO, for one, does not know how to deal with RAID arrays,
so wherever LILO expects to find your kernel (the vmlinuz file), that
directory must be on a non-raid partition.  In my case, I've created a
RAID array that's used as my / partition, so I had to have a stand-alone
/boot partition to store the kernel in.  For anyone who attempts this
exact setup, I *strongly* recommend that the /boot partition be at least
100MByte in size and that you actually put a minimal installation on the
/boot partition (as a bare minimum, copy the /etc /bin /sbin /lib
directories onto the /boot partition; in my case, I've also selected
certain stuff from /usr/bin, /usr/lib, and other directories to go over
there as well).
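As a sketch of that kind of layout (the device names and sizes here are
entirely hypothetical, not a description of any particular box):

```
/dev/sda1   ~100MB   ext2 (non-raid)  -> /boot: vmlinuz, LILO's files, plus
                                         a minimal rescue set (/etc /bin
                                         /sbin /lib and the raidtools)
/dev/sda2   rest     raid member      -> one slice of /dev/md0
(each of the other drives carries one equally-sized raid partition)
/dev/md0    RAID5 across those partitions -> mounted as /
```

The image= line in lilo.conf then points at the kernel on /dev/sda1, never
at anything living on the array.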
This way, if anything does happen to your raid array, you have the
raidtools installed on your /boot partition: you can boot the kernel with
the root= command line switch, boot into a working rescue set, and then
fix the raid array from there.  It's much faster and easier than rescue
disks, and since you have to have this partition for LILO anyway, you
might as well make effective use of it :)

4.  The RAID code in the stock 2.0.36 and 2.2.x kernels is not the best
RAID code there is.  There is better code located on ftp.kernel.org (and
its mirrors) in the directory /pub/linux/daemons/raid/alpha.  Don't let
the directory name fool you.  The RAID code there is anything but alpha;
it's what will be going into 2.3.  However, that patch includes the
beginnings of support for a Logical Volume Manager and a Transparent
mode.  Those two things are what earn it the label of alpha.  Just leave
them turned off and use the excellent RAID code found in there :)
However, please note that you will need to download the updated raidtools
that are also in that directory, and that your RAID partitions will not
work with the older RAID code.  You can't switch back and forth between
the new RAID code and the old RAID code that's in the stock kernel; you
have to pick one and stick with it.

I also recommend that you set the partition type on your RAID partitions
to type 0xfd and then, regardless of the array type you use, set the flag
"persistent-superblock 1" in your /etc/raidtab file (the /etc/raidtab is
where you define the raid array when you are first creating it).  Then
Ingo's new raid code will autodetect these raid partitions and start them
up for you automatically at each boot.  This is what makes a RAID5 /
partition easy to manage: no initrd or special boot magic is needed,
because the kernel autodetects the raid array before it tries to mount
the / partition, so the array is already available and the kernel simply
mounts it as /.

5.
Get a copy of bonnie or some other disk benchmarking program and have it
handy when you are creating your RAID array.  Once you are ready to
create the array, boot into single user mode, create the array, put a
filesystem on the array, then run bonnie on it.  I usually make three
runs of bonnie and average the results.  Make sure that the size of the
bonnie file is at least 1.5 to 2x the amount of RAM in your machine
(this is to make sure you are testing the disk performance and not the
performance of the buffer cache).  Then umount the raid partition, edit
the /etc/raidtab file to change the raid parameters, re-make the raid
array (you will have to force the raid code to do this, so ignore the
warnings it issues, because you know what you are doing :), and try this
again.

I'm telling people to do this because there appear to be "magical"
combinations of disk numbers, chunk sizes, and ext2 block sizes that
produce excellent numbers, and others that produce sucky numbers.  I
haven't even begun to look into this, but somewhere in the block device
driver layers there are combinations that work well and others that
don't.  Plan on experimenting some here to get the best results.

On the raid arrays, you can define anything from a 4K chunk size all the
way up to megabyte chunk sizes.  In general, I usually try these sizes:
16k, 32k, 64k, 128k.  If you have a huge number of disks in the array,
you may want to consider going lower, such as 8k and 4k.  The second
variable to consider is the ext2 block size.  When making the filesystem
on the array, you have three options for the ext2 block size: 1024 (the
default), 2048, or 4096.  In most cases, I've gotten the best results
with a setting of 4096, but this depends somewhat on your average disk
usage.  If you tend to have a lot of small files, or you tend to access
only small portions of those files, then the smaller sizes may work
better for you.
Since bonnie doesn't attempt to duplicate your access patterns, you have
to use your judgement on this one.  Regardless of the final ext2 block
size you use, the process of checking the disk array's performance
across chunk sizes vs. ext2 block sizes is important.

OK...that's a start on a quick guide to getting a fast software RAID
array under Linux.  Using these guidelines, I've got a RAID5 array that
will do 25MByte/s sustained on a lowly PII-266, and a RAID0 array that
will do 55MByte/s sustained on the same PII machine.  Both of those
numbers are CPU bound, BTW.  The disks will go faster; it's the Linux
block device layer keeping them at that speed.

-- 
Doug Ledford <dledford@redhat.com>
Opinions expressed are my own, but they should be everybody's.