Date: Thu, 25 Feb 1999 12:05:57 -0500
From: Doug Ledford <dledford@redhat.com>
To: Maxwell Spangler <maxwell@clark.net>
Cc: AIC7xxx@FreeBSD.ORG
Subject: Re: Adaptec 7890 and RAID portIII RAID controller Linux Support
Message-ID: <36D582F5.AC7405E3@redhat.com>
References: <Pine.LNX.4.04.9902241451370.23862-100000@maxwax.doghouse.com>
Maxwell Spangler wrote:
>
> On Mon, 11 Jan 1999, Doug Ledford wrote:
>
> > To sum up my impressions, hardware RAID is a waste of money.  It
> > doesn't buy speed any more (it used to when a hot server was a 486/33
> > and you had an i960 chip on the RAID controller).  The newest RAID5
> > and RAID1 code from Ingo Molnar is *quite* reliable and pretty much
> > on par with what you would get in a hardware raid array.  The real
> > reason for raid used to be reliability in the face of failure.  Any
> > more, with as reliable as the software has gotten, I consider the
> > hardware raid arrays simply another possible point of failure.  I
> > would go software raid if I were you.
>
> But isn't offloading processing of any sort to a specialised chip or
> device a good thing?  (Considering modern day hardware, not older stuff)
>
> For example: (Completely fictional comparison example)
>
> A PII-233 performing software OpenGL can produce 500 3D video
> operations in one second, but it takes 30% of the CPU's processing time
> to do so.
>
> A second PII-233 performing the same task with the assistance of a
> hardware 3D device can perform the same number of operations, but
> reduce the amount of CPU processing time to 5%.

If that were the case, then I would agree.  But in the case of software
RAID, my experience has been that this isn't true.  It's more like this:
a PII-233 with a hardware RAID controller might do 150 I/O ops per second
and 15MByte/s of data transfer while only using 30% of the CPU.  Another
PII-233 with software RAID might do 400 I/O ops per second and 30MByte/s
of data transfer, but use 75% of the CPU.  Keep in mind, though, that the
increased throughput adds dramatically to the CPU usage all by itself,
before you even count the RAID5 parity work being done in software.
So, the end result is that it does cost some CPU processing power, but
the average CPU running Ingo's software RAID is *SOOOOO* much faster than
the typical hardware RAID that you come out way ahead in the long run
(what good is an idle CPU if the reason it's idle is that your RAID
controller is slow?).

> Case #2 would be better, and for years a lot of us have used SCSI
> instead of IDE (PIO IDE, not udma EIDE) because this was better.  As
> you pointed out, in the days of 386/486 CPUs, offloading to SCSI cards,
> network cards, video cards, was not only a good thing but required.
> Wouldn't that concept scale to modern systems but just allow us to go
> even faster?

No, because while main CPUs have gotten faster, most hardware RAID
controllers have not kept pace and are now slowing overall system
performance down.  They save CPU cycles by offloading work *AND* by
slowing the whole system down with higher latencies and lower
throughputs.

> I wonder if you are saying that:
>
> * Modern CPUs have CPU cycles to spare for most users and Ingo's SW
> RAID code is efficient and can utilize those cycles without much
> overall impact?

It does have an overall impact on CPU cycles, so I'm not saying that the
CPU impact of RAID5 is insignificant.  However, I am saying that it's
still faster than the hardware stuff, simply because the main CPU doesn't
have to wait as long to get the data.

> * hardware raid controllers don't have the same ratio of power compared
> to the host CPU as they used to?  Can a PII-450 running SW RAID
> outperform a hardware RAID card, for example?

Hands down.  Even while compiling a kernel, rendering a pic in povray,
and playing quake, software RAID on a PII-450 will stomp the typical
hardware RAID solution.

> I wonder if you'd guess as to what impact the SW RAID might have in a
> typical workstation or file server environment.  If I had a nicely
> configured system that was running along fine and I added SW RAID,
> would it be a noticeable drain on the CPU?
It doesn't present a noticeable drain on the CPU; it's just that software
RAID5 is somewhat slower than software RAID0.

> How about a dedicated fileserver with 1 root/boot disk, and 6 (3x3)?
> Would removing a "typical" or average quality/speed HW RAID solution
> and replacing it with software have much of an impact?

Most likely, it would have a significant impact.

> I think your comment about recommending SW RAID over hardware just
> sounded too good to be true for me based on past years' experiences.
> But then, this wouldn't be the first time Linux has broken commonsense
> ways of computing for something better :)

I won't guarantee that all configurations are a dream, and there are some
guidelines to getting good performance out of a software RAID, but in
general I've been able to get excellent results through some careful
design and testing during the build phase.  Here are some design issues
that I keep in mind (some of these seem obvious, but I'm listing them
anyway because I've seen them done wrong, and I'm going to save this for
future reference as a start to a SOFTWARE-RAID-TIPS document that someone
else can expand upon and fill out better later).

1.  Even though the software RAID can use partitions of different sizes,
having different-sized zones creates different performance in different
portions of the array and actually increases CPU overhead and memory
consumption.  Therefore, I try to keep all of my RAID partitions
*exactly* equal in size.  Sometimes this means wasting a few hundred MB
on a drive (such as when you have a 4.3GB and a 4.5GB drive you wish to
RAID together).  That's fine.  You can either ignore those few hundred
MB, or you can set them up as swap space, or a /boot partition, or
whatever.  What I don't recommend is placing any sort of highly active
filesystem on there.  Which leads me to my second point.

2.  Only put one RAID partition on any drive, and don't put any other
commonly used filesystems on there with it.
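The arithmetic behind point 1 can be sketched in a few lines of shell.
The sizes below are the hypothetical 4.3GB/4.5GB pair from above; the
script just sizes every RAID partition to the smallest drive and reports
what is left over on the bigger ones:

```shell
#!/bin/sh
# Point 1 as arithmetic: size every RAID partition to match the
# smallest drive.  Drive sizes (in MB) are the 4.3GB/4.5GB example;
# substitute your own.
sizes="4300 4500"

# Find the smallest drive; every RAID partition gets exactly this size.
min=""
for s in $sizes; do
  if [ -z "$min" ] || [ "$s" -lt "$min" ]; then min=$s; fi
done
echo "RAID partition size on every drive: ${min}MB"

# Whatever is left over can be ignored, or used as swap or /boot --
# just not as a highly active filesystem.
for s in $sizes; do
  left=$((s - min))
  echo "drive of ${s}MB: ${left}MB left over"
done
```

Run as-is, it reports a 4300MB RAID partition per drive and 200MB left
over on the larger drive.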
When you use a drive for RAID, make sure it's dedicated to that purpose.
One of the key benefits of software RAID striped across 6 disks is the
increased number of seeks per second that you can get: with 6 disks, you
can get 6 times as many seeks under good conditions.  When you make two
raid arrays, one using the front half of all six disks and the second
using the back half of all six disks, you hurt performance *more* than
you would by making one big array and just using it.  Part of the reason
is that with one big array, all the files on that array may reside in the
first portion of the disks, so seeking from any one file to another can
be a relatively short seek.  When you split the filesystems, then a read
of a file on each filesystem results in a half-disk seek to get from one
filesystem to the other.  These half-disk seeks will *destroy* your
overall effective performance.

3.  Before you start to create your RAID arrays, make sure that the
/boot directory is either on a non-raid / partition or on its own
partition.  LILO, for one, does not know how to deal with RAID arrays,
so wherever LILO expects to find your kernel (the vmlinuz file), that
directory must be on a non-raid partition.  In my case, I've created a
RAID array that's used as my / partition, so I had to have a stand-alone
/boot partition to store the kernel in.  For anyone who attempts this
exact setup, I *strongly* recommend that the /boot partition be at least
100MByte in size and that you actually put a minimal installation on the
/boot partition (as a bare minimum, copy the /etc /bin /sbin /lib
directories onto the /boot partition; in my case, I've also selected
certain stuff from /usr/bin, /usr/lib, and other directories to go over
there as well).
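As a sketch of that kind of layout (the device names and sizes here are
entirely hypothetical, not a description of any particular box):

```
/dev/sda1   ~100MB   ext2 (non-raid)  -> /boot: vmlinuz, LILO's files, plus
                                         a minimal rescue set (/etc /bin
                                         /sbin /lib and the raidtools)
/dev/sda2   rest     raid member      -> one slice of /dev/md0
(each of the other drives carries one equally-sized raid partition)
/dev/md0    RAID5 across those partitions -> mounted as /
```

The image= line in lilo.conf then points at the kernel on /dev/sda1, never
at anything living on the array.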
This way, if anything does happen to your raid array, you have the
raidtools installed on your /boot partition: you can boot the kernel with
the root= command line switch, boot into a working rescue set, and then
fix the raid array from there.  It's much faster and easier than rescue
disks, and since you have to have this partition for LILO anyway, you
might as well make effective use of it :)

4.  The RAID code in the stock 2.0.36 and 2.2.x kernels is not the best
RAID code there is.  There is better code located on ftp.kernel.org (and
its mirrors) in the directory /pub/linux/daemons/raid/alpha.  Don't let
the directory name fool you.  The RAID code there is anything but alpha;
it's what will be going into 2.3.  However, that patch includes the
beginnings of support for a Logical Volume Manager and a Transparent
mode.  Those two things are what earn it the label of alpha.  Just leave
them turned off and use the excellent RAID code found in there :)
However, please note that you will need to download the updated raidtools
that are also in that directory, and that your RAID partitions will not
work with the older RAID code.  You can't switch back and forth between
the new RAID code and the old RAID code that's in the stock kernel; you
have to pick one and stick with it.

I also recommend that you set the partition type on your RAID partitions
to type 0xfd and then, regardless of the array type you use, set the flag
"persistent-superblock 1" in your /etc/raidtab file (the /etc/raidtab is
where you define the raid array when you are first creating it).  Then
Ingo's new raid code will autodetect these raid partitions and start them
up for you automatically at each boot.  This is what makes a RAID5 /
partition easy to manage: no initrd or special boot magic is needed,
because the kernel autodetects the raid array before it tries to mount
the / partition, so the array is already available and the kernel simply
mounts it as /.

5.
Get a copy of bonnie or some other disk benchmarking program and have it
handy when you are creating your RAID array.  Once you are ready to
create the array, boot into single user mode, create the array, put a
filesystem on the array, then run bonnie on it.  I usually make three
runs of bonnie and average the results.  Make sure that the size of the
bonnie file is at least 1.5 to 2x the amount of RAM in your machine
(this is to make sure you are testing the disk performance and not the
performance of the buffer cache).  Then umount the raid partition, edit
the /etc/raidtab file to change the raid parameters, re-make the raid
array (you will have to force the raid code to do this, so ignore the
warnings it issues, because you know what you are doing :), and try this
again.

I'm telling people to do this because there appear to be "magical"
combinations of disk numbers, chunk sizes, and ext2 block sizes that
produce excellent numbers, and others that produce sucky numbers.  I
haven't even begun to look into this, but somewhere in the block device
driver layers there are combinations that work well and others that
don't.  Plan on experimenting some here to get the best results.

On the raid arrays, you can define anything from a 4K chunk size all the
way up to megabyte chunk sizes.  In general, I usually try these sizes:
16k, 32k, 64k, 128k.  If you have a huge number of disks in the array,
you may want to consider going lower, such as 8k and 4k.  The second
variable to consider is the ext2 block size.  When making the filesystem
on the array, you have three options for the ext2 block size: 1024 (the
default), 2048, or 4096.  In most cases, I've gotten the best results
with a setting of 4096, but this depends somewhat on your average disk
usage.  If you tend to have a lot of small files, or you tend to access
only small portions of those files, then the smaller sizes may work
better for you.
Since bonnie doesn't attempt to duplicate your access patterns, you have
to use your judgement on this one.  Regardless of the final ext2 block
size you use, the process of checking the disk array's performance
across chunk sizes vs. ext2 block sizes is important.

OK...that's a start on a quick guide to getting a fast software RAID
array under Linux.  Using these guidelines, I've got a RAID5 array that
will do 25MByte/s sustained on a lowly PII-266, and a RAID0 array that
will do 55MByte/s sustained on the same PII machine.  Both of those
numbers are CPU bound, BTW.  The disks will go faster; it's the Linux
block device layer keeping them at that speed.

-- 
Doug Ledford <dledford@redhat.com>
Opinions expressed are my own, but they should be everybody's.