Date: Fri, 19 Jun 1998 15:48:37 -0400 (EDT)
From: Simon Shapiro <shimon@simon-shapiro.org>
To: Chris Parry <laotzu@juniper.net>, freebsd-questions@FreeBSD.ORG, freebsd-SCSI@FreeBSD.ORG
Subject: RE: DPT support binaries - How to Setup
Message-ID: <XFMail.980619154837.shimon@simon-shapiro.org>
In-Reply-To: <Pine.NEB.3.96.980619083513.2503V-100000@leaf.juniper.net>
Chris, I hope you do not mind me forwarding this to FreeBSD-questions...

On 19-Jun-98 Chris Parry wrote:
>> Since I moved, my web server has been down. DPT drivers for FreeBSD are
>> an integral part of FreeBSD now. If you want to see the contents of my
>> ftp server use ftp://simon-shapiro.org/crash
>>
>> There is no dptmgr for FreeBSD yet. In the works.
>
> Excellent. Would you know how people are currently setting up RAID on
> DPT's in FreeBSD? I'm assuming something like having a dos partition,
> and doing the config there, and then just mounting the volume as sd0?

I have been asked this question many times, and have also seen much
misinformation in this regard, so here is a brief review.

The DPT controller creates and manages RAID arrays in a manner totally
transparent to the (ANY) operating system. Say you have 45 disk drives and
you attach them to one DPT controller (I have several ``customers'' who do
that; you need a DPT PM3334UDW, seven disk shelves, and a very large UPS).
Then you boot DOS from a floppy, take DPT install floppy number one (it
comes with the controller), put it in the floppy drive, type
``dptmgr/fw0'', and press the return key. After a short while, a
Windows-like application starts. You do not need Windows, DOS, or anything
else installed on the machine. Just boot DOS 6.22 or later (I use IBM
PC-DOS 7.0) from the floppy drive.

We will get to the review in a minute, but here are the steps to create a
very complex RAID subsystem, for ANY operating system, FreeBSD included.

For brevity, I will use the notation cXbYtZ to refer to disk drives. The
DPT controllers (PM3334 series) can have up to three SCSI busses attached
to the same controller. BTW, the correct name for a SCSI controller is
HBA, as in Host Bus Adapter. Let's say we have two controllers. The first
controller has one disk connected to the first channel, and the disk is
set up (via its jumpers) to be target 1. The second controller has one
disk connected to the third channel, set up to be target ID 15. In this
example, I will call the first disk c0b0t1. The second disk I will call
c1b2t15. OK?
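To make the notation concrete, here is a small Python sketch (my own
illustration, not part of any DPT or FreeBSD tool) that builds and takes
apart these names:

  import re

  def drive_name(controller, bus, target):
      """Build a cXbYtZ name from HBA number, channel (bus), and target ID."""
      return "c%db%dt%d" % (controller, bus, target)

  def parse_drive_name(name):
      """Split a cXbYtZ name back into (controller, bus, target)."""
      m = re.match(r"^c(\d+)b(\d+)t(\d+)$", name)
      if m is None:
          raise ValueError("not a cXbYtZ name: %s" % name)
      return tuple(int(x) for x in m.groups())

  # The two example disks from the text:
  print(drive_name(0, 0, 1))          # c0b0t1  - first HBA, first channel, target 1
  print(drive_name(1, 2, 15))         # c1b2t15 - second HBA, third channel, target 15
  print(parse_drive_name("c1b2t15"))  # (1, 2, 15)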
Now, back to our monster system. Step by step:

* Hook up everything. If you are not using DPT or DEC StorageWorks disk
  shelves, make SURE you bought the MOST expensive cables you can find.
  Ultra SCSI runs a 16-bit bus at 20MHz, using electrical signals similar
  to ISA/PCI busses. You do not expect your PC cards to work over 30 feet
  of sloppy, cheap cable, right? Do not expect SCSI to be any different.
  If you need long cables, or more than 5-6 drives per very short cable,
  get differential controllers and good shelves. Half the people who
  contact me with ``DPT problems'' have cheap cables and/or disk shelves.
  The other half are using Quantum Atlas drives that have old/bad/wrong
  firmware. About 1 in a hundred actually has a deficiency in the driver
  or the firmware. This is not bragging; these are facts. I am happy to
  help each and every one, but your system is yours, not mine, and you
  need to set it up correctly.

* As indicated above, boot DOS, insert DPT floppy 1, and start the dptmgr
  with the option fw0. I will tell you later why.

* If this is the very first time these disk drives are connected to the
  DPT controller, DPTMGR will invite you to choose an O/S. Say Linux, or
  Other. Do NOT say BSDi. This is important.

* In this example, we will assume a pair of DPT PM3334UDW (Ultra-wide,
  differential, with 3 busses) HBAs and 84 identical 4GB drives. We will
  assume that the drives are installed in DPT shelves, 7 drives per shelf.
  Each channel in each controller has 2 shelves attached to it; 2 shelves
  contain 14 drives, for a total of 6 shelves per controller, or 42 drives
  per controller. The shelves have to be configured for the starting
  TARGET ID of the shelf. One shelf has targets 0-6, the second shelf has
  targets 8-15. The DPT HBA has target ID 7 on all three busses.

* We want to arrange the drives in this manner:

  1 RAID-1 array to be used for booting the system, most installed file
    systems, etc. This gives us full redundancy, and we can still READ
    faster than a single disk and WRITE almost as fast. 4GB will be
    enough, so this will consume exactly two drives.

  1 RAID-0 array to be used for swap, /tmp, /var/tmp, /usr/obj, etc.
    RAID-0 is very fast, but if any disk in the array fails, the whole
    array loses its data. For the indicated use, this is acceptable to us
    (remember, this is just an example). We do not need an awful lot of
    space, but we need the speed of at least 15MB/Sec, so we will use 6
    drives here.

  1 Huge RAID-0 array to contain news articles. Again, we do not care if
    we lose the articles. This array needs to be big and as fast as
    possible. We will use 33 disk drives here.

  1 Huge RAID-5 array to contain our E-mail. We need reliability and
    capacity, so we will use 33 drives here. In reality, RAID-5 arrays
    are not so effective at this size, but this is just an exaggerated
    example.

  1 Very large RAID-5 array to contain our CVS tree. Since reliability
    and performance are important, we will use 8 drives here.

  If you add up the drives, you will see that we have two drives
  unassigned here. Hold on...

* Before we go on to configure/create any RAID arrays, here is a bit of
  madness: the order in which the BIOS finds adaptors (controllers) on
  the PCI bus, and the order in which the SAME BIOS boots, and/or FreeBSD
  scans the PCI bus, can be reversed on some motherboards. What that
  means is that what we call c0bYtZ in the context of DPTMGR may actually
  be c1bYtZ as far as Unix is concerned. In ALL my systems things are
  reversed: what the DPT calls HBA 0 is actually HBA 1 for Unix.

* Make sure the DPT sees all your devices, without errors. If you use DPT
  disk shelves and see a cabling error, unplug, correct, and use
  File->Read System Configuration to force the DPTMGR to re-scan the
  busses. No need to reboot.

* The next step is to define the role of each drive in the system. Drives
  can be themselves, part of a RAID array, or Hot Spares. Drives that are
  Hot Spares or part of a RAID array are invisible TO THE O/S. Again:
  there is no way for the O/S to see a drive that is part of a RAID
  array, or a hot spare.

* In defining RAID arrays, DPTMGR asks you for a stripe size. Unless you
  have a specific reason to override the default stripe size, leave it
  alone. Chances are the DPT people who write the firmware know their
  hardware, SCSI, and RAID theory better than you do.

* Using the DPTMGR utility, we create a RAID-1 array using c1b0t0 and
  c1b1t0. Use File->Set System Parameters to save the configuration and
  have the DPT start building the array. When you are done defining the
  array, its icon will have a black flag. While it builds, the array icon
  will be blue, and the drive icons will be white.
* While the array builds, double-click on it, click on the Name button,
  and type in an appropriate name, for example ``Mad_Boot-1'' to remind
  yourself this is the Mad system, the Boot ``disk'' (more on that
  later), and it is a RAID-1. Choose File->Set System Parameters to save
  the new name.

* Double-click on c1b2t0 and click on Make Hot Spare. This will make this
  drive invisible to Unix, but will allow the DPT to automatically
  replace any defective drive with the hot spare. We will talk about that
  some more later.

* Start creating a RAID-0 array. Add devices to this array in this order:
  c1b0t1, c1b1t1, c1b2t1, c1b0t2, c1b1t2, c1b2t2, c1b0t3, c1b1t3,
  c1b2t3... The idea is to specify the drives alternating between busses.
  This gives the DPT the opportunity to load-share the busses (see the
  sketch after this list). Performance gains are impressive. When you are
  done, File->Set System Parameters. Do not forget to change the array
  name to ``Mad-News-0''.

* Do the same with the remaining arrays on c0. Remember to designate a
  hot spare, to alternate the drives as you add them to an array, and to
  File->Set System Parameters.

* The theory says that you could now shut your system down and install
  Unix on it. Not so fast: while the arrays are building, they are NOT
  available to you. Current firmware (7M0) will show the arrays to the
  O/S with a size of zero or one sector. FreeBSD will crash (panic) in
  response to that. Leave the system alone until it is all done.
  Handling failures on arrays that are already built is totally
  different. See below.

* If you follow my example, when you re-boot the system, BIOS, DOS,
  Windows, Linux, FreeBSD will only see FIVE disk drives. What happened
  to the other 79 drives?!!! Listen carefully: every RAID array appears
  to the O/S as ONE DISK DRIVE. Hot spares are TOTALLY INVISIBLE to the
  O/S. Since we defined 5 RAID arrays and two hot spares (one per HBA;
  Hot Spares cannot cross HBA lines), all the system gets to see is FIVE
  DRIVES. If you look at the drive model, it will show the array name you
  chose when setting up. The revision level for the ``drive'' is DPTxYz,
  where xYz is the firmware version. Currently, there is no way to get to
  the underlying drives in FreeBSD.
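As promised in the RAID-0 step above, here is a tiny Python sketch of the
``alternate between busses'' ordering. It is purely illustrative (the
function is mine, not part of DPTMGR or FreeBSD); it only prints the order
in which you would add the drives:

  def interleaved_order(controller, busses, targets):
      """List drives target-by-target, rotating across busses each time."""
      order = []
      for target in targets:
          for bus in busses:
              order.append("c%db%dt%d" % (controller, bus, target))
      return order

  # First nine devices of the news array on controller 1, targets 1-3,
  # spread over busses 0, 1 and 2:
  print(interleaved_order(1, busses=[0, 1, 2], targets=[1, 2, 3]))
  # -> ['c1b0t1', 'c1b1t1', 'c1b2t1', 'c1b0t2', 'c1b1t2', 'c1b2t2',
  #     'c1b0t3', 'c1b1t3', 'c1b2t3']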
Operation:

Once the arrays have completed building (the blue flags will be gone, and
if you double-click on the array icon, its Status field will say
``Optimal''), go install whatever O/S, using whatever method you choose.
Please beware that some versions of FreeBSD barf on disks with a capacity
of 20GB or larger, so they may barf on filesystems this huge. This seems
to be an ``attribute'' of sysinstall, not of the standard fdisk,
disklabel, and newfs. You may choose to only install and configure the
boot disk, as this one appears to the O/S as a simple 4GB disk (if you
used 4GB drives to define it).

Failures:

What does the DPT do in case of disk failure? First, the FreeBSD O/S and
the DPT driver in the O/S have no clue about the discussion below, except
as explicitly indicated.

General: If you use DPT shelves, a failed drive (unless it is totally
dead) will turn its fault light on. The disk canister has a Fault light
that the DPT can turn on and off. In addition, the DPT controller will
beep. The beeping pattern will actually tell you which disk on which bus
has failed. If you use DPT (or DEC) shelves, simply pull out the bad drive
and plug in a new drive.

RAID-0: If any disk in a RAID-0 array fails, the whole array is flagged by
the DPT as ``Dead''. Any I/O to the array will immediately fail. If you
boot DOS/DPTMGR, the array will have a black flag on it. Your only option
is to delete the array and create a new one. Any data on the array will be
lost. Horrible? Not necessarily. If you use RAID-0 arrays for data you can
live without, you will be fine. With drives touting 800,000 hours MTBF, an
8-disk RAID-0 array will have an MTBF of about 5 years.
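As an aside, the usual back-of-the-envelope rule behind that kind of
number: a striped array with no redundancy is down as soon as ANY one
drive is down, so its MTBF is roughly the single-drive MTBF divided by the
number of drives. A rough sketch (my arithmetic, not a DPT figure; note
that the on-paper result is rosier than the ~5 years above, which allows
for real-world conditions):

  drive_mtbf_hours = 800000       # vendor figure quoted above
  drives_in_array  = 8

  # No redundancy: the array fails when any drive fails, so divide by N.
  # Datasheet MTBF numbers are optimistic; real installations do worse.
  array_mtbf_hours = drive_mtbf_hours / float(drives_in_array)
  print(array_mtbf_hours)                  # 100000.0 hours
  print(array_mtbf_hours / (24 * 365))     # roughly 11.4 years on paper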
RAID-1/5: If a drive fails, the RAID array will go into degraded mode
(yellow flag in dptmgr). If you have a Hot Spare connected to the DPT, it
will automatically make the hot spare a new member of the degraded array
and start rebuilding the array onto that drive (the one that was the Hot
Spare). If you replace the dead drive with a good one, the newly inserted
drive will be recognized by the DPT and made a new Hot Spare. The new Hot
Spare will not be available until the rebuilding array has completed its
re-build.

Important: The degraded array is available for I/O while in degraded or
constructing mode. This has been verified more than once, and actually
works. However:

* RAID-5 in degraded mode is very slow.

* Array rebuild in RAID-5 is done by reading all the good drives,
  computing the missing data, and writing it to the new/replacement
  drive. This is not exactly fast. It is downright SLOW.

* RAID-1 arrays rebuild by copying the entire good disk onto the
  replacement disk. This sucks all the bandwidth off the SCSI bus.

* If there is another error while the arrays are in degraded/rebuild
  mode, you are hosed. There is no redundant data left and you may lose
  your data. If possible and/or practical, do not WRITE to a degraded
  array. Back it up instead.

Common Failure: I cannot overstate how common this failure scenario is:
one uses cheap shelves (disk cabinets), cheap cables, and marginal drives,
and the following happens: a certain drive hangs, or goes on the fritz, or
the SCSI bus stalls. The DPT makes a valiant effort to reset the bus, but
at least one drive is out cold. The DPT raises the alarm and drafts a Hot
Spare into service. It then starts rebuilding the array. Since rebuilding
the array is very I/O intensive, another drive goes on the fritz. Now the
array is DEAD as far as the DPT is concerned. At this point the operator
notices the problem, shuts the system down, reboots into the DPTMGR, and
comes out saying ``Nothing is WRONG!'' The operator sees red flags on the
drives, if lucky runs the DPT OPTIMAL.EXE utility (NOT available from
me!), runs DOS-based diagnostics, and starts his/her news server again.
Within minutes/hours/days, the whole scenario repeats.

What happened? Under DOS (dptmgr is no exception), I/O is polled and
sedately slow. Failures are rare and recovery almost certain. Under
FreeBSD, a news server can peak at well over 1,000 disk I/Os per second.
This is when marginal systems break. This is the most frustrating scenario
for me:

* The problem is NOT an O/S problem (FreeBSD simply pushes as much I/O
  into the driver as it can; it knows nothing of RAID arrays. The RAID
  array is simply a disk).

* It is not a driver problem; the driver does not know a RAID array from
  a cucumber. It simply receives SCSI commands and passes them along. It
  never looks inside to even see what command is passing through.

* The DPT firmware is not at fault either; it simply pushes commands to
  the drives according to the ANSI spec.

So, who is at fault? Typically, the user who buys a $3,000.00 disk
controller, attaches it to $20,000 worth of disk drives, and connects it
all with a $5.00 cable. In some cases, the user elects to buy the cheapest
drives they can find, so as to reduce the $20,000 cost in disks to maybe
$15,000. Some of these drives simply cannot do I/O correctly, or cannot do
it correctly and quickly. Sometimes the problem is a combination of
marginal interconnect and marginal drives.

What to Do? If you have mission-critical data storage that you want to be
reliable and fast:

* Get a DPT controller.

* Get the ECC memory from DPT (yes, bitch them out on the price, and say
  that Simon said the price is absurdly high).

* Get the disk shelves from DPT (or get DEC StorageWorks).

* Get the DISKS from DPT (make sure they supply you with the ECC-ready
  disks). You will pay about $100.00 per drive extra, but will get the
  carrier for free, so the total cost is just about the same.

* Get the cables from DPT, DEC, or Amphenol.

* SCSI cables are precision instruments. Keep the hammer away from them.

* Use only the version of firmware I recommend for the DPT. It is
  currently 7Li, not the newer 7M0.

I have done the above, and on that day my I/O errors disappeared. I run a
total of over 60 disk drives on DPTs, some in very stressed environments,
most in mission-critical environments. Every failure we have had to date
is a direct result of violating these rules.

What is the ECC option? The ECC option comprises special ECC SIMMs for the
DPT cache, proper cabling, and proper disk shelves (cabinets, enclosures).
Using this option, the DPT guarantees that the data recorded to a device,
or read from a device, goes through a complete ECC data path. Any small
errors in the data are transparently corrected. Large errors are detected
and alarmed.

How does it work?

* When data arrives from the host into the controller, it is put into the
  cache memory, and a 16-byte ECC is computed on every 512-byte
  ``sector''.

* The disk drive is formatted to 528 bytes/sector, instead of the normal
  512. Please note that not every disk can do that. Sometimes it is
  simply a matter of having the proper firmware on the disk; sometimes
  the disk has to be different.

* When the DPT writes the sector to disk, it writes the entire 528 bytes.
  When it READs the sector, it reads the entire 528 bytes, performs the
  ECC check/correction, and puts the data in the cache memory.

All disk drives have either CRC or ECC, so what's the big deal? Disk
drives use ECC (or some such) to make sure that what came into their
possession was recorded correctly on the disk, or that the recorded data
is read back correctly. The disk drive still has no clue whether the data
it receives from the initiator (HBA) is correct, and when a disk sends
data to the HBA, it does not know what arrives at the host. Yes, the SCSI
bus supports parity. But parity cannot correct errors, and will totally
miss an even number of bad bits. Yes, that happens easily with bad cables.
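Since the ``even number of bad bits'' point surprises people, here is a
short Python illustration (mine, not from DPT or the SCSI spec) of
parity's blind spot:

  def parity(byte):
      """Return the even-parity bit of an 8-bit value."""
      return bin(byte).count("1") % 2

  original = 0b10110010
  one_bit  = original ^ 0b00000001    # one bit corrupted
  two_bits = original ^ 0b00010001    # two bits corrupted

  print(parity(original), parity(one_bit), parity(two_bits))
  # -> 0 1 0 : the single-bit error changes the parity and is detected;
  #    the double-bit error leaves parity unchanged and sails right through.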
So, what is it good for? Aside from peace of mind, there is one important
use for this: hot-plug drives. Let me explain. The SCSI bus (the ribbon
cable genre, not FCAL) was never designed to sustain hot insertion. If you
add or remove a device on the bus, you will invariably glitch it. These
glitches will find themselves doing one of two things:

a. Corrupt some handshake signal. This is typically easily detected by
   the devices and corrected via re-try.

b. Corrupt some data which is in transfer. This is the more common case.
   If the corruption goes undetected, something will go wrong much later
   and will typically be blamed on software.

The ECC option goes hand in hand with the cabinet/disk-canister. The
cabinet/canister combination makes sure that minimal disruption appears on
the bus, by using special circuitry. The ECC option complements it by
making sure that the entire data path, from the CPU to the disk and back,
is monitored for quality, and in most cases automatically repaired.

What other wonders does a DPT controller perform? If you expect it to
actually write code for you, sorry. It does not. But, if you use the
correct enclosure, it will tell you about P/S failures, fan failures,
overheating, and internal failures. A near-future release of the driver
will even allow you to get these alarms into user space. You can then
write a shell/perl/whatever script/program to take some action when a
failure occurs in your disk system.
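That user-space alarm interface is not released yet, so the following is
only a sketch of the idea. It assumes, purely for illustration, that the
alarms will eventually show up as kernel log lines mentioning ``dpt'' in
/var/log/messages; adjust it to whatever interface the driver actually
ends up providing:

  # Hypothetical watcher: poll the system log for DPT-related lines and
  # report each new one.  Replace the print() with mail, a pager call,
  # or whatever fits your site.
  import time

  seen = set()

  def check_dpt_alarms(logfile="/var/log/messages"):
      """Print any not-yet-seen log line that mentions the dpt driver."""
      with open(logfile, errors="replace") as f:
          for line in f:
              if "dpt" in line and line not in seen:
                  seen.add(line)
                  print("DPT alarm:", line.strip())

  while True:
      check_dpt_alarms()
      time.sleep(60)          # poll once a minute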
What performance can I expect out of a DPT system? It depends. Caching
systems are slower in sequential access than individual disks; things like
Bonnie will not show what the DPT can really do. RAID-1 is slightly slower
than a single disk in WRITE and can be slightly faster in READ operations.
RAID-5 is considerably slower in WRITE and slightly faster in READ. RAID-0
is fast but very fragile. The main difference is in the disk subsystem's
reaction to increasing load and its handling of failures. A single disk is
a single point of failure. A DPT, correctly configured, will present
``perfect/non-stop'' disks to the O/S. In terms of load handling, the DPT
controller has the lowest load per operation in a FreeBSD system. In terms
of interrupts per operation, number of disks per PCI slot, size of file
systems, number of CPU cycles per logical operation, and handling of heavy
disk loads, it is probably the best option available to FreeBSD today.

In terms of RAW I/O, doing random read/write to large disk partitions,
these are the numbers:

RAID-0: Approximately 1,930 disk operations per second; about 18-21
        MB/Sec.
RAID-1: About 6.5 MB/Sec WRITE, about 8-14 MB/Sec READ; the wide range
        stems from an increasing cache hit ratio.
RAID-5: About 5.5 MB/Sec WRITE, 8.5 MB/Sec READ.

These are optimal numbers, derived from large arrays and a PM3334UDW with
64MB of ECC cache. To achieve these numbers you have to have at least 500
processes reading and writing continually to random areas on the disk.
This translates to a Load Average of 150-980, depending on the array type,
etc. From daily use, I'd say that until your disk I/O reaches 300 I/O
ops/sec, you will not feel the load at all.

I hope the above answers some of the most common questions about the DPT
controller and its interaction with FreeBSD. If you have some more, let me
know.

Simon