Date: Tue, 26 Aug 1997 22:10:16 -0700 (PDT)
From: Simon Shapiro <Shimon@i-Connect.Net>
To: Jason Thorpe <thorpej@nas.nasa.gov>, FreeBSD-Hackers@FreeBSD.org
Subject: RE: DPT driver - Overview (very long)
Message-ID: <XFMail.970826221016.Shimon@i-Connect.Net>
In-Reply-To: <199708261726.KAA07028@lestat.nas.nasa.gov>
Folks, I am posting this reply so that all those who ask me about this type
of question will get a (slightly more) concise answer.

Hi Jason Thorpe;  On 26-Aug-97 you wrote:

> Hello Simon,
>
> I was going to embark on porting your DPT driver to NetBSD, but when I
> looked at it, it was a bit more complex-looking than I expected.
>
> I was wondering if you could give me a brief design overview?  I'm most
> interested in the use of soft interrupts, and would also like to know
> what the "signature" thing is all about...
>
> Thanks!

Let's start at the end :-)

The signature business:  This is a piece of ``black magic'' that identifies
the exact version and configuration of the DPT to a calling program.  ALL
the DPT controllers are essentially the same from an API point of view.
Some are more limited (ISA-EISA-PCI), some are slower (68000, 68020, 68030),
some have multiple busses, some do not, etc.

The next step is to understand what a DPT is.  It really is not an HBA
(Host Bus Adapter), although it likes to pretend to be one.  It is more
like an I/O channel controller on a mainframe: a computer with its own O/S,
memory, I/O (there is even a serial port on the card), etc.  In addition to
normal HBA functions, the DPT manages RAID arrays in a very clever and
complex way.  RAID arrays need configuration, dead disks need to be
replaced, replacements added, etc.  Also consider the need for (and ability
to) grow arrays, share arrays with other controllers on the same bus, etc.

Last, but not least, is the need to interact with external entities and
exchange configuration and status data; the DPT can tell you that the disk
cabinets are too warm, that a fan failed, that a disk is about to fail
soon, how well your disks are performing, etc.  To do all this you need a
user interface.  DPT has a standard user interface that we ported wholesale
to FreeBSD.  We added some minor, cute options along the way.  One of the
components in this API is the passing of ``signatures''.  If you port this
driver, just copy this stuff.  You really do not need it for normal
operation and it is pretty much O/S independent.

Now for the driver overview.  It starts life in a fairly normal way, by
doing the regular PCI registration shtick.  One piece that will catch your
eye is the attempt to switch the controller from I/O mapped registers to
memory mapped ones.  This is not functional yet and should be left alone.
There may be some performance advantage, and some bug-fixing quality, in
this.

One difference is already visible: a linked list of dpt_ccb, eata_ccb and
sp structures is allocated.  How many are allocated depends on what a given
DPT says about itself.  These ccbs (a dpt_ccb contains an eata_ccb and an
sp, among other things) are put into the FREE queue.  There is such a queue
for each controller found.

EATA:      Extended ATA.  The protocol by which we talk to the controller.
eata_ccb:  A data structure, understood by the DPT, which contains detailed
           instructions on what to do, including the SCSI command itself.
sp:        A Status Packet.  Filled in by the DPT when a command is
           complete.
dpt_ccb:   A container to hold ALL the relevant data about a transaction,
           including an eata_ccb and an sp.

Since a DPT can have up to three busses per controller, the dpt_softc
structure contains an array of scsi_link structures, one scsi_link per bus.

Flow:  The upper kernel sends a SCSI command to dpt_scsi_cmd.  We look at
it, tweak some of the data, pop a dpt_ccb from the free queue, populate the
eata_ccb and push the request onto the WAITING queue.
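Before going further, here is a rough model of how the pieces just
described fit together.  This is a sketch for illustration only; the field
names, and the use of the <sys/queue.h> macros, are assumptions made for
this mail, not the actual layout in the driver sources.

/*
 * Illustrative model only.  Field names and the use of <sys/queue.h>
 * macros are assumptions for this mail; the real structures in the
 * driver sources are richer and laid out differently.
 */
#include <sys/types.h>
#include <sys/queue.h>

struct eata_ccb {                   /* DMAed to the DPT; big-endian fields */
        u_int8_t   cdb[12];         /* the SCSI command itself             */
        u_int32_t  vp_addr;         /* the "signature": the (htonl'd)
                                     * address of the owning dpt_ccb       */
        /* ... scatter/gather list, target/lun, flags, etc. ...            */
};

struct sp {                         /* Status Packet, written by the DPT   */
        u_int32_t  vp_addr;         /* echoes the signature of the command
                                     * that just completed                 */
        u_int8_t   hba_status;      /* zero means success in this model    */
        u_int8_t   scsi_status;
};

struct dpt_ccb {                    /* host-side transaction container     */
        struct eata_ccb       eata;
        struct sp             status;
        struct scsi_xfer     *xs;   /* the upper layer's request           */
        int                   retries;
        TAILQ_ENTRY(dpt_ccb)  links; /* a ccb sits on exactly one queue    */
};

TAILQ_HEAD(dpt_ccb_queue, dpt_ccb);

struct dpt_softc {                  /* one of these per controller         */
        struct dpt_ccb_queue  free_ccbs;      /* allocated at attach time  */
        struct dpt_ccb_queue  waiting_ccbs;   /* accepted, not yet sent    */
        struct dpt_ccb_queue  submitted_ccbs; /* sent, awaiting the DPT    */
        struct dpt_ccb_queue  completed_ccbs; /* done, awaiting callback   */
        int                   submitted_count; /* how busy the DPT is      */
        /* ... one scsi_link per bus (up to three), register mappings, ... */
};

A ccb migrates free -> waiting -> submitted -> completed and back to free;
the counts tell us at any moment how loaded the controller is, which is how
we know when not to bother sending it more work.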
One important piece to remember is that we put in the eata_ccb a unique
signature, used later to identify the completed command.  Remember!  The
DPT talks big-endian, so htonl is very critical there.

Once we are ready to submit the command to the DPT, we do not!  Instead, we
call dpt_sched_queue.  What this function does is SCHEDULE a software
interrupt to happen sometime later.  Once we have scheduled the interrupt,
we return, telling the upper layer that the request was scheduled.

At some time in the future (the details evade me, but saying that we have
to be at spl0 is probably not a complete lie), the software interrupt will
occur.  It will actually call the function dpt_sintr.  Dpt_sintr runs at an
spl level unique to it, and different from the spl level of the regular
interrupt routine dpt_intr.  Unfortunately, there is no way I know of to
pass any arguments to dpt_sintr.  So, dpt_sintr tests the COMPLETED queues
of all the controllers.  The completed queue contains transactions
completed by the DPT that have not yet been called back to the upper
layers.  We will see shortly how they got there.

Any command found in the completed queue is disposed of.  If it was
successful, we call scsi_done.  If not, and a re-try is indicated, we put
the command back in the waiting queue.  This time it goes in the front, not
the back, of the queue, and we schedule a software interrupt (to pick the
command up from the waiting queue).  If the command utterly failed, or
succeeded, we push the dpt_ccb back into the free queue.  You can control
(with DPT_FREELIST_IS_STACK) where it goes.  Stack will put it in the
front; otherwise it goes in the back.  Putting it in the front increases
the chance of an L2 cache hit.  It also increases the chance of weird race
conditions.

How do we send a command to the DPT?  We load into a register set the
address of the eata_ccb and write into the command port a command to ``go
and do it''.  That's all.  The DPT will DMA the EATA ccb, do the
scatter/gather (did I mention that we build the list in dpt_scsi_cmd?),
etc.  Before we actually send the command down to the DPT, we pop it off
the waiting queue.  Once we have sent it successfully, we push it onto the
submitted queue.  If we failed to send it, we push it back onto the waiting
queue and...  Right!  We generate a software interrupt.  If we know that
the DPT is too busy (we always know how many commands are in which queue),
we do not even try to send it anything.

What happens at command completion?  Before we answer this, remember that
the DPT can take (with standard firmware) up to 64 concurrent commands, so
the pushing process continues until we have the DPT's incoming queue full,
or until it tells us ``no more''.  When the DPT completes a command, it
does the following, in this order:

 * DMA all the data in/out, whatever.
 * Fill out the different SCSI state/reply structures.
 * Fill an sp, general to the particular controller, with a result code and
   the unique signature which identifies the command just completed.
   (Remember, we filled this one out in dpt_scsi_cmd.)
 * Write to the status registers the fact that a command completed.
 * Generate a hardware interrupt.
 * Ignore us until we tell it we have serviced this interrupt.

We receive the interrupt in dpt_intr.  What we basically do there is
examine the signature (with varying degrees of pedanticity).  If we do not
like it, it is an aborted interrupt.  Remember these!  If we like it, we
treat it as the address of the dpt_ccb which contains the transaction.  A
rough sketch of the interrupt path and the software-interrupt dispatch
follows.
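Here is a minimal sketch of that path, hardware interrupt on one side and
software interrupt on the other, reusing the hypothetical structures
sketched earlier.  It only illustrates the queue movements described above;
the real dpt_intr and dpt_sintr also do register handshaking,
aborted-interrupt accounting, waiting-queue draining and spl juggling, and
the helpers marked hypothetical below do not exist under these names.

/*
 * Illustrative model only -- shows the hardware/software interrupt
 * split described above, not the driver's actual code.
 */

/* Hardware interrupt: do as little as possible, then defer the rest. */
void
dpt_intr_model(struct dpt_softc *dpt)
{
        struct sp      *sp = dpt_read_status_packet(dpt); /* hypothetical */
        struct dpt_ccb *ccb;

        /*
         * The signature the DPT echoes back is the (htonl'd) address of
         * the dpt_ccb whose eata_ccb we handed it in dpt_scsi_cmd.  If
         * it does not check out, this is an aborted interrupt.
         */
        ccb = (struct dpt_ccb *)(u_long)ntohl(sp->vp_addr);
        if (!dpt_ccb_looks_valid(dpt, ccb))     /* hypothetical check */
                return;

        ccb->status = *sp;                      /* copy what needs copying */

        /* submitted -> completed; the callback happens later, in the SWI */
        TAILQ_REMOVE(&dpt->submitted_ccbs, ccb, links);
        dpt->submitted_count--;
        TAILQ_INSERT_TAIL(&dpt->completed_ccbs, ccb, links);

        dpt_ack_interrupt(dpt); /* hypothetical: tell the DPT we saw it */
        dpt_sched_queue(dpt);   /* schedule the software interrupt      */
}

/* Software interrupt: dispose of everything in the completed queue. */
void
dpt_swi_model(struct dpt_softc *dpt)
{
        struct dpt_ccb *ccb;

        while ((ccb = TAILQ_FIRST(&dpt->completed_ccbs)) != NULL) {
                TAILQ_REMOVE(&dpt->completed_ccbs, ccb, links);

                if (ccb->status.hba_status != 0 && ccb->retries-- > 0) {
                        /*
                         * Retry: front of the waiting queue, then another
                         * software interrupt to push it back to the DPT.
                         */
                        TAILQ_INSERT_HEAD(&dpt->waiting_ccbs, ccb, links);
                        dpt_sched_queue(dpt);
                        continue;
                }

                scsi_done(ccb->xs);     /* report back to the upper layer */

                /*
                 * Succeeded or utterly failed: back to the free list.
                 * DPT_FREELIST_IS_STACK picks front (cache-warm) or back.
                 */
#ifdef DPT_FREELIST_IS_STACK
                TAILQ_INSERT_HEAD(&dpt->free_ccbs, ccb, links);
#else
                TAILQ_INSERT_TAIL(&dpt->free_ccbs, ccb, links);
#endif
        }

        /*
         * The real dpt_sintr also drains the waiting queue here, popping
         * ccbs and handing their eata_ccb addresses to the controller
         * until its incoming queue is full or it says ``no more''.
         */
}

The point of the split is visible right there: the hardware interrupt only
moves a ccb between queues and acknowledges the controller, while the
upper-layer callbacks and any resubmission happen later, at the software
interrupt's own spl, concurrently with new requests being scheduled.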
In dpt_intr we copy into the dpt_ccb some data that needs copying, take the
command off the submitted queue and put it in the completed queue.  Now
what do we do?  Right.  We schedule a software interrupt.  What does this
interrupt do?  Scan the completed queue, scan the waiting queue, etc.

Observations:

1. Why bother?  The DPT driver for FreeBSD is part of a non-stop (I hate to
   say Fault Tolerant - marketing abused the term and it is an invitation
   to big arguments) transaction processor.  What we wanted is a system
   where we can push many I/O requests into the kernel and never have a
   user context block on hardware I/O.  We also wanted to be able to
   schedule new requests concurrently with processing interrupts from the
   hardware.

2. Is it really faster?  No, not really.  If you examine the response time,
   on a slow processor, to a single thread of I/O requests, it is actually
   slower (there WAS an option to compile the driver without all this SWI
   stuff).  The system shines when you examine two things: how well does it
   behave under extremely heavy I/O load, and how many I/O operations per
   second can we process?  We needed about 800 of the latter for our
   application.  I measured 1,432 today, with a one-second peak of 1,560.
   I suspect I simply do not know how to push enough transactions into the
   driver.  When you examine the system behavior under load, it is also
   very good.  Our standard test rig includes running at least 256, or even
   512, user processes, each in a tight loop doing RANDOM I/O on either a
   tiny disk (so we can have ALL of the disk in the DPT cache) or a huge
   disk (so our cache hit rate will drop as close to zero as possible).
   Under this load, the load average is about 50-80, top still runs every
   second, the keyboard is totally responsive and network packets still
   arrive at a rate of well over 1,000/sec.  We have seen a single P6-200
   do over 6,000 interrupts/sec.

3. Are you really this smart?  No, I am not.  Although, in the context of
   FreeBSD, I may be able to take the blame for the concept, in the context
   of a SCSI interface, anyone familiar with the BSD kernel networking code
   will immediately recognize what is going on here.  Another name to
   mention here, who has been invaluable in getting this done, is Justin
   Gibbs.  Probably one of the finest programmers I have ever seen.

4. Where do we go from here?

   a. Complete the /dev/dpt interface.  It needs testing and debugging.
   b. Integrate the DLM support, so more than one DPT can be on the same
      disk array without each trampling on the others' data.
   c. Finish the DBFS so that RDBMS engines will have a sharable, reliable
      and fast storage manager.
   d. Finish the DIO, so the storage manager can span local and remote
      devices.  This is NOT a replacement for NFS :-)

Monitoring:  We have a large set of instruments in the driver.  I publish
some metrics every now and then.  Here is a brief summary of how to get
them:

    cd /dev; ./MAKEDEV dpt${x}    # where x is {0,1,2,3}, one per DPT present
    echo -n "dump softc" > /dev/dpt${x}
    get_dpt /dev/dpt${x}

There is no documentation yet, but the sources are freely available :-)
Look on sendero-ppp.i-connect.net/crash.

Simon