Date:      Tue, 26 Aug 1997 22:10:16 -0700 (PDT)
From:      Simon Shapiro <Shimon@i-Connect.Net>
To:        Jason Thorpe <thorpej@nas.nasa.gov>, FreeBSD-Hackers@FreeBSD.org
Subject:   RE: DPT driver - Overview (very long)
Message-ID:  <XFMail.970826221016.Shimon@i-Connect.Net>
In-Reply-To: <199708261726.KAA07028@lestat.nas.nasa.gov>

Folks,

I am posting this reply so that all those who ask me this type of question
will get a (slightly more) concise answer.


Hi Jason Thorpe;  On 26-Aug-97 you wrote: 

>  Hello Simon,
>  
>  I was going to embark on porting your DPT driver to NetBSD, but when I
>  looked at it, it was a bit more complex-looking than I expected.
>  
>  I was wondering if you could give me a brief design overview?  I'm most
>  interested in the use of soft interrupts, and would also like to know
>  what the "signature" thing is all about...
>  
>  Thanks!

Let's start at the end :-)  The signature business;  This is a piece of 
``black magic'' that identifies the exact version and configuration of
the DPT to a calling program.  ALL the DPT controllers are essentially
the same from an API point of view.  Some are more limited (ISA/EISA/PCI),
some are slower (68000, 68020, 68030), some have multiple busses, some do
not, etc.

The next step is to understand what a DPT is.  It really is not an HBA
(Host Bus Adapter), although it likes to pretend to be one.  It is more
like an I/O Channel controller on a mainframe;  A computer with its own
O/S, memory, I/O (there is even a serial port on the card), etc.

In addition to normal HBA functions, the DPT manages RAID arrays in a very
clever and complex way.  RAID arrays need configuration, dead disks need
to be replaced, replacements added, etc.  Also consider the need for (and
ability to) grow arrays, share arrays with other controllers on the same
bus, etc.  Last, but not least, is the need to interact with external
entities and exchange configuration and status data;  The DPT can tell you
that the disk cabinets are too warm, that a fan failed, that a disk is
about to fail, how well your disks are performing, etc.

To do all this you need a user interface.  DPT has a standard user
interface that we ported wholesale to FreeBSD.  We added some minor, cute
options along the way.  One of the components in this API is the passing
of ``signatures''.  If you port this driver, just copy this stuff.
You really do not need it for normal operation and it is pretty much O/S
independent.

Now for the driver overview.

It starts life in a fairly normal way, by doing the regular PCI
registration shtick.  One piece that will catch your eye is the attempt
to switch the controller from I/O mapped registers to memory mapped ones.
This is not functional yet and should be left alone.  There may be some
performance advantage, and some bug-fixing value, in it.
One difference is already visible;  A linked list of dpt_ccb, eata_ccb
and sp structures is allocated.  How many are allocated depends on what a
given DPT says about itself.  These CCBs (a dpt_ccb contains an eata_ccb
and an sp, among other things) are put into the FREE queue.  There is such
a queue for each controller found.

EATA:      Extended ATA.  The protocol by which we talk to the controller.
eata_ccb:  A data structure, understood by the DPT, which contains
           detailed instructions on what to do, including the SCSI
           command itself.
sp:        A Status Packet.  Filled in by the DPT when a command is
           complete.
dpt_ccb:   A container that holds ALL the relevant data about a
           transaction, including an eata_ccb and an sp.
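
To make the relationship between these concrete, here is a rough sketch
in C.  The field names and layout are mine, for illustration only (the
real eata_ccb layout is dictated by the EATA spec), so do not copy them
verbatim:

  #include <sys/types.h>
  #include <sys/queue.h>
  #include <scsi/scsiconf.h>        /* old SCSI layer: scsi_xfer, scsi_link, ... */

  struct eata_ccb {                 /* DMAed to the DPT; its fields are big-endian */
          u_int32_t cp_datalen;     /* transfer length (htonl'd)                   */
          u_int32_t cp_dataaddr;    /* physical address of the data or SG list     */
          u_int32_t cp_viraddr;     /* the unique "signature", echoed back to us   */
          u_int8_t  cp_cdb[12];     /* the SCSI command itself                     */
          /* ... plus many more control bits defined by the EATA spec ...          */
  };

  struct sp {                       /* Status Packet, filled in by the DPT         */
          u_int8_t  hba_stat;       /* controller result code                      */
          u_int8_t  scsi_stat;      /* SCSI status byte                            */
          u_int32_t ccb_addr;       /* the signature of the eata_ccb just finished */
  };

  struct dpt_ccb {                  /* host-side container for one transaction     */
          struct eata_ccb      eata_ccb;  /* what we hand to the DPT               */
          struct sp            sp;        /* what the DPT answers with             */
          struct scsi_xfer    *xs;        /* the upper-layer request               */
          int                  state;     /* FREE, WAITING, SUBMITTED, COMPLETED   */
          TAILQ_ENTRY(dpt_ccb) links;     /* linkage on one per-controller queue   */
  };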

Since a DPT can have up to three busses per controller, the dpt_softc
structure contains an array of scsi_link structures.  One scsi_link per
bus.
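
Per controller, that boils down to something like this (again a sketch
with guessed names; scsi_link comes from the old FreeBSD SCSI layer, and
the real dpt_softc carries a lot more state):

  #define DPT_MAX_CHANNELS 3                       /* up to three busses per DPT  */

  struct dpt_softc {
          struct scsi_link sc_link[DPT_MAX_CHANNELS];  /* one scsi_link per bus    */

          struct sp *sp;                  /* the controller-general status packet  */
          int        max_hw_commands;     /* how many the DPT says it will accept  */

          /* The per-controller CCB queues described in the flow below.            */
          TAILQ_HEAD(, dpt_ccb) free_ccbs;        /* allocated, not in use         */
          TAILQ_HEAD(, dpt_ccb) waiting_ccbs;     /* accepted, not yet sent        */
          TAILQ_HEAD(, dpt_ccb) submitted_ccbs;   /* inside the DPT right now      */
          TAILQ_HEAD(, dpt_ccb) completed_ccbs;   /* done, callback still pending  */
          int free_count, waiting_count, submitted_count, completed_count;

          /* ... plus I/O port base, unit number, controller inquiry data, etc.    */
  };

At attach time the driver asks the DPT how many commands it can take,
allocates that many dpt_ccbs, and puts them all on the free queue.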

Flow.  The upper kernel sends a SCSI command to dpt_scsi_cmd.  We look at
it, tweak some of the data, pop a dpt_ccb from the free queue, populate
the eata_ccb and push the request onto the WAITING queue.  One important
piece to remember is that we put in the eata_ccb a unique signature, used
later to identify the completed command.

  Remember!  The DPT talks big-endian, so htonl is very critical there.
Once we are ready to submit the command to the DPT, we do not!
Instead, we call dpt_sched_queue.  What this function does
is SCHEDULE a software interrupt to happen some time later.  Once we have
scheduled the interrupt, we return, telling the upper layer that the
request was scheduled.
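
In code, the accept path comes out roughly as below.  This is a sketch,
not the driver: the scsi_xfer glue is the old FreeBSD SCSI layer's,
schedule_dpt_swi() stands in for whatever software-interrupt hook the
platform provides, and the structure fields are the illustrative ones
from above:

  static void dpt_sched_queue(struct dpt_softc *dpt);

  /* A paraphrased dpt_scsi_cmd(); error handling, the spl dance and the
     scatter/gather list building are omitted for brevity.                */
  static int32_t
  dpt_scsi_cmd(struct scsi_xfer *xs)
  {
          struct dpt_softc *dpt = xs->sc_link->adapter_softc; /* or however the
                                                                 softc is found  */
          struct dpt_ccb   *ccb;

          if ((ccb = TAILQ_FIRST(&dpt->free_ccbs)) == NULL)
                  return (TRY_AGAIN_LATER);            /* FREE queue is empty     */
          TAILQ_REMOVE(&dpt->free_ccbs, ccb, links);   /* pop a dpt_ccb off FREE  */

          ccb->xs = xs;
          bcopy(xs->cmd, ccb->eata_ccb.cp_cdb, xs->cmdlen);
          ccb->eata_ccb.cp_datalen  = htonl(xs->datalen);       /* big-endian!    */
          ccb->eata_ccb.cp_dataaddr = htonl(vtophys(xs->data)); /* or the SG list */
          ccb->eata_ccb.cp_viraddr  = (u_int32_t)ccb;  /* the unique signature    */

          TAILQ_INSERT_TAIL(&dpt->waiting_ccbs, ccb, links);    /* onto WAITING   */

          dpt_sched_queue(dpt);            /* do NOT touch the hardware from here */
          return (SUCCESSFULLY_QUEUED);    /* "your request was scheduled"        */
  }

  /* All dpt_sched_queue() really does is ask for a software interrupt.   */
  static void
  dpt_sched_queue(struct dpt_softc *dpt)
  {
          schedule_dpt_swi();              /* placeholder for the platform's SWI
                                              hook; dpt_sintr() runs later, at
                                              its own spl level                   */
  }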

At some time in the future (the details evade me, but saying that we have
to be at spl0 is probably not a complete lie), the software interrupt
will occur.  It will actually call the function dpt_sintr.  Dpt_sintr runs
at an spl level unique to it, and different from the spl level of the
regular interrupt routine dpt_intr.

Unfortunately, there is no way I know of to pass any arguments to
dpt_sintr.  So, dpt_sintr tests all the COMPLETED queues of all the
controllers.  The completed queue contains transactions completed by the
DPT that have not yet been called back to the upper layers.  We will see
shortly how they got there.  Any command found in the completed queue is
disposed of.  If it was successful, we call scsi_done.  If not, and a
re-try is indicated, we put the command back in the waiting queue.
This time it goes in the front, not the back, of the queue.  In this case,
we schedule a software interrupt (to pick up the command from the waiting
queue).
If the command utterly failed, or succeeded, we push the dpt_ccb back into
the free queue.  You can control (with DPT_FREELIST_IS_STACK) where it
goes.  Stack will put it in the front, otherwise it goes in the back.
Putting it in the front increases the chance of an L2 cache hit.  It also
increases the chance of weird race conditions.
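
In rough code (dpt_softcs[], dpt_unit_count, HA_NO_ERROR and
dpt_retry_wanted() are made-up names standing in for the driver's real
bookkeeping):

  /* No argument can be passed to a software interrupt handler, so
     dpt_sintr() walks the COMPLETED queue of every controller found.     */
  static void
  dpt_sintr(void)
  {
          struct dpt_softc *dpt;
          struct dpt_ccb   *ccb;
          int               unit;

          for (unit = 0; unit < dpt_unit_count; unit++) {
                  dpt = dpt_softcs[unit];
                  while ((ccb = TAILQ_FIRST(&dpt->completed_ccbs)) != NULL) {
                          TAILQ_REMOVE(&dpt->completed_ccbs, ccb, links);

                          if (ccb->sp.hba_stat == HA_NO_ERROR) {
                                  ccb->xs->error = XS_NOERROR;
                                  scsi_done(ccb->xs);        /* success           */
                                  TAILQ_INSERT_TAIL(&dpt->free_ccbs, ccb, links);
                                  /* (or the front, with DPT_FREELIST_IS_STACK)   */
                          } else if (dpt_retry_wanted(ccb)) {
                                  /* retry: FRONT of the waiting queue, re-kick   */
                                  TAILQ_INSERT_HEAD(&dpt->waiting_ccbs, ccb, links);
                                  dpt_sched_queue(dpt);
                          } else {
                                  ccb->xs->error = XS_DRIVER_STUFFUP;
                                  scsi_done(ccb->xs);        /* utter failure     */
                                  TAILQ_INSERT_TAIL(&dpt->free_ccbs, ccb, links);
                          }
                  }
                  /* ...and then try to feed the DPT from the WAITING queue.      */
          }
  }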

How do we send a command to the DPT?  We load into a register set the
address of the eata_ccb and write into the command port a command to
``go and do it''.  That's all.  The DPT will DMA the EATA ccb, do the
scatter/gather (did I mention that we build the list in dpt_scsi_cmd?),
etc.  Before we actually send the command down to the DPT, we pop it
off the waiting queue.  Once we have sent it successfully, we push it onto
the submitted queue.  If we failed to send it, we push it back on the
waiting queue and... Right! We generate a software interrupt.  If we know
that the DPT is too busy (we always know how many commands are in which
queue), we do not even try to send it anything.
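
Paraphrased, the feeding side looks like this; call it dpt_run_queue
here, with max_hw_commands and dpt_send_eata_command() standing in for
the real limit and the register poking just described:

  /* Called (from the SWI, among other places) to move commands from the
     WAITING queue into the DPT, for as long as it will accept them.      */
  static void
  dpt_run_queue(struct dpt_softc *dpt)
  {
          struct dpt_ccb *ccb;

          while (dpt->submitted_count < dpt->max_hw_commands &&   /* not too busy */
                 (ccb = TAILQ_FIRST(&dpt->waiting_ccbs)) != NULL) {
                  TAILQ_REMOVE(&dpt->waiting_ccbs, ccb, links);   /* pop it off   */

                  /* Load the eata_ccb's physical address into the DPT's mailbox
                     registers and write the ``go and do it'' command.            */
                  if (dpt_send_eata_command(dpt, &ccb->eata_ccb) == 0) {
                          TAILQ_INSERT_TAIL(&dpt->submitted_ccbs, ccb, links);
                          dpt->submitted_count++;
                  } else {
                          /* The DPT would not take it: back onto WAITING and
                             schedule a software interrupt to try again.          */
                          TAILQ_INSERT_HEAD(&dpt->waiting_ccbs, ccb, links);
                          dpt_sched_queue(dpt);
                          break;
                  }
          }
  }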

What happens at command completion?  

Before we answer this, remember that the DPT can take (with standard
firmware) up to 64 concurrent commands, so the pushing process continues
until the DPT's incoming queue is full, or until it tells us ``no more''.

When the DPT completes a command, it does the following in this order:

  * DMA all the data in/out, whatever.
  * Fill out the different SCSI state/reply structures.
  * Fill an SP, general to the particular controller, with a result code
    and the unique signature which identifies the command just completed.
    (Remember, we filled this one out in dpt_scsi_cmd.)
  * Write to the status registers the fact that a command completed.
  * Generate a hardware interrupt.
  * Ignore us until we tell it we serviced this interrupt.

We receive the interrupt in dpt_intr.  What we basically do there is
examine the signature (with varying degrees of pedantry).  If we do not
like it, it is an aborted interrupt.  Remember these!
If we like it, we treat it as if it is the address of the dpt_ccb which
contains the transaction.  We copy into the dpt_ccb some data that needs
copying, take the command off the submitted queue and put it in the
completed queue.  Now what do we do?  Right.  We schedule a software 
interrupt.
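
In code, dpt_intr comes out more or less like this; dpt_ccb_is_ours()
and dpt_ack_interrupt() are placeholders for the signature sanity check
and for the register write that tells the DPT we serviced the interrupt:

  /* The hardware interrupt handler.  All the real work has been pushed
     off to dpt_sintr(); this just shuffles the CCB between queues.       */
  static void
  dpt_intr(void *arg)
  {
          struct dpt_softc *dpt = arg;
          struct dpt_ccb   *ccb;

          /* The SP the DPT just filled in carries the signature we planted
             in dpt_scsi_cmd(); be pedantic about it before trusting it.
             (32-bit pointers assumed, as on the i386 of the day.)         */
          ccb = (struct dpt_ccb *)dpt->sp->ccb_addr;
          if (!dpt_ccb_is_ours(dpt, ccb)) {
                  dpt_ack_interrupt(dpt);         /* aborted interrupt     */
                  return;
          }

          ccb->sp = *dpt->sp;                     /* copy what needs copying    */
          TAILQ_REMOVE(&dpt->submitted_ccbs, ccb, links);
          dpt->submitted_count--;
          TAILQ_INSERT_TAIL(&dpt->completed_ccbs, ccb, links);

          dpt_ack_interrupt(dpt);                 /* now it listens to us again */
          dpt_sched_queue(dpt);                   /* and... right, a SWI        */
  }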

What does this interrupt do?  Scan the completed queue, scan the waiting
queue, etc.

Observations:  

1.  Why bother?  The DPT driver for FreeBSD is part of a non-stop (I hate
    to say Fault Tolerant - a term abused by marketing and an invitation
    to big arguments) transaction processor.  What we wanted is a system
    where we can push many I/O requests into the kernel and never have a
    user context block on hardware I/O.  We also wanted to be able to
    schedule new requests concurrently with processing interrupts from
    the hardware.

2.  Is it really faster?  No, not really.  If you examine the response
    time, on a slow processor, to a single thread of I/O requests, it is
    actually slower (there WAS an option to compile the driver without
    all this SWI stuff).  The system shines when you examine two things:
    How well does it behave under extremely heavy I/O load?  And, how many
    I/O operations per second can we process?  We needed about 800 of the
    latter for our application.  I measured today 1,432, with a one-second
    peak of 1,560.  I suspect I simply do not know how to push enough
    transactions into the driver.

    When you examine the system behavior under load, it is also very good.
    Our standard test rig includes running at least 256, or even 512,
    user processes, each in a tight loop doing RANDOM I/O on either a tiny
    disk (so we can have ALL of the disk in the DPT cache) or a huge disk
    (so our cache hit rate will drop as close to zero as possible).
    Under this load, the Load Average is about 50-80, top still runs every
    second, the keyboard is totally responsive, and network packets still
    arrive at the rate of well over 1,000/sec.  We have seen a single
    P6-200 do over 6,000 interrupts/sec.

3.  Are you really this smart?  No, I am not.  Although, in the context of
    FreeBSD, I may be able to take the blame for the concept; in the
    context of a SCSI interface, anyone familiar with the BSD kernel
    networking code will immediately recognize what is going on here.

    Another name to mention here, someone who has been invaluable in
    getting this done, is Justin Gibbs.  Probably one of the finest
    programmers I have ever seen.

4.  Where do we go from here?

    a.  Complete the /dev/dpt interface.  It needs testing and debugging.
    b.  Integrate the DLM support, so more than one DPT can be on the
        same disk array, without them trampling on each other's data.
    c.  Finish the DBFS so that RDBMS engines will have a sharable,
        reliable and fast storage manager.
    d.  Finish the DIO, so the storage manager can span local and remote
        devices.  This is NOT a replacement for NFS :-)

Monitoring:  We have a large set of instruments in the driver.  I publish
some metrics every now and then.  Here is a brief summary of how to get
at them:

cd /dev; ./MAKEDEV dpt${x}    # where x is {0,1,2,3}, one for each DPT present

echo -n "dump softc" > /dev/dpt${x}
get_dpt /dev/dpt${x}

There is no documentation yet, but the sources are freely available :-)
Look on sendero-ppp.i-connect.net/crash.

Simon


