Date:      Wed, 12 Feb 1997 05:25:59 -0500 (EST)
From:      Peter Dufault <dufault@hda.com>
To:        Shimon@i-Connect.Net (Simon Shapiro)
Cc:        terry@lambert.org, freebsd-hackers@freebsd.org
Subject:   Re: Raw I/O Question
Message-ID:  <199702121026.FAA11921@hda.hda.com>
In-Reply-To: <XFMail.970211233250.Shimon@i-Connect.Net> from Simon Shapiro at "Feb 11, 97 10:49:46 pm"

As has already been noted, if you're using the raw device, don't
look at the block device.

> No file system.  See above.  What is the block size used then?

I'm assuming you will hack your driver as needed.  In that case the
block size is whatever size you request, which must be a multiple
of the sector size and no larger than 64K.
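
A minimal sketch of that size rule in C, for illustration only.  The
names SECTOR_SIZE, MAX_RAW_XFER, raw_xfer_ok and raw_xfer_clamp are
hypothetical, and the 512-byte sector is an assumption; a real driver
would take the sector size from the device.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration, not taken from any real driver. */
#define SECTOR_SIZE   512u           /* bytes per sector (assumption) */
#define MAX_RAW_XFER  (64u * 1024u)  /* the 64K ceiling mentioned above */

/* True if len is a nonzero multiple of the sector size, at most 64K. */
static bool raw_xfer_ok(size_t len)
{
    return len != 0 && len <= MAX_RAW_XFER && len % SECTOR_SIZE == 0;
}

/*
 * Largest valid transfer size that does not exceed want.
 * Returns 0 if want is smaller than one sector.
 */
static size_t raw_xfer_clamp(size_t want)
{
    if (want > MAX_RAW_XFER)
        want = MAX_RAW_XFER;
    return want - (want % SECTOR_SIZE);
}
```

A caller would split a large request into pieces of raw_xfer_clamp()
bytes and issue one read or write per piece.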

The optimal I/O chunk is probably a full track striped across all
disks, but then your I/O size will vary with cylinder offset and
you'll really waste space, so forget that.

> All these, stripping off the file system pointers, as they do not apply)
> are good and valid, except:
> 
> 1.  We have to guarantee transactions.  This means that system failure,
>     at ANY time cannot ``undo'' what is said to be done.  IOW, a WRITE
>     call that returns positively, is known to have been recorded, on 
>     error/failure resistant medium.  We will be using DPT RAID 
>     controllers for a mix of RAID-1 and RAID-5, as the case justifies.

You will have to have your driver issue the calls to flush the
disk's cache at the correct points, or turn off caching on
the drives.
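
One way a driver can ask a SCSI disk to flush its cache is the
SYNCHRONIZE CACHE(10) command (opcode 0x35).  The sketch below only
builds the 10-byte command block; build_sync_cache10 is a hypothetical
helper, how the command gets submitted is driver specific, and whether
a given RAID controller honors it or passes it through to the drives
is an assumption to check with the manufacturer.

```c
#include <stdint.h>
#include <string.h>

#define SCSI_SYNCHRONIZE_CACHE_10 0x35

/*
 * Fill cdb[10] with a SYNCHRONIZE CACHE(10) command.
 *   lba     - first logical block to flush
 *   nblocks - number of blocks; 0 means "from lba to the end of the medium"
 * The IMMED bit is left clear, so the device should not report
 * completion until the flush has finished.
 */
static void build_sync_cache10(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
    memset(cdb, 0, 10);
    cdb[0] = SCSI_SYNCHRONIZE_CACHE_10;
    cdb[2] = (uint8_t)(lba >> 24);      /* logical block address, big-endian */
    cdb[3] = (uint8_t)(lba >> 16);
    cdb[4] = (uint8_t)(lba >> 8);
    cdb[5] = (uint8_t)lba;
    cdb[7] = (uint8_t)(nblocks >> 8);   /* number of blocks, big-endian */
    cdb[8] = (uint8_t)nblocks;
}
```

Calling it with lba 0 and nblocks 0 requests a flush of the whole
medium.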

> 2.  Our basic recorded unit is less than 512 bytes long.  We compromise
>     and round it (way) up to 512 since nobody makes fast disk drives
>     with sectors smaller than that anymore.  Yes, SCSI being what it
>     is, even 512 is way small.  We know...

You can reformat the drive to a smaller sector size.

> 3.  In case of system failure (most common reason today == O/S crash) we
>     must be back in operation within less than 10 seconds.  We do that by
>     sharing the disks with another system, which is already up.
> 
> 4.  We need to process very large number of interrupts.  In fact, so
>     many that one FreeBSD CPU cannot keep up.  So, we are back to shared
>     disks.

State what that large number is.  There may be too much going on
in the interrupt, but that is a fixable problem especially for a
single controller.  I'm skeptical that with things balanced properly
you will run out of CPU before memory bandwidth, especially given
that you seem to be able to use large transfers.

> 5.  Because disks are shared, the write state must be very deterministic
>     at all times.  As O/S have caches, RAID controllers have caches,
>     disks have caches, we have to have some sense of who has what in 
>     which cache when.  Considering the O/S to be the most lossy element
>     in the system, we have to keep the amount of WRITE caches to a
>     minimum.
> 
>     (I do not intend to start a war.  I am quoting my management who has
>     collected some impressive statistics in this manner.  Using some
>     commercial O/S's which will not be named here)

The raw I/O system doesn't have a cache.  It is transferring the
data directly from your user process to the device.  From then on
you are at the mercy of the RAID setup and the corresponding disk
setup, which may be specified by the RAID controller manufacturer.

(...)

> Ah!  there is a read-ahead on raw devices?  How do we shut it down?

Any read-ahead happens in the device itself, not in the system.
A disk has a cache, and its behavior is specified in the SCSI
caching mode page (page 8).  The RAID controller probably also has
a cache, and you must look to the manufacturer to find out what you
must / can do to tune it.
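
As an illustration, here is a sketch that decodes the two basic cache
control bits from a SCSI caching mode page: WCE (write cache enable,
bit 2 of byte 2) and RCD (read cache disable, bit 0 of byte 2).  The
struct and function names are hypothetical, and the buffer is assumed
to start at the mode page itself, after the mode parameter header and
any block descriptors returned by MODE SENSE.  The page's prefetch
fields are not decoded here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHING_PAGE_CODE 0x08
#define CACHE_WCE         0x04  /* byte 2, bit 2: write cache enable */
#define CACHE_RCD         0x01  /* byte 2, bit 0: read cache disable */

struct cache_state {
    bool write_cache_enabled;
    bool read_cache_enabled;
};

/*
 * Decode a caching mode page.  page must point at the first byte of
 * the page (page code byte).  Returns 0 on success, -1 if the buffer
 * is too short or is not page 8.
 */
static int decode_caching_page(const uint8_t *page, size_t len,
                               struct cache_state *out)
{
    if (page == NULL || out == NULL || len < 3)
        return -1;
    if ((page[0] & 0x3f) != CACHING_PAGE_CODE)  /* mask off the PS bit */
        return -1;
    out->write_cache_enabled = (page[2] & CACHE_WCE) != 0;
    out->read_cache_enabled  = (page[2] & CACHE_RCD) == 0;
    return 0;
}
```

Changing the settings means sending the modified page back with MODE
SELECT, if the drive and the controller allow it.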

> 
(...)
> 
> How does all this relate to raw/character devices?

It doesn't.

(...)

> > Most likely, you do not really need this, or you are poorly implementing
> > the two stage commit process typical of most modern database design.
> 
> Assumptions, assumptions... :-)  There is no database, there is no 2phase
> commit here.  Wish I could share more details in this forum, but I am 
> already stretching it :-(  

Hmm.  I'm from Missouri (For non-Americans: The show-me state, indicating
healthy skepticism about claims until demonstrated).

> > They are written using a write operation which blocks until the data
> > has been committed.  Per the definition of O_WRITESYNC.

Again, they aren't.  The drivers issue a WRITE, and what happens
after that is however the RAID controller or SCSI disk responds
to that WRITE.

The RAID controller manufacturer has hopefully spent time figuring
out how to configure the disks to ensure no spurious "writes claimed
finished when not".  They need that to guarantee fault tolerance.
Go over your questions with them.

I would evaluate writing a raw device driver from scratch specifically
for this application and using any existing drivers as maintenance
devices.  This lets you address issues such as interrupt overhead
separately and from a clean slate, and will make it a lot
easier to be sure you are extracting the performance you want.
I expect you won't fully "get your head around the problem" otherwise.

Peter

-- 
Peter Dufault (dufault@hda.com)   Realtime Machine Control and Simulation
HD Associates, Inc.               Voice: 508 433 6936


