Date:      Tue, 11 Feb 1997 22:49:46 -0800 (PST)
From:      Simon Shapiro <Shimon@i-Connect.Net>
To:        Terry Lambert <terry@lambert.org>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Raw I/O Question
Message-ID:  <XFMail.970211233250.Shimon@i-Connect.Net>
In-Reply-To: <199702112244.PAA29164@phaeton.artisoft.com>


Hi Terry Lambert;  On 11-Feb-97 you wrote: 
> > Can someone take a moment and describe briefly the execution path of a
> > lseek/read/write system call to a raw (character) SCSI partition?
> 
> You skipped a specification step: the FS layout on that partition.
> I will assume FFS with 8k block size (the default).

I skipped nothing :-)  There is NO file system on the partition.
Just a simple file (partitions are files; not in a file system,
but files.  Right? :-)

> I will also assume your lseek is absolute or relative to the start
> of the file (no VOP_GETATTR needed to find the current end of the
> file).

Yes.

> I will take a gross stab at this; clearly, I can't include everything,
> and the Lite2 changes aren't reflected.  I expect I will be corrected
> wherever I have erred.
> 
> lseek
>       -> lseek syscall
>       -> set offset in fdesc
>       -> return( 0);
> 
>       (one could argue that there should be a VOP_LSEEK at
>        this point to allow for predictive read-ahead using
>        the lseek offset -- there is not)
> 
> read
>       -> read syscall
>       -> fill out uio struct
>       -> call VOP_READ using bogus fileops struct dereference
>          which is there because named pipes and UNIX domain
>          sockets aren't in the VFS like they should be
>       -> ffs_read (/sys/ufs/ufs/ufs_readwrite.c:READ)
>       -> (called iteratively)
>               bread
>               -> getblk
>                  (in cache?  No?)
>                  -> vfs_busy_pages
>                     VOP_STRATEGY
>                     -> VCALL strategy routine for device vnode
>                     -> spec_strategy (/sys/miscfs/specfs/spec_vnops.c)
>                     -> call device strategy through bdevsw[]
>                     -> generic scsi (scbus [bus interface]/sd [disk
>                        interface])
>                     -> actual controller requests
>                     biowait
>               uiomove
>               -> copyout
> 
> write
>       -> write syscall
>       -> fill out uio struct
>       -> call VOP_WRITE using bogus fileops struct dereference
>          which is there because named pipes and UNIX domain
>          sockets aren't in the VFS like they should be
>       -> ffs_write (/sys/ufs/ufs/ufs_readwrite.c:WRITE)
>       -> (called iteratively)
>               (partial FS block? !!!CALL READ!!!
>               -> fill in modified areas of partial FS block
>                  (uiomove)
>                  -> copyin
>               bwrite
>       ...

Excellent!  Thank you very much!  I leave it here so those who missed it
get a second chance.
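
For the archive, here is roughly what our user-side test loop looks like
when it exercises the path above (the device name and transfer size are
placeholders, not our real configuration):

/*
 * Sketch only: lseek/read against a raw (character) SCSI partition.
 * On the raw device the read should end up in the character device's
 * read entry (physio) rather than ffs_read, but the syscall side is
 * the same.  Keep the transfer size a multiple of the sector size.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define XFER_SIZE 8192                  /* placeholder transfer size */

int
main(void)
{
        const char *dev = "/dev/rsd0a"; /* placeholder raw partition */
        char        buf[XFER_SIZE];
        off_t       offset = 0;
        ssize_t     n;
        int         fd;

        if ((fd = open(dev, O_RDONLY)) < 0) {
                perror(dev);
                return (1);
        }

        /* lseek: just sets the offset in the file descriptor. */
        if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
                perror("lseek");
                return (1);
        }

        /* read: the uio gets filled out and handed down to the driver. */
        if ((n = read(fd, buf, sizeof(buf))) < 0) {
                perror("read");
                return (1);
        }
        printf("read %ld bytes at offset %ld\n", (long)n, (long)offset);

        close(fd);
        return (0);
}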

> > We are very interested in the most optimal, shortest path to I/O on
> > a large number of disks.
> 
> o     Write in the FS block size, not the disk block size to
>       avoid causing a read before the write can be done

No file system.  See above.  What is the block size used then?
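
(In case it helps anyone else reading this: the unit the raw device cares
about should just be the sector size from the disklabel.  A quick sketch,
assuming the DIOCGDINFO ioctl and a placeholder device name:)

/*
 * Sketch: ask the raw device for its sector size via the disklabel.
 * Assumes DIOCGDINFO; the device name is a placeholder.
 */
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/disklabel.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        struct disklabel dl;
        int fd;

        if ((fd = open("/dev/rsd0a", O_RDONLY)) < 0) {
                perror("open");
                return (1);
        }
        if (ioctl(fd, DIOCGDINFO, &dl) < 0) {
                perror("DIOCGDINFO");
                return (1);
        }
        printf("sector size: %lu bytes\n", (unsigned long)dl.d_secsize);
        close(fd);
        return (0);
}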

> o     Do all I/O on FS block boundaries
> o     Use the largest FS block size you can
> o     Use CCD to implement striping
> o     Use a controller that supports tagged command queueing
> o     Use disk drives with track write caching (they use the
>       rotational speed of the disk to power writes after a
>       power failure, so writes can be immediately ack'ed even
>       though they haven't really been written).

All of these (stripping off the file system pointers, as they do not apply)
are good and valid, except:

1.  We have to guarantee transactions.  This means that a system failure,
    at ANY time, cannot ``undo'' what is said to be done.  IOW, a WRITE
    call that returns positively is known to have been recorded on an
    error/failure-resistant medium.  We will be using DPT RAID
    controllers for a mix of RAID-1 and RAID-5, as the case justifies.
    (A sketch of what one such committed write looks like for us follows
    this list.)

2.  Our basic recorded unit is less than 512 bytes long.  We compromise
    and round it (way) up to 512, since nobody makes fast disk drives
    with sectors smaller than that anymore.  Yes, SCSI being what it
    is, even 512 is way small.  We know...

3.  In case of system failure (the most common reason today == O/S crash),
    we must be back in operation within less than 10 seconds.  We do that
    by sharing the disks with another system, which is already up.

4.  We need to process a very large number of interrupts.  In fact, so
    many that one FreeBSD CPU cannot keep up.  So, we are back to shared
    disks.

5.  Because disks are shared, the write state must be very deterministic
    at all times.  As the O/S has caches, RAID controllers have caches,
    and disks have caches, we have to have some sense of who has what in
    which cache, and when.  Considering the O/S to be the most lossy
    element in the system, we have to keep the amount of WRITE caching
    to a minimum.

    (I do not intend to start a war.  I am quoting my management, who have
    collected some impressive statistics in this manner, using some
    commercial O/S's which will not be named here.)
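
To make item 1 concrete, a sketch of what a single committed record write
looks like for us today (write followed by fsync on the raw device; names,
offsets, and the device path are illustrative only):

/*
 * Sketch of one "transaction": a 512-byte record written at a
 * sector-aligned offset on the raw device, followed by fsync() so
 * that a positive return really means "recorded".  Illustrative only.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RECSIZE 512

static int
commit_record(int fd, off_t recno, const void *rec)
{
        off_t off = recno * RECSIZE;    /* always sector aligned */

        if (lseek(fd, off, SEEK_SET) == (off_t)-1)
                return (-1);
        if (write(fd, rec, RECSIZE) != RECSIZE)
                return (-1);
        /*
         * Lacking a synchronous-write open flag, we fsync after every
         * write; only when this returns do we call the transaction done.
         */
        if (fsync(fd) < 0)
                return (-1);
        return (0);
}

int
main(void)
{
        char rec[RECSIZE];
        int  fd;

        memset(rec, 0, sizeof(rec));
        strcpy(rec, "example record");

        if ((fd = open("/dev/rsd0a", O_RDWR)) < 0) {    /* placeholder */
                perror("open");
                return (1);
        }
        if (commit_record(fd, 0, rec) < 0) {
                perror("commit_record");
                return (1);
        }
        close(fd);
        return (0);
}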

...

> DEFINITION:   Random reads/writes: "please remove any cache
>               effects from my testing, I believe my app will
>               be a cache-killer, so I don't want cache to
>               count for anything because I have zero locality
>               of reference".

Almost, but not quite :-)  Each FreeBSD system will have 50GB of database
on it.  Although the 90/10 rule probably applies, it is impossible to
predict, or force, the locality.  Having a 90% cache hit rate has some
cooling problems associated with it :-)  Not only system cooling, but
management cooling as well.  They do not see a system with that much RAM
as amusing, or exciting.

...  Something got garbled here...

> (zero locality of reference: a hard thing to find in the real world)
> prevent the read-ahead from being invoked.

Ah!  There is read-ahead on raw devices?  How do we shut it down?

> The best speed will be at FS block size, since all reads and writes
> are in terms of chunks in FS block size, some multiple of the page
> size (in general, assuming you want it to be fast).
> 
> The smaller your block size, the more data you have to read off of disk
> for your write.
> 
> The VM system has an 8 bit bitmap, one bit per 512b (physical disk
> block) in a 4k (VM page size) page.  This bitmap is, unfortunately,
> not used for read/write, or your aligned 512b blocks would not have
> to actually read 4k of data off the disk to write the 512b you want
> to write.
> 
> The problem here is that you can not insure write fault granularity
> to better than your page size.  The funny thing is, the i386 will
> not write fault a modification from protected mode (kernel code),
> so it has to fake this anyway -- so it's *possible* to fake it,
> and it would, in general, be a win for writes on disk block boundaries
> for some multiple of disk block size (usually 512b for disks, 1k for
> writeable optical media).

How does all this relate to raw/character devices?

> Talk to John Dyson about this optimization.  Don't expect an enthusiastic
> response: real-world utilization is seldom well aligned... this is a
> "benchmark optimization".

John Dyson,  Sorry, but I do not have your email address...

I made sure it is actually a fit.  We made all the data records (those that
require this performance/reliability) exactly 512 bytes.  It is very
``wasteful'' in terms of storage, but compared to the speed/cost benefits,
it is very cheap.  Also, given the number of spindles required for our
transaction rate, we ended up with more disk space than we need, so...
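
(The padding itself is nothing fancy; something along these lines, with the
field names invented for this example:)

/*
 * Illustration of the "round it up to 512" compromise: a record that
 * is exactly one sector, whatever its real payload size.  Field names
 * are invented for this example.
 */
#include <sys/types.h>

#define RECSIZE 512

struct record {
        u_int32_t seqno;                /* record sequence number */
        u_int32_t len;                  /* bytes of payload actually used */
        char      payload[RECSIZE - 2 * sizeof(u_int32_t)];
};

/* Refuses to compile if the record ever stops being exactly one sector. */
typedef char record_size_check[sizeof(struct record) == RECSIZE ? 1 : -1];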

> > What we see is a flat WRITE response until 2K.  Then it starts a linear
> > decline until it reaches 8K block size.  At this point it converges
> > with READ performance.  The initial WRITE performance, for small blocks,
> > is quite poor compared to READ.  We attribute it to the need to do
> > read-modify-write when blocks are smaller than a certain ``natural
> > block size'' (page?).
> 
> Yes.  But the FS block size is 8k, not pagesize (4k).

We were not using a filesystem.  That's the point.

> > Another attribute of the performance loss, we think, is the lack of an
> > O_SYNC option to the write(2) system call.  This forces the
> > application to do an fsync after EVERY WRITE.  We have to do that for
> > many good reasons.
> 
> There is an option O_WRITESYNC.  Use it, or fcntl() it on.  You will
> take a big hit for using it, however; the only overhead difference
> will be the system call overhead for implementing the protection
> domain crossing for the fsync() call.

O_WRITESYNC!  This is an open(2) option that says that all writes are
synchronous (they do not return until actually done).  Right?  And it
applies to block devices, as well as filesystem files.  Right?

The ``only'' difference is an additional 200 system calls per second?  How
many of these can a Pentium Pro (512K cache, 128MB RAM, etc.) do in one
second?  We are always at 1,000+ in our budget; a 20% increase is a lot
to us.
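
If the flag really behaves as you describe on the raw device, the win for
us would be dropping the separate fsync(2) per record.  Something like this
is what I have in mind (assuming the flag is spelled O_FSYNC in FreeBSD's
<fcntl.h>; corrections welcome):

/*
 * Sketch: request synchronous writes at open time, or toggle the flag
 * later with fcntl(), instead of calling fsync() after every write.
 * Assumes the flag is spelled O_FSYNC and is honored on the raw device.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        int fd, flags;

        /* Option 1: ask for synchronous writes at open time. */
        if ((fd = open("/dev/rsd0a", O_RDWR | O_FSYNC)) < 0) {  /* placeholder */
                perror("open");
                return (1);
        }

        /* Option 2: toggle it on an already-open descriptor. */
        if ((flags = fcntl(fd, F_GETFL, 0)) < 0 ||
            fcntl(fd, F_SETFL, flags | O_FSYNC) < 0) {
                perror("fcntl");
                return (1);
        }

        /*
         * From here on, a successful write() should not return until
         * the data is committed, so no separate fsync() is needed.
         */
        close(fd);
        return (0);
}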

> Most likely, you do not really need this, or you are poorly implementing
> the two stage commit process typical of most modern database design.

Assumptions, assumptions... :-)  There is no database, and there is no
two-phase commit here.  Wish I could share more details in this forum, but
I am already stretching it :-(

> > The READ performance is even more peculiar.  It starts higher than
> > WRITE, declines rapidly until block size reaches 2K.  It peaks at 4K
> > blocks and starts a linear decline from that point on (as block size 
> > increases).
> 
> This is because of precache effects.  Your "random" reads are not
> "random" enough to get rid of cache effects, it seems.  If they were,
> the 4k numbers would be worse, and the peak would be the FS block size.

On a block device?  Which filesystem?

> > We intend to use the RAW (character) device with the mpool buffering
> > system and would like to understand its behavior without reading the
> > WHOLE kernel source :-)
> 
> The VM and buffer cache have been unified.  bread/bwrite are, in fact,
> page fault operations.  Again, talk to John Dyson about the bitmap
> optimization for representing partially resident pages at this level;
> otherwise, you *must* fault the data in and out on page boundaries,
> and the fault will be in page groups of FS blocksize in size.

Hmmm...  We are going back to this decades-old argument.  One of the few
things I did agree with while at Oracle was that not ALL disk I/O in a
system is composed of page faults and emacs sessions.  Sometimes I/O needs
to be performed in ways that defy any preplanning on the O/S architect's
part.  This is where raw devices are so crucially important.
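
For what it is worth, we do keep our user buffers page aligned so that the
fault/physio path has as little extra work to do as possible.  A rough
sketch of how one can get such a buffer (mmap is just one way):

/*
 * Sketch: obtain a page-aligned buffer for raw-device I/O with mmap(),
 * so transfers start on a page boundary.  Purely illustrative.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        size_t pagesz = (size_t)getpagesize();
        size_t bufsz  = 16 * pagesz;    /* arbitrary buffer size */
        char  *buf;

        buf = mmap(NULL, bufsz, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return (1);
        }

        /*
         * buf is page aligned; use it as the source/target of read()
         * and write() on the raw device.
         */

        munmap(buf, bufsz);
        return (0);
}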

The same tests described here were run on a well-known commercial OS.  It
exhibits a totally flat response from 512-byte to 4K blocks.  What happens
at 8K blocks and larger?  The process totally hangs if you do a read plus
an (O_SYNC) write on the same file at the same time.  Cute.

...

> Jorg, Julian, and the specific SCSI driver authors are probably
> your best resource below the bdevsw[] layer.

I appreciate that.  I have not seen anything in the SCSI layer that really
``cares'' about the type of I/O done.  It all appears the same.

> They are written using a write operation which blocks until the data
> has been committed.  Per the definition of O_WRITESYNC.

Thanx!

...

Simon


