Date:      Wed, 12 Feb 1997 11:20:31 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        Shimon@i-Connect.Net (Simon Shapiro)
Cc:        terry@lambert.org, freebsd-hackers@freebsd.org
Subject:   Re: Raw I/O Question
Message-ID:  <199702121820.LAA00856@phaeton.artisoft.com>
In-Reply-To: <XFMail.970211233250.Shimon@i-Connect.Net> from "Simon Shapiro" at Feb 11, 97 10:49:46 pm

> > > Can someone take a moment and describe briefly the execution path of a
> > > lseek/read/write system call to a raw (character) SCSI partition?
> > 
> > You skipped a specification step: the FS layout on that partition.
> > I will assume FFS with 8k block size (the default).
> 
> I skipped nothing :-)  There is NO file system on the partition.
> Just a simple file (partitions are files; not in a file system,
> but files.  Right? :-)

So you are writing a user space FS to use the partition.

My previous posting referred only to FS-formatted block devices;
I interpreted "raw" to mean something other than "raw device"
because when I heard hoofbeats, I thought "horses".

Do you work for Oracle?  8-).


I think the problem comes down to how you handle your commits, and
exactly what your on disk structure looks like, and exactly what you
plan to do to your device driver to support your user space FS.

The "lseek" is basically the same, but the "read" and the "write"
are not.  They go through the struct fileops and into the ops defined
by specfs for character devices, and from there directly to the driver
strategy routines via the cdevsw entry.
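
To make that concrete, here is a minimal user space sketch of driving
the character device directly.  The device name (/dev/rsd0c) and the
512 byte sector size are assumptions, not something any particular
driver promises:

/*
 * Minimal sketch: lseek/read against a raw (character) disk device.
 * The device name and the 512 byte sector size are assumptions.  The
 * user buffer goes straight through physio to the driver's strategy
 * routine, so transfers that are not whole sectors at sector-aligned
 * offsets will typically be rejected.
 */
#include <sys/types.h>

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

#define SECTOR	512			/* assumed device block size */

int
main(void)
{
	static char buf[SECTOR];	/* one whole sector */
	int fd;

	fd = open("/dev/rsd0c", O_RDONLY);	/* hypothetical raw device */
	if (fd == -1)
		err(1, "open");

	/* seek to a sector boundary and read exactly one sector */
	if (lseek(fd, (off_t)100 * SECTOR, SEEK_SET) == -1)
		err(1, "lseek");
	if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		err(1, "read");

	close(fd);
	return (0);
}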


I profoundly believe you should not use character devices for this.

I also believe the FS should be in the kernel, not in user space,
to avoid unnecessary protection domain crossings and the context
switching they cause.

We can squeeze some additional code path out of there by eliminating
struct fileops; it's not that hard to do.  The struct is the result of a
partial integration of vnode ops into the VFS framework (the VFS
framework was a rush job: USL attempted to cripple the ability to make a
bootable OS by surgically claiming 6 pieces of the kernel, which really
funneled down to 5 critical subsystems; the new VFS code was, IMO, a
workaround for the consent decree for one of those).


> > > We are very interested in the most optimal, shortest path to I/O on
> > > a large number of disks.
> > 
> > o     Write in the FS block size, not the disk block size to
> >       avoid causing a read before the write can be done
> 
> No file system.  See above.  What is the block size used then?

This is dependent on the device and the device driver.  For disk
devices, the block size is 512 bytes.  This only really applies if
you are using the block device, not the character device; I recommend
you use the block device.
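
As a sketch of why the block device helps (again, the device name and
block size are assumptions): a write that covers whole, block-aligned
units can be absorbed by the buffer cache without the read that a
partial-block write forces first.

/*
 * Sketch: write through the block device in whole device blocks.  The
 * device name and the 512 byte block size are assumptions; a write that
 * covers a block completely can be entered into the buffer cache without
 * reading the old contents of that block first.
 */
#include <sys/types.h>

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define DEV_BSIZE_ASSUMED	512

int
main(void)
{
	static char unit[DEV_BSIZE_ASSUMED];	/* one whole device block */
	int fd;

	fd = open("/dev/sd0c", O_WRONLY);	/* hypothetical block device */
	if (fd == -1)
		err(1, "open");

	memset(unit, 0xff, sizeof(unit));

	/* block-aligned offset, whole-block length: no read-before-write */
	if (pwrite(fd, unit, sizeof(unit),
	    (off_t)42 * DEV_BSIZE_ASSUMED) != (ssize_t)sizeof(unit))
		err(1, "pwrite");

	close(fd);
	return (0);
}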


> All these (stripping off the file system pointers, as they do not apply)
> are good and valid, except:
> 
> 1.  We have to guarantee transactions.  This means that system failure,
>     at ANY time cannot ``undo'' what is said to be done.  IOW, a WRITE
>     call that returns positively, is known to have been recorded, on 
>     error/failure resistant medium.  We will be using DPT RAID 
>     controllers for a mix of RAID-1 and RAID-5, as the case justifies.

Are you using a journal, a log, or some other method to handle
implied state across domains?  For example, say I have an index and
a bunch of records pointed to by that index.

In order to do the transaction, I need a two-stage commit: I allocate
a new record, write it, and then rewrite the index.  In practical
terms, this is:

	i)	alloc new record
	ii)	write new record
	iii)	commit new record
	iv)	write new index
	v)	deallocate old record

...in other words, a standard two-stage commit process across two
files.
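
A minimal sketch of that ordering, assuming the records and the index
live in two ordinary flat files and using fsync() as the commit barrier
at each stage; the descriptors, offsets, and layout are hypothetical:

/*
 * Sketch of the two-stage commit above: the new record must be stable
 * before the index that points at it is rewritten.  File descriptors,
 * offsets, and record layout are hypothetical.
 */
#include <sys/types.h>

#include <err.h>
#include <unistd.h>

static void
commit_record(int rec_fd, int idx_fd,
    const void *rec, size_t reclen, off_t rec_off,
    const void *idx, size_t idxlen, off_t idx_off)
{
	/* i/ii) write the new record into its freshly allocated slot */
	if (pwrite(rec_fd, rec, reclen, rec_off) != (ssize_t)reclen)
		err(1, "pwrite record");

	/* iii) commit: the record is on disk before anything points at it */
	if (fsync(rec_fd) == -1)
		err(1, "fsync record");

	/* iv) rewrite the index entry to reference the new record */
	if (pwrite(idx_fd, idx, idxlen, idx_off) != (ssize_t)idxlen)
		err(1, "pwrite index");
	if (fsync(idx_fd) == -1)
		err(1, "fsync index");

	/* v) the old record's slot may now be reused (deallocation not shown) */
}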

If you are using a log, then in case of failure you can "undo" any
partially complete transactions (with the commit order above you can
recover by back-out only... allocated records without index entries are
deallocated at the next startup).  In the general case, you can roll
your transaction forward if you add:

	.)	start transaction with "intent" record
	i)	alloc new record
	ii)	write new record
	.)	write "record data valid" -- this replaces "commit"

XXX failure after this point can be rolled forward using "intent" record.

	iv)	write new index
	v)	deallocate old record
	.)	mark transaction complete

Note: THIS DOES NOT REQUIRE A COMMIT TO DISK EXCEPT FOR THE LOG.  You
      must guarantee the order of the actual write operations, but not
      that each write has actually completed before you start the next
      one.

So basically, separating what you have to commit from what you merely
have to order can save you a hell of a lot of waiting.
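
Here is a sketch of that split, using the same hypothetical two-file
layout as above: only the intent log waits on the disk, and the record
and index writes are merely issued in order, on the assumption (per the
note above) that the device preserves issue order.  The log format is
made up for illustration:

/*
 * Sketch: commit only the intent log; order, but do not individually
 * commit, the record and index writes.  This assumes the log descriptor
 * was opened for synchronous writes and that the device preserves the
 * order the writes are issued in, per the note above.
 */
#include <sys/types.h>

#include <err.h>
#include <unistd.h>

struct logrec {
	unsigned long	txn;		/* transaction id */
	int		state;		/* 0 intent, 1 data valid, 2 complete */
	off_t		rec_off;	/* where the new record will live */
	off_t		idx_off;	/* which index slot will be rewritten */
};

static void
log_append(int log_fd, const struct logrec *lr)
{
	/* the only write that waits on the disk (log fd opened synchronous) */
	if (write(log_fd, lr, sizeof(*lr)) != (ssize_t)sizeof(*lr))
		err(1, "log write");
}

static void
logged_commit(int log_fd, int rec_fd, int idx_fd, struct logrec *lr,
    const void *rec, size_t reclen, const void *idx, size_t idxlen)
{
	lr->state = 0;				/* "intent" */
	log_append(log_fd, lr);

	/* write the new record: ordered, not individually committed */
	if (pwrite(rec_fd, rec, reclen, lr->rec_off) != (ssize_t)reclen)
		err(1, "pwrite record");

	lr->state = 1;				/* "record data valid" */
	log_append(log_fd, lr);			/* roll-forward point */

	/* rewrite the index: again ordered, not committed */
	if (pwrite(idx_fd, idx, idxlen, lr->idx_off) != (ssize_t)idxlen)
		err(1, "pwrite index");

	lr->state = 2;				/* transaction complete */
	log_append(log_fd, lr);
}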


> 2.  Our basic recorded unit is less than 512 bytes long.  We compromise
>     and round it (way) up to 512, since nobody makes fast disk drives
>     with sectors smaller than that anymore.  Yes, SCSI being what it
>     is, even 512 is way small.  We know...

Then this must be the write transaction unit.  It is too bad you are
not using a RAID 4 stripe set and writing exactly a full stripe at a
time with spindle sync.


> 3.  In case of system failure (most common reason today == O/S crash) we
>     must be back in operation within less than 10 seconds.  We do that by
>     sharing the disks with another system, which is already up.

???

You mean sharing the physical drives, or you mean a network share?  I'm
guessing physical sharing?

There are some not very general things you can do to make an OS boot
nearly instantly (I keep wanting them for private APM modes and for
system install).  For one, you could keep a log of system state, and
restore system state from the log, rather than booting normally.  This
requires the cooperation of the device drivers and certain parts of
the boot process.


> 4.  We need to process very large number of interrupts.  In fact, so
>     many that one FreeBSD CPU cannot keep up.  So, we are back to shared
>     disks.

I suspect you are using PCI controllers.  PCI does not support "fast
interrupts".  Contact Bruce Evans for details on how you can fix this.


> 5.  Because disks are shared, the write state must be very deterministic
>     at all times.  As O/S have caches, RAID controllers have caches,
>     disks have caches, we have to have some sense of who has what in 
>     which cache when.  Considering the O/S to be the most lossy element
>     in the system, we have to keep the amount of WRITE caches to a
>     minimum.

Unless they are non-volatile, anyway.


> > (zero locality of reference: a hard thing to find in the real world)
> > prevent the read-ahead from being invoked.
> 
> Ah!  there is a read-ahead on raw devices?  How do we shut it down?

There is read-ahead for any device which is accessed sequentially.  If
you do not access it sequentially, you will not trigger the read-ahead.
This is a non-problem (I think).

[ ... block size ... ]

> How does all this relate to raw/character devices?

It doesn't (see up top; I didn't think you really meant character
devices when you said "raw").  But neither does the original question,
then, since block size is largely irrelevant above device block size
granularity.  It will depend on the disk driver, and the controller
cache size, and whether or not the disk supports track write caching
itself.


> > > What we see is a flat WRITE response until 2K.  Then it starts a linear
> > > decline until it reaches 8K block size.  At this point it converges
> > > with READ performance.  The initial WRITE performance, for small blocks,
> > > is quite poor compared to READ.  We attribute it to the need to do
> > > read-modify-write when blocks are smaller than a certain ``natural block
> > > size'' (page?).
> > 
> > Yes.  But the FS block size is 8k, not pagesize (4k).
> 
> We were not using a filesystem.  That's the point.

Then it's undefined, and it's relative to the controller/disk combination
only.


> O_WRITESYNC!  This is an open(2) option that says that all writes are
> synchronous (do not return until actually done). Right?  And it applies 
> to block devices, as well as filesystem files.  Right?

Yes.  Internally it does the same thing you are doing, without the
additional trip back out across the protection domain and in again,
with the accompanying possibility of a context switch.
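
A minimal sketch, assuming the flag is spelled O_FSYNC on FreeBSD
(POSIX spells it O_SYNC) and using a hypothetical device node:

/*
 * Sketch: synchronous writes via the open(2) flag instead of an explicit
 * fsync(2) after every write.  The device name is hypothetical; FreeBSD
 * spells the flag O_FSYNC, POSIX spells it O_SYNC.
 */
#include <sys/types.h>

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#ifndef O_FSYNC
#define O_FSYNC	O_SYNC			/* fall back to the POSIX spelling */
#endif

int
main(void)
{
	static char blk[512];
	int fd;

	fd = open("/dev/sd0c", O_WRONLY | O_FSYNC);	/* hypothetical device */
	if (fd == -1)
		err(1, "open");

	memset(blk, 0, sizeof(blk));

	/* does not return until the data is actually on the device */
	if (write(fd, blk, sizeof(blk)) != (ssize_t)sizeof(blk))
		err(1, "write");

	close(fd);
	return (0);
}

Each write(2) then behaves like a write followed by an fsync(2), without
the extra system call round trip.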


> The ``only'' difference is an additional 200 system calls per second?  How
> many of these can a Pentium-Pro, 512K cache, 128MB RAM, etc. do in one
> second?
> We are always in the 1,000+ in our budget.  A 20% increase is a lot to us.

It depends on where you are bound up.  If all writes are synchronous, you
are bound up in disk I/O, not system call overhead.  If writes are being
guaranteed, and you don't force synchronicity to imply idempotence
across disk operations that aren't themselves atomic (i.e., index/data
relationships), then you may see the system call overhead.  I know
I have been on projects where this was important enough that we defined
our own system calls to combine write-then-read operations on networks,
I/O and stat operations, and pattern matching in the kernel so that
irrelevant data is not pushed back over the getdents interface, etc.


> > Most likely, you do not really need this, or you are poorly implementing
> > the two stage commit process typical of most modern database design.
> 
> Assumptions, assumptions... :-)  There is no database, there is no 2phase
> commit here.  Wish I could share more details in this forum, but I am 
> already stretching it :-(  

I'd have to say your synchronicity requirements are probably specious,
then.  What you really have are transaction ordering requirements, and,
as noted above, you don't have to have synchronicity to implement them.

> > > The READ performance is even more peculiar.  It starts higher than
> > > WRITE, declines rapidly until block size reaches 2K.  It peaks at 4K
> > > blocks and starts a linear decline from that point on (as block size 
> > > increases).
> > 
> > This is because of precache effects.  Your "random" reads are not
> > "random" enough to get rid of cache effects, it seems.  If they were,
> > the 4k numbers would be worse, and the peak would be the FS block size.
> 
> On a block device?  Which filesystem?

Well, disk block size, then.

> The same tests described here were run on a well known commercial OS.  It
> exhibits totally flat response from 512 bytes to 4Kb blocks. What happened
> at 8K blocks and larger?  The process will totally hang if you did
> read + (O_SYNC) write on the same file at the same time.  Cute.

Sounds like they have a single queue for locking operations on vnodes.

If I had to guess, your commercial OS was Solaris 2.x, x>=3.  I really
disagree with the way Solaris implements its SMP locking; it doesn't
scale well, it's not as concurrent as they'd like you to believe, and
it's hard for third parties to use.

I wish you could try it on a Unisys 6000/50 SVR4.0.2 ES/MP system; I
believe they did vnode locking correctly.


> > Jorg, Julian, and the specific SCSI driver authors are probably
> > your best resource below the bdevsw[] layer.
> 
> I appreciate that.  I have not seen anything in the SCSI layer that really
> ``cares'' about the type of I/O done.  It all appears the same.

In general, it's not supposed to care.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


