Date: Wed, 3 Apr 1996 12:47:11 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: luigi@labinfo.iet.unipi.it (Luigi Rizzo)
Cc: msmith@atrad.adelaide.edu.au, bde@zeta.org.au, davidg@Root.COM, dutchman@spase.nl, freebsd-hackers@FreeBSD.ORG
Subject: Re: HDD cpu usage (IDE vs. SCSI).
Message-ID: <199604031947.MAA19734@phaeton.artisoft.com>
In-Reply-To: <199604030950.LAA01500@labinfo.iet.unipi.it> from "Luigi Rizzo" at Apr 3, 96 11:50:08 am
> > Ah, so you want the system to know how each and every drive works,
> > and to keep track of actuator position and velocity, as well as the
> > rotational position of the media. Funny ha ha.
> >
> > The only part of the system in a position to make an _informed_
> > decision about which of several transactions is the easiest to
> > perform next is the disk. With ZBR, hidden geometry and 'invisible'
> > sector sparing, the OS doesn't have a hope. (Yes Terry, I know,
> > RAID-X)
>
> Agree. I wanted to mention this but forgot while writing the reply.
> Actually this raises the question of for how long the fs code will
> need or even benefit from trying to arrange data in a contiguous
> fashion (cylinder groups, log fs etc.).
>
> It is right that disks tend to hide features from the OS, but a bit
> of cooperation is certainly useful (as a minimum, I should be able to
> tell the disk "this block is likely to be accessed sequentially after
> block X").
>
> One last thing, with "invisible" sector sparing it's probably the
> user who doesn't have a hope, but this ought to be an infrequent
> occurrence.

You turn "invisible" sector sparing off for any serious application,
because we all know that it's not truly invisible... it introduces
delays when replacement sectors are accessed. Such delays are
catastrophic if you are attempting to speed up the system by striping
data (or, to a lesser extent, by optimistic replication). In most
cases it is better to mark the entire stripe set inaccessible than to
take the speed hit to spindle sync -- which means disk errors multiply
by the number of members in your stripe set, potentially causing much
larger areas of several disks to be inaccessible because of errors on
other disks. Which is what drive subsystem vendors are optimizing for
when they sell you "matched disk sets".

For serious applications, you don't use ZBR hardware. Or you do use
it, but with knowledge of when seeks will occur, either because you
enter the geometry independently, or because you use SCSI II devices.

Disks do things like reverse-ordering sectors, so when you request a
read of a sector on a given track, the drive positions the head to
that track and starts reading in reverse order until it gets to the
requested block. It can do this because block reads tend to have
higher locality: if you ask for block 'n', you are likely to ask for
block 'n+1'.

This would work better if you could transfer all block runs for a
given read to less volatile storage than the disk controller's track
buffer... i.e., "transfer up to X contiguous blocks beginning at block
'n', with a minimum of Y blocks to be transferred". This is an issue
of the controller, the disk, and the SCSI command set being slightly
inadequate for totally optimal behaviour (disks usually run local
track buffers, but their locality is limited by seeks, so unless you
optimize accesses to avoid seeks, you lose most of the benefit of
doing this -- with the exception of async writes within a single
track).

Given all that, it's *still* a good idea to order data so as to avoid
seeks, if at all possible. For SCSI II devices, even ZBR ones, it's
possible because the real geometry can be queried.

The problem we run into here is that the FFS layout policy assumes a
uniform size for cylinder groups; this means it's intrinsically a bad
design for seek optimization on ZBR devices, unless you happen to
overcommit bitmap information for all the smaller cylinders (maybe by
marking the non-existent blocks as preallocated).
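To make the overcommit idea concrete, here is a minimal sketch; the
cg_map structure, the BLKS_PER_CG size, and the function names are
invented for illustration and are not the actual FFS cylinder group
code:

    /*
     * Hypothetical sketch: keep every cylinder group's bitmap the same
     * (largest-zone) size, and permanently mark the blocks that do not
     * physically exist in a smaller inner zone as "allocated", so the
     * allocator never hands them out.
     */
    #include <string.h>

    #define BLKS_PER_CG 2048    /* uniform, sized for the largest zone */

    struct cg_map {
        unsigned char bits[BLKS_PER_CG / 8];  /* 1 bit per block, 1 = in use */
    };

    static void
    setbit_cg(struct cg_map *cg, unsigned b)
    {
        cg->bits[b >> 3] |= 1 << (b & 7);
    }

    /*
     * Initialize one group: clear the map, then permanently reserve the
     * tail of the bitmap that falls past the real capacity of this zone.
     */
    void
    cg_init(struct cg_map *cg, unsigned real_blocks_in_zone)
    {
        unsigned b;

        memset(cg->bits, 0, sizeof(cg->bits));
        for (b = real_blocks_in_zone; b < BLKS_PER_CG; b++)
            setbit_cg(cg, b);   /* non-existent block: never allocatable */
    }

The allocator then never needs to know that the inner zones are
smaller; it simply finds the phantom blocks permanently "busy".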
Then you use absolute sector addressing at table-driven offsets (up to
8k of table for a large 1024-cylinder device; more likely on the order
of 16k, or 32k for safety, to give a 64-bit address for 4096 possible
cylinders -- per device). A rough sketch of such a table appears after
the signature.

I believe Auspex and several higher-end RAID vendors actually do this
today, when they can control the FS layout on the disk, etc., at that
level. They run modified FFS, or even proprietary FS's optimized for
modern devices.

All in all, a very interesting problem. 8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
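As a rough illustration of the table-driven offset scheme described
above: one 64-bit absolute sector offset per cylinder comes to
1024 * 8 = 8k of table for a 1024-cylinder device, or 4096 * 8 = 32k
for 4096 cylinders, matching the sizes quoted. The names and the
fixed 4096-cylinder bound below are assumptions for illustration, not
taken from any real driver:

    #include <stdint.h>

    #define MAX_CYLS 4096                 /* 4096 * 8 bytes = 32k of table */

    struct zbr_map {
        uint64_t cyl_start[MAX_CYLS];     /* absolute first sector of each cylinder */
        unsigned ncyls;                   /* cylinders actually present */
    };

    /* Translate (cylinder, offset within that cylinder) to an absolute sector. */
    static inline uint64_t
    zbr_abs_sector(const struct zbr_map *m, unsigned cyl, unsigned offset)
    {
        return m->cyl_start[cyl] + offset;
    }

With such a table filled in from the queried (real) geometry, layout
code can place data by absolute sector without assuming that every
cylinder holds the same number of sectors.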