Date: Wed, 3 Apr 1996 12:47:11 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: luigi@labinfo.iet.unipi.it (Luigi Rizzo)
Cc: msmith@atrad.adelaide.edu.au, bde@zeta.org.au, davidg@Root.COM, dutchman@spase.nl, freebsd-hackers@FreeBSD.ORG
Subject: Re: HDD cpu usage (IDE vs. SCSI).
Message-ID: <199604031947.MAA19734@phaeton.artisoft.com>
In-Reply-To: <199604030950.LAA01500@labinfo.iet.unipi.it> from "Luigi Rizzo" at Apr 3, 96 11:50:08 am
> > Ah, so you want the system to know how each and every drive works,
> > and to keep track of actuator position and velocity, as well as the
> > rotational position of the media. Funny ha ha.
> >
> > The only part of the system in a position to make an _informed_
> > decision about which of several transactions is the easiest to
> > perform next is the disk. With ZBR, hidden geometry and 'invisible'
> > sector sparing, the OS doesn't have a hope. (Yes Terry, I know,
> > RAID-X)
>
> Agree. I wanted to mention this but forgot while writing the reply.
> Actually this raises the question of for how long the fs code will
> need or even benefit from trying to arrange data in a contiguous
> fashion (cylinder groups, log fs etc.).
>
> It is right that disks tend to hide features from the OS, but a bit
> of cooperation is certainly useful (as a minimum, I should be able to
> tell the disk "this block is likely to be accessed sequentially after
> block X").
>
> One last thing, with "invisible" sector sparing it's probably the
> user who doesn't have a hope, but this ought to be an infrequent
> occurrence.

You turn "invisible" sector sparing off for any serious application,
because we all know that it's not truly invisible... it introduces
delays when replacement sectors are accessed. Such delays are
catastrophic if you are attempting to speed up the system by striping
data (or, to a lesser extent, by optimistic replication). In most
cases it is better to mark the entire stripe set inaccessible than to
take the speed hit to spindle sync -- which means disk errors multiply
by the number of members in your stripe set, potentially causing much
larger areas of several disks to be inaccessible because of errors on
other disks. Which is what drive subsystem vendors are optimizing for
when they sell you "matched disk sets".

For serious applications, you don't use ZBR hardware. Or you do use
it, but with knowledge of when seeks will occur, either because you
enter the geometry independently, or because you use SCSI II devices.

Disks do things like reverse-ordering sectors, so when you request a
read of a sector on a given track, the drive positions the head to
that track and starts reading in reverse order until it gets to the
requested block. It can do this because block reads tend to have
higher locality: if you ask for block 'n', you are likely to ask for
block 'n+1'.

This would work better if you could transfer all block runs for a
given read to less volatile storage than the disk controller's track
buffer... i.e., "transfer up to X contiguous blocks beginning at block
'n', with a minimum of Y blocks to be transferred". This is an issue
of the controller, the disk, and the SCSI command set being slightly
inadequate for totally optimal behaviour (disks usually run local
track buffers, but their locality is limited by seeks, so unless you
optimize accesses to avoid seeks, you lose most of the benefit of
doing this -- with the exception of async writes within a single
track).

Given all that, it's *still* a good idea to order data so as to avoid
seeks, if at all possible. For SCSI II devices, even ZBR ones, it's
possible because the real geometry can be queried.

The problem we run into here is that the FFS layout policy assumes a
uniform size for cylinder groups; this means it's intrinsically a bad
design for seek optimization on ZBR devices, unless you happen to
overcommit bitmap information for all the smaller cylinders (maybe by
marking the non-existent blocks as preallocated).
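To make the overcommit idea concrete, here is a minimal sketch; the
cg_map structure, the BLKS_PER_CG size, and the function names are
invented for illustration and are not the actual FFS cylinder group
code:

    /*
     * Hypothetical sketch: keep every cylinder group's bitmap the same
     * (largest-zone) size, and permanently mark the blocks that do not
     * physically exist in a smaller inner zone as "allocated", so the
     * allocator never hands them out.
     */
    #include <string.h>

    #define BLKS_PER_CG 2048    /* uniform, sized for the largest zone */

    struct cg_map {
        unsigned char bits[BLKS_PER_CG / 8];  /* 1 bit per block, 1 = in use */
    };

    static void
    setbit_cg(struct cg_map *cg, unsigned b)
    {
        cg->bits[b >> 3] |= 1 << (b & 7);
    }

    /*
     * Initialize one group: clear the map, then permanently reserve the
     * tail of the bitmap that falls past the real capacity of this zone.
     */
    void
    cg_init(struct cg_map *cg, unsigned real_blocks_in_zone)
    {
        unsigned b;

        memset(cg->bits, 0, sizeof(cg->bits));
        for (b = real_blocks_in_zone; b < BLKS_PER_CG; b++)
            setbit_cg(cg, b);   /* non-existent block: never allocatable */
    }

The allocator then never needs to know that the inner zones are
smaller; it simply finds the phantom blocks permanently "busy".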
Then you use absolute sector addressing at table-driven offsets (up to
8k of table for a large 1024-cylinder device; more likely on the order
of 16k, or 32k for safety, to give a 64-bit address for 4096 possible
cylinders -- per device). A rough sketch of such a table appears after
the signature.

I believe Auspex and several higher-end RAID vendors actually do this
today, when they can control the FS layout on the disk, etc., at that
level. They run modified FFS, or even proprietary FS's optimized for
modern devices.

All in all, a very interesting problem. 8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
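As a rough illustration of the table-driven offset scheme described
above: one 64-bit absolute sector offset per cylinder comes to
1024 * 8 = 8k of table for a 1024-cylinder device, or 4096 * 8 = 32k
for 4096 cylinders, matching the sizes quoted. The names and the
fixed 4096-cylinder bound below are assumptions for illustration, not
taken from any real driver:

    #include <stdint.h>

    #define MAX_CYLS 4096                 /* 4096 * 8 bytes = 32k of table */

    struct zbr_map {
        uint64_t cyl_start[MAX_CYLS];     /* absolute first sector of each cylinder */
        unsigned ncyls;                   /* cylinders actually present */
    };

    /* Translate (cylinder, offset within that cylinder) to an absolute sector. */
    static inline uint64_t
    zbr_abs_sector(const struct zbr_map *m, unsigned cyl, unsigned offset)
    {
        return m->cyl_start[cyl] + offset;
    }

With such a table filled in from the queried (real) geometry, layout
code can place data by absolute sector without assuming that every
cylinder holds the same number of sectors.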