From owner-freebsd-hackers Sat Apr 22 21:05:06 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.10/8.6.6) id VAA12779 for hackers-outgoing; Sat, 22 Apr 1995 21:05:06 -0700
Received: from cs.weber.edu (cs.weber.edu [137.190.16.16]) by freefall.cdrom.com (8.6.10/8.6.6) with SMTP id VAA12769 for ; Sat, 22 Apr 1995 21:05:05 -0700
Received: by cs.weber.edu (4.1/SMI-4.1.1) id AA10300; Sat, 22 Apr 95 21:55:59 MDT
From: terry@cs.weber.edu (Terry Lambert)
Message-Id: <9504230355.AA10300@cs.weber.edu>
Subject: Re: large filesystems/multiple disks [RAID]
To: rgrimes@gndrsh.aac.dev.com (Rodney W. Grimes)
Date: Sat, 22 Apr 95 21:55:59 MDT
Cc: jgreco@brasil.moneng.mei.com, freebsd-hackers@FreeBSD.org
In-Reply-To: <199504221758.KAA02014@gndrsh.aac.dev.com> from "Rodney W. Grimes" at Apr 22, 95 10:58:16 am
X-Mailer: ELM [version 2.4dev PL52]
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

[ ... striping code ... ]

> > Did you ever make any progress on this?  If not, I will (try to) look at
> > it, but I'd prefer that somebody that knows what the heck they're doing
> > down within the device driver code putz with it..  :-)
>
> Yes, I played with that code (in fact I have a kernel with /dev/ilv in
> it).  I never made it work completely.  Then I remembered the sys/dev/cd.c
> driver that came with 4.4 Lite and went and looked at it.  I also have
> that working (renamed to concat.c to eliminate the conflict) partially,
> enough to say that I took 2 4MB/sec drives and interleaved them and
> got a 5.2MB/sec transfer rate for reads (I can't write due to bugs)
> *without* spindle sync.

This is nearly spot-on the theoretical performance of 5.3333MB/sec for
two devices replacing a single device with a 100% random distribution of
stripes between the media (assuming the 4MB/sec and 5.2MB/sec numbers are
correct).  Congratulations!

For anyone who is interested, the expected speedup is +33% for two units
with N (N >= 2) outstanding operations, or +79% for three units with N
(N >= 3) outstanding operations.

> I have done a bunch of aggregate bandwidth testing now using from 1 to
> 4 NCR810 SCSI controllers on a P54C-90 and found I can actually hit
> 12-14MB/sec using 4MB/sec drives.  We seem to have a bottleneck in
> the ncr.c driver when trying to run multiple drives on one controller.
> I have run single drives on that controller at 6.6MB/sec, but two 4MB
> drives only get 5.3MB/sec.

This would be indicative of command queueing not working quite as
expected, or of there being a maximum of two outstanding requests at any
one time (5.3MB/sec is the maximum you would expect if you didn't have
drive interleave latency to consider).

> My first pass through concat.c was a ``mechanical conversion, just make
> the bloody thing compile and do *something*''.  I am now onto the task
> of actually going through it and cleaning it up to work correctly.

I don't think you will get better than your first shot for random I/O.

Unfortunately, the algorithm I have for calculating expected performance
using equivalent drive/interface combinations only works when the number
of requests to satisfy is less than or equal to the number of disks.
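As a rough illustration only, here is a brute-force enumeration of the
simplest possible model: each of R outstanding requests lands on one of
D identical drives at random, and the batch completes when the busiest
drive finishes.  It reproduces the 5.3333MB/sec two-drive figure, but it
is just a sketch of that simple model, not the algorithm referred to
above, and all of the names in it are made up.

/*
 * Simple striping model: R outstanding requests each land on one of D
 * identical drives at random; the batch completes when the busiest
 * drive finishes, so the expected aggregate throughput is
 * R / E[max per-drive load] times the rate of a single drive.
 */
#include <stdio.h>

#define	MAXDRIVES	16

static double
expected_speedup(int drives, int requests)
{
	long	assignments, i, n;
	double	expected_max;
	int	load[MAXDRIVES], j, max;

	assignments = 1;
	for (j = 0; j < requests; j++)
		assignments *= drives;

	expected_max = 0.0;
	for (i = 0; i < assignments; i++) {
		/* Enumerate one way the requests can fall on the drives. */
		for (j = 0; j < drives; j++)
			load[j] = 0;
		for (n = i, j = 0; j < requests; j++, n /= drives)
			load[n % drives]++;
		for (max = 0, j = 0; j < drives; j++)
			if (load[j] > max)
				max = load[j];
		expected_max += (double)max / (double)assignments;
	}
	return ((double)requests / expected_max);
}

int
main(void)
{
	double	speedup;

	/* Two 4MB/sec drives with two outstanding requests. */
	speedup = expected_speedup(2, 2);
	printf("speedup x%.4f -> %.4fMB/sec\n", speedup, 4.0 * speedup);
	return (0);
}

Run as shown, it prints a 1.3333 speedup, which is where the 5.3333MB/sec
number for two 4MB/sec drives comes from.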
> > Having recently seen Solaris' Online: DiskSuite, which suffers from
> > fairly significant performance degradations, I'm curious to see what
> > a real operating system can do.  ;-)
>
> It will be at least another week, but you'll know I have made serious
> progress when you see a cvs commit message for the import of
> sys/dev/concat.

For truly random stripe placement, there will be a potential for
performance degradation based on the file system mechanism used to
address the blocks themselves, and on whether that mechanism is a high
percentage of the overhead on the attempted I/O.

The other consideration is that you are not typically going to see the
performance increase unless you either split the drives between SCSI
controllers or actually get command queueing working.

Typically, I would expect that spindle sync would do nothing for you
unless your stripe lengths are on the order of a single cluster size,
and you divide the actual rotational latency of the drive by the number
of synced spindles before using it, then scale it by the relative sync
notification time added to the rotational period, adjusting all of this
for possible ZBR variations in the effective rotational period based on
distance from the spindle.  Then use round-robin allocation of
sequential blocks from disk to disk to ensure linear ordering of the
distribution.

Or, you could get complicated.  8^).


					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
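P.S.: A back-of-the-envelope sketch of the spindle sync arithmetic
described above.  Every number and parameter name here is made up for
illustration; this is an assumed reading of the description, not code
from any driver.

/*
 * Effective rotational latency seen by a striped transfer when the
 * stripe unit is a single cluster and the member spindles are synced.
 */
#include <stdio.h>

static double
effective_latency(double rot_latency_ms, double rot_period_ms,
    double sync_notify_ms, double zbr_scale, int nsynced)
{
	double	latency;

	/* Adjust for ZBR variation in the effective rotational period. */
	latency = rot_latency_ms * zbr_scale;

	/* Divide the rotational latency by the number of synced spindles. */
	latency /= (double)nsynced;

	/* Scale by the sync notification time added to the period. */
	return (latency * (rot_period_ms + sync_notify_ms) / rot_period_ms);
}

int
main(void)
{
	/* A 5400rpm drive: 11.1ms period, 5.6ms average latency. */
	printf("2 synced spindles: %.2fms effective latency\n",
	    effective_latency(5.6, 11.1, 0.5, 1.0, 2));
	return (0);
}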