From owner-freebsd-hackers Tue Mar  3 21:25:19 1998
Message-ID: <19980303232444.59397@mcs.net>
Date: Tue, 3 Mar 1998 23:24:44 -0600
From: Karl Denninger <karl@MCS.Net>
To: shimon@simon-shapiro.org
Cc: Wilko Bulte, sbabkin@dcn.att.com, tlambert@primenet.com,
        jdn@acp.qiv.com, blkirk@float.eli.net, hackers@FreeBSD.ORG,
        grog@lemis.com
Subject: Re: SCSI Bus redundancy...
References: <19980303200652.07366@mcs.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.84
In-Reply-To: ; from Simon Shapiro on Tue, Mar 03, 1998 at 09:23:10PM -0800
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Tue, Mar 03, 1998 at 09:23:10PM -0800, Simon Shapiro wrote:
> On 04-Mar-98 Karl Denninger wrote:
>
> > If the filesystem has news on it?  Forget it.  The small files blast
> > the hell out of restore (or pax, or anything else) during the creates -
> > even if mounted async during that operation.  It simply takes forever.
> > I've tried copying a 4G news spool disk before - get ready for a 12
> > hour wait.
>
> I always thought news is an excellent candidate for a heavily cached
> RAID-0, 8 stripes or better.  Some of my customers insist on RAID-5,
> which is very slow on WRITE.  Never understood why.  I figured that with
> 8 stripes on a RAID-0, you will lose /var/spool/news about every 18
> months.

The problem is getting enough cache to matter.  We have a 25GB RAID 0+1
news spool.  It's about half full right now, on its way upward (we keep
tuning expiration times).  There is basically ZERO locality of reference
by the readers, which means that you'd need at least a couple of GB of
RAM to make any difference at all.

The RAID adapter does help - a lot - primarily by striping the writes
and reads.  The Ultra SCSI bus ends up being the controlling factor.

> Failures will occur, but if a database is on ``raw disk'' which is
> RAID-{1,5} on a reliable adapter, you will not have much to complain
> about.  What I am nervous about is running RAID-{1,5} in the Unix
> kernel.  It makes the actual integrity of your disks dependent on the
> sound driver, the VGA driver, X11, PPP, etc.  Any bug in these will not
> only lay your filesystem to rest; it will take the ``disk'' with it.
> And I have yet to see an in-kernel RAID that can truly recover
> concurrently with O/S operation.  I am not talking performance, I am
> talking functionality at all.

Yep.  The other problem is that a kernel RAID *cannot* do writeback
caching.  If it does, you're f*d if the power goes out or the OS goes
down dirty.

The standalone controllers CAN do writeback, because they can have a
battery on board AND they keep running even if the host CPU dies.

RAID 5, in particular, benefits enormously from writeback, since it lets
the controller defer writes until an entire stripe is ready, which means
no read/compute/write cycle.  This is a monstrous win for performance.
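
To make that concrete - a toy sketch, not any particular controller's
firmware - RAID 5 parity is plain XOR.  A full-stripe write can compute
parity from the new data alone and issue zero reads, while rewriting a
single chunk in place forces two reads (old data, old parity) plus two
writes:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NDATA 4     /* data disks per stripe; one more holds parity */
    #define CHUNK 16    /* bytes per chunk; real arrays use 16K-64K     */

    /*
     * Full-stripe write: all NDATA chunks are in cache, so parity is
     * computed purely from the new data.  Zero reads, NDATA+1 writes.
     */
    static void
    full_stripe_parity(uint8_t data[NDATA][CHUNK], uint8_t parity[CHUNK])
    {
            int d, i;

            memset(parity, 0, CHUNK);
            for (d = 0; d < NDATA; d++)
                    for (i = 0; i < CHUNK; i++)
                            parity[i] ^= data[d][i];
    }

    /*
     * Partial-stripe update of one chunk: the controller must first READ
     * the old data and old parity, recompute, then WRITE both back - the
     * read/compute/write cycle.
     * new_parity = old_parity ^ old_data ^ new_data
     */
    static void
    rmw_parity(const uint8_t old_data[CHUNK],
        const uint8_t new_data[CHUNK], uint8_t parity[CHUNK])
    {
            int i;

            for (i = 0; i < CHUNK; i++)
                    parity[i] ^= old_data[i] ^ new_data[i];
    }

    int
    main(void)
    {
            uint8_t data[NDATA][CHUNK], parity[CHUNK], check[CHUNK];
            uint8_t newchunk[CHUNK];
            int d;

            for (d = 0; d < NDATA; d++)
                    memset(data[d], 'a' + d, CHUNK);
            full_stripe_parity(data, parity);   /* 0 reads, 5 writes */

            /* Overwrite one chunk the expensive way: 2 reads, 2 writes. */
            memset(newchunk, 'z', CHUNK);
            rmw_parity(data[2], newchunk, parity);
            memcpy(data[2], newchunk, CHUNK);

            /* Recomputing from scratch must match the updated parity. */
            full_stripe_parity(data, check);
            printf("parity %s\n",
                memcmp(parity, check, CHUNK) ? "BAD" : "ok");
            return (0);
    }

A battery-backed writeback cache is exactly what lets the controller sit
on the partial writes until the stripe fills, collapsing the
read-modify-write case into the full-stripe case above.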

> >> True.  Your performance also goes up with the smaller drives.  You
> >> can stripe better.  I think I mentioned it before in this forum; most
> >> DBMS benchmarks only use 300MB of the disk.  This is sort of the
> >> ``sweet spot'' between system cost and performance.
> >
> > To a point this is true.  The problem is that the smaller disks rotate
> > slower, and have lower bit densities.
>
> Yup.  This is where you see a benchmark machine using 200 4GB drives for
> a database of 50GB.  Nobody can afford such a machine, but benchmarks
> being what they are...

Yep.

> > There is a tradeoff between seek latency and transfer time.  If there
> > are lots of small files, the huge number of small disks wins big.  If
> > there are a few large files, the small number of disks with speed on
> > the physical I/O wins, provided you can seek sequentially.
>
> Yes, but they all end up on a SCSI bus.  The great equalizer.  At least
> I do not feel so lonely in my views any more :-)

Well, yes.  The best I've seen off our RAID systems right now is about
11MB/sec (that's megaBYTES, not bits).  That's on an Ultra host bus, with
two Ultra busses going to the RAID disks.  Neither the disk buses nor the
RAID controller CPU are saturated.

I believe this is pretty much the wall on one SCSI channel, at least with
16 SCBs.  I'm going to try it with SCBPAGING turned on and see if that
helps, but for sequential reads it probably won't matter much.

I could run two host channels on this thing across two RAID sets into two
Adaptec adapters.  That might be a big win.

I suspect the bottleneck at this point is in the AIC code, the bus
itself, or the interrupt latency on DMA completion.  There is no
appreciable difference between running at 40MB/sec (Ultra full-bore) and
20MB/sec, which suggests the hold-up is in the Adaptec microcode, the
driver, and/or the Adaptec/PCI bus interface.

--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message