From: Scott Long <scottl@freebsd.org>
Date: Mon, 25 Oct 2004 16:41:51 -0600
To: Charles Swiger
Cc: freebsd-current@freebsd.org
Subject: Re: FreeBSD 5.3b7 and poor ata performance

Charles Swiger wrote:
> On Oct 25, 2004, at 5:39 PM, Brad Knowles wrote:
>
>> At 3:25 PM -0600 2004-10-25, Scott Long wrote:
>>
>>> But as was said, there is always a performance vs. reliability
>>> tradeoff.
>>
>> Well, more like "Pick two: performance, reliability, price" ;)
>
> That sounds familiar. :-)
>
> If you prefer... ...consider using:
> ----------------------------------------------
> performance, reliability: RAID-1 mirroring
> performance, cost:        RAID-0 striping
> reliability, performance: RAID-1 mirroring (+ hot spare, if possible)
> reliability, cost:        RAID-5 (+ hot spare)
> cost, reliability:        RAID-5
> cost, performance:        RAID-0 striping

It's more complex than that.  Are you talking about software RAID, PCI
RAID, or external RAID?  That affects all three quite a bit.  Also, how
do you define reliability?  Do you verify reads on RAID-1 and RAID-5?
And what about error recovery?

>>> And when you are talking about RAID-10 with a bunch of disks, you
>>> will indeed start seeing bottlenecks in the bus.
>>
>> When you're talking about using a lot of disks, that's going to be
>> true for any disk subsystem that you're trying to get a lot of
>> performance out of.
>
> That depends on your hardware, of course. :-)
>
> There's a Sun E450 with ten disks over 5 SCSI channels in the room
> next door: one UW channel native on the MB, and two U160 channels
> apiece from two dual-channel cards which come with each 8-drive-bay
> extender kit.  It's running Solaris and DiskSuite (ODS) now, but it
> would be interesting to put FreeBSD on it and see how that does, if I
> ever get the chance.
>
>> The old rule was that if you had more than four disks per channel,
>> you were probably hitting saturation.  I don't know if that specific
>> rule-of-thumb is still valid, but I'd be surprised if disk controller
>> performance hasn't roughly kept up with disk performance over time.
>
> That rule dates back to the early days of SCSI-2, where you could fit
> about four drives' worth of aggregate throughput over a 40 MB/s
> ultra-wide bus.  The idea behind it is still sound, although the
> number of drives you can fit obviously changes depending on whether
> you talk about ATA-100 or SATA-150.

The formula here is simple: at most 2 drives per channel for ATA, 1 for
SATA.  So the channel transport starts becoming irrelevant now (except
when you talk about SAS and having bonded channels going to switches).
The limiting factor again becomes PCI.  An easy example is the software
RAID cards that are based on the Marvell 8-channel SATA chip.  It can
drive all 8 drives at max platter speed if you have enough PCI
bandwidth (and I've tested this recently with FreeBSD 5.3, getting
more than 200 MB/s across 4 drives).  However, you're talking about
PCI-X-100 bandwidth at that point, which is not what most people have
in their desktop systems.
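To put rough numbers on why PCI becomes the wall (ballpark figures
only, not measurements: I'm assuming ~60 MB/s sustained per drive,
which is in line with the 200+ MB/s over 4 drives above, and quoting
theoretical bus peaks), a quick back-of-envelope comparison looks like
this:

/*
 * Back-of-envelope only: the per-drive and bus numbers below are
 * assumed ballpark/theoretical figures, not measured results.
 */
#include <stdio.h>

int
main(void)
{
        const double drive_mbs   = 60.0;   /* assumed sustained MB/s per drive */
        const double pci_mbs     = 133.0;  /* 32-bit/33 MHz PCI, theoretical peak */
        const double pcix100_mbs = 800.0;  /* 64-bit/100 MHz PCI-X, theoretical peak */
        int n;

        for (n = 1; n <= 8; n++)
                printf("%d drives ~ %3.0f MB/s aggregate (plain PCI: %s, PCI-X-100: %s)\n",
                    n, n * drive_mbs,
                    n * drive_mbs > pci_mbs ? "saturated" : "fits",
                    n * drive_mbs > pcix100_mbs ? "saturated" : "fits");
        return (0);
}

Eight drives land around 480 MB/s, far beyond plain 32-bit/33 MHz PCI
(~133 MB/s theoretical) but well within 64-bit/100 MHz PCI-X
(~800 MB/s).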
And for reliability reasons, I wouldn't base server-class storage on
software RAID for anything other than mirroring the boot drive, so that
a failure there doesn't immediately bring you down.

Anyway, it sounds like the original poster found that at least part of
the problem was with his local ATA setup.

In the longer term, I'd like to see people who care about performance
focus on things like I/Os per second, not raw bandwidth.  As I
mentioned above, I've seen that a software RAID driver on FreeBSD can
sustain line rate with the drives on large transfers.  That makes
sense, because the overhead of setting up the DMA is dwarfed by the
time it takes to do the DMA.

I'd also like to see more 'apples-to-apples' comparisons.  It doesn't
mean a whole lot to say, for example, that software RAID on SCSI
doesn't perform as well as a single ATA drive, regardless of how
'common sense' that argument might sound.  The performance
characteristics of ATA and SCSI really are quite different.  With SCSI
you get the ability to do lots of parallel requests via tagged
queueing, and ATA just can't touch that.  With ATA you tend to get
large caches and aggressive read-ahead, so sequential performance is
always good.  In my opinion those qualities can have a detrimental
impact on reliability, but again, my focus has always been on
reliability first.

What is interesting is measuring how many single-sector transfers can
be done per second and how much CPU that consumes.  I used to be able
to get about 11,000 io/s on an aac card on a 5.2-CURRENT system from
last winter.  Now I can only get about 7,000.  I'm not sure where the
problem is yet, unfortunately.  I'm using KSE pthreads to generate a
lot of parallel requests with as little overhead as possible, so maybe
something there has changed, or maybe something in the I/O path above
the driver has changed, or maybe something in interrupt handling or
scheduling has changed.
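For reference, the kind of load generator I mean can be as simple as
the sketch below.  This is a minimal illustration, not the actual test
program; the device path, thread count, and run time are placeholders,
and it assumes a raw disk device that you can safely read from.

/*
 * Single-sector random-read load generator (sketch).
 *
 *   cc -o sectorload sectorload.c -pthread
 *   ./sectorload /dev/ad0 16 10     (device, threads, seconds: placeholders)
 */
#include <sys/types.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SECTOR  512
#define SPAN    (1024ULL * 1024 * 1024)         /* seek across the first 1 GB */

static volatile int stop;
static int fd;

struct worker {
        pthread_t       tid;
        unsigned long   ios;
};

static void *
io_loop(void *arg)
{
        struct worker *w = arg;
        char buf[SECTOR];
        off_t off;

        while (!stop) {
                /* pick a random sector-aligned offset and read one sector */
                off = (off_t)(arc4random() % (SPAN / SECTOR)) * SECTOR;
                if (pread(fd, buf, SECTOR, off) != SECTOR)
                        break;
                w->ios++;
        }
        return (NULL);
}

int
main(int argc, char **argv)
{
        struct worker *w;
        unsigned long total = 0;
        int i, nthreads, seconds;

        if (argc != 4 || (nthreads = atoi(argv[2])) < 1 ||
            (seconds = atoi(argv[3])) < 1) {
                fprintf(stderr, "usage: %s device nthreads seconds\n", argv[0]);
                return (1);
        }
        if ((fd = open(argv[1], O_RDONLY)) < 0) {
                perror(argv[1]);
                return (1);
        }
        w = calloc(nthreads, sizeof(*w));
        for (i = 0; i < nthreads; i++)
                pthread_create(&w[i].tid, NULL, io_loop, &w[i]);

        sleep(seconds);
        stop = 1;

        for (i = 0; i < nthreads; i++) {
                pthread_join(w[i].tid, NULL);
                total += w[i].ios;
        }
        printf("%lu single-sector reads in %d s = %lu io/s\n",
            total, seconds, total / seconds);
        close(fd);
        return (0);
}

Run something like that against the raw device rather than a file on a
mounted file system, and watch systat -vmstat alongside it to see where
the CPU time is going.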
It would be interesting to figure this out, since this definitely shows
a problem.

Scott