From owner-freebsd-hackers Thu Nov 12 16:48:51 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8)
	id QAA23016 for freebsd-hackers-outgoing; Thu, 12 Nov 1998 16:48:51 -0800 (PST)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from papillon.lemis.com (papillon.lemis.com [192.109.197.159])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id QAA23001
	for ; Thu, 12 Nov 1998 16:48:41 -0800 (PST)
	(envelope-from grog@freebie.lemis.com)
Received: from freebie.lemis.com (freebie.lemis.com [192.109.197.137])
	by papillon.lemis.com (8.9.1/8.6.12) with ESMTP id SAA02288;
	Thu, 12 Nov 1998 18:44:46 +1030 (CST)
Received: (from grog@localhost) by freebie.lemis.com (8.9.1/8.9.0)
	id SAA16347; Thu, 12 Nov 1998 18:45:12 +1030 (CST)
Message-ID: <19981112184509.K463@freebie.lemis.com>
Date: Thu, 12 Nov 1998 18:45:09 +1030
From: Greg Lehey
To: Bernd Walter, Mike Smith, hackers@FreeBSD.ORG
Subject: Re: [Vinum] Stupid benchmark: newfsstone
References: <199811100638.WAA00637@dingo.cdrom.com>
	<19981111103028.L18183@freebie.lemis.com> <19981111040654.07145@cicely.de>
	<19981111134546.D20374@freebie.lemis.com> <19981111085152.55040@cicely.de>
	<19981111183546.D20849@freebie.lemis.com> <19981111194157.06719@cicely.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.91.1i
In-Reply-To: <19981111194157.06719@cicely.de>; from Bernd Walter on
	Wed, Nov 11, 1998 at 07:41:57PM +0100
WWW-Home-Page: http://www.lemis.com/~grog
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-41-739-7062
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wednesday, 11 November 1998 at 19:41:57 +0100, Bernd Walter wrote:
> On Wed, Nov 11, 1998 at 06:35:46PM +1030, Greg Lehey wrote:
>> On Wednesday, 11 November 1998 at 8:51:52 +0100, Bernd Walter wrote:
>>> On Wed, Nov 11, 1998 at 01:45:46PM +1030, Greg Lehey wrote:
>>>> On Wednesday, 11 November 1998 at
4:06:54 +0100, Bernd Walter wrote:
>>>>> On Wed, Nov 11, 1998 at 10:30:28AM +1030, Greg Lehey wrote:
>>>>>> On Monday, 9 November 1998 at 22:38:04 -0800, Mike Smith wrote:
>>>>> [...]
>>>>> One point is that it doesn't aggregate transactions to the lower
>>>>> drivers.  When using stripes of one sector, it does no more than
>>>>> single-sector transactions to the disks, so at least with the old
>>>>> SCSI driver there's no linear performance increase.  That's the
>>>>> same with ccd.
>>>>
>>>> Correct, at least as far as Vinum goes.  The rationale for this is
>>>> that, with significant extra code, Vinum could aggregate transfers
>>>> *from a single user request* in this manner.  But any request that
>>>> gets this far (in other words, spans more than a complete stripe)
>>>> is going to convert one user request into n disk requests.
>>>> There's no good reason to do this, and the significant extra code
>>>> would just chop off the tip of the iceberg.  The solution is in
>>>> the hands of the user: don't use small stripe sizes.  I recommend
>>>> a stripe size of between 256 and 512 kB.
>>>
>>> That's good for a random-access performance increase, but for
>>> linear access a smaller stripe size is the only way to get the
>>> maximum performance of all the disks together.
>>
>> No, the kind of stripe size you're thinking about will almost
>> always degrade performance.  If you're accessing large quantities
>> of data in a linear fashion, you'll be reading 60 kB at a time.  If
>> each of these reads requires accessing more than one disk, you'll
>> kill performance.  Try it: I have.
>
> With aggregation?

No, with less than a full stripe transferred.

> Say you read the volume linearly without any other activity on the
> disks.  If you have a stripe size of 60 kB and read in 60 kB chunks,
> each read will touch only one disk, assuming all transactions are
> stripe aligned.
> The only thing which will increase performance is the readahead
> ability of the fs driver and the disks themselves, at least if I
> haven't missed anything.

Right.

> If you use 512-byte stripes and read 60 kB chunks, the current
> situation is that each drive gets single-sector transactions, which
> is often slower than a single disk.

That would definitely be slower.

> What I expect is that with aggregation, such a 60 kB access on the
> volume is split into only one transaction per drive, so you can
> read from all the drives at the same time and get a bandwidth
> increase.

OK, so you want to have four 15 kB reads, and you expect a performance
improvement because of it.  Let's consider the hardware: a good modern
disk has a transfer rate of 10 MB/s and a rotational speed of
7200 rpm.  Let's look at the times involved:

                 rotational latency   transfer time   total
  1 disk/60 kB         4.2 ms             6 ms        10.2 ms
  4 disks/15 kB        7.8 ms           1.5 ms         9.3 ms

Huh?  Why the difference in rotational latency?  If you're reading
from one disk, on average you'll have a half track latency.  For two
disks, on average one is half a track off from the other, so you'll
have a latency of .75 of a track.  With three drives it's .875, and
with four drives it's .9375 of a track.  Still, in this case (the
largest possible block size, and only 4 disks), you win--barely.
Let's look at a more typical case: 16 kB.

                 rotational latency   transfer time   total
  1 disk/16 kB         4.2 ms           1.6 ms         5.8 ms
  4 disks/4 kB         7.8 ms            .4 ms         8.2 ms

Most transfers are 16 kB or less.  What really kills you is the lack
of spindle synchronization between the disks.  If they were
synchronized, that would be fine, but it's more complicated than it
looks: you'd need identical disks with an identical layout (subdisks
in the same place on each disk).  And it's almost impossible to find
spindle-synchronized disks nowadays.

Finally, aggregating involves a scatter/gather approach which, unless
I've missed something, is not supported at a hardware level.
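The figures in the two tables above can be reproduced with a short
calculation.  This is a sketch, not Vinum code: it assumes the
7200 rpm and 10 MB/s figures quoted above, takes 10 MB/s as
10,000 kB/s, and uses a rotational latency of (1 - 2^-n) of a
revolution when waiting on n unsynchronized disks, following the
halving argument above.

```python
# Sketch: reproduce the latency tables above from the stated assumptions
# (7200 rpm, 10 MB/s taken as 10,000 kB/s, latency of (1 - 2**-n)
# revolutions for n unsynchronized disks).

ROTATION_MS = 60_000 / 7200   # one revolution at 7200 rpm: ~8.33 ms
MS_PER_KB = 1000 / 10_000     # transfer time per kB at 10,000 kB/s

def total_time_ms(n_disks: int, kb_per_disk: float) -> float:
    rotational = ROTATION_MS * (1 - 2.0 ** -n_disks)
    transfer = kb_per_disk * MS_PER_KB
    return rotational + transfer

for label, n, kb in [("1 disk/60 kB", 1, 60), ("4 disks/15 kB", 4, 15),
                     ("1 disk/16 kB", 1, 16), ("4 disks/4 kB", 4, 4)]:
    print(f"{label}: {total_time_ms(n, kb):.1f} ms")
# 1 disk/60 kB: 10.2 ms    4 disks/15 kB: 9.3 ms
# 1 disk/16 kB: 5.8 ms     4 disks/4 kB: 8.2 ms
```

The crossover is visible directly: splitting a transfer only pays when
the saved transfer time exceeds the extra rotational latency, which a
16 kB request never manages.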
Each request to the driver specifies one buffer for the transfer, so
the scatter/gather would have to be done by allocating more memory,
performing the transfer there (for a read), and then copying the data
to the correct place.

I have thought about aggregating in the manner you describe, and to a
certain extent I feel it's a copout not to do so.  But I hope you now
see that it doesn't really make sense in this context.

Greg
--
See complete headers for address, home page and phone numbers
finger grog@lemis.com for PGP public key

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
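The stripe-to-disk mapping underlying the whole argument can be
sketched in a few lines.  This is hypothetical illustration code, not
Vinum's implementation: the round-robin layout, zero-based disk
numbering, and stripe-aligned subdisks are simplifying assumptions.
It shows why a 60 kB read stays on one disk with the recommended large
stripes but is chopped across all four drives with 512-byte stripes.

```python
# Hypothetical sketch (not Vinum's code): map a byte range on a striped
# volume to the set of disks it touches, assuming a simple round-robin
# layout of stripe units across n_disks drives.

def disks_touched(offset: int, length: int, stripe: int, n_disks: int):
    """Disk indices hit by a read of [offset, offset + length)."""
    first = offset // stripe                  # first stripe unit touched
    last = (offset + length - 1) // stripe    # last stripe unit touched
    return {unit % n_disks for unit in range(first, last + 1)}

KB = 1024
# 256 kB stripes: a stripe-aligned 60 kB read stays on one disk.
print(disks_touched(0, 60 * KB, 256 * KB, 4))     # {0}
# 512-byte stripes: the same read hits all four disks, as a string of
# single-sector transactions on each.
print(sorted(disks_touched(0, 60 * KB, 512, 4)))  # [0, 1, 2, 3]
```

With 512-byte stripes the 60 kB read becomes 120 stripe units, 30 per
drive, which is exactly the single-sector traffic the quoted text
complains about.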