From owner-freebsd-hackers Thu Nov 12 16:48:51 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8)
	id QAA23016 for freebsd-hackers-outgoing; Thu, 12 Nov 1998 16:48:51 -0800 (PST)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from papillon.lemis.com (papillon.lemis.com [192.109.197.159])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id QAA23001
	for ; Thu, 12 Nov 1998 16:48:41 -0800 (PST)
	(envelope-from grog@freebie.lemis.com)
Received: from freebie.lemis.com (freebie.lemis.com [192.109.197.137])
	by papillon.lemis.com (8.9.1/8.6.12) with ESMTP id SAA02288;
	Thu, 12 Nov 1998 18:44:46 +1030 (CST)
Received: (from grog@localhost) by freebie.lemis.com (8.9.1/8.9.0)
	id SAA16347; Thu, 12 Nov 1998 18:45:12 +1030 (CST)
Message-ID: <19981112184509.K463@freebie.lemis.com>
Date: Thu, 12 Nov 1998 18:45:09 +1030
From: Greg Lehey
To: Bernd Walter, Mike Smith, hackers@FreeBSD.ORG
Subject: Re: [Vinum] Stupid benchmark: newfsstone
References: <199811100638.WAA00637@dingo.cdrom.com>
	<19981111103028.L18183@freebie.lemis.com> <19981111040654.07145@cicely.de>
	<19981111134546.D20374@freebie.lemis.com> <19981111085152.55040@cicely.de>
	<19981111183546.D20849@freebie.lemis.com> <19981111194157.06719@cicely.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.91.1i
In-Reply-To: <19981111194157.06719@cicely.de>; from Bernd Walter on
	Wed, Nov 11, 1998 at 07:41:57PM +0100
WWW-Home-Page: http://www.lemis.com/~grog
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-41-739-7062
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wednesday, 11 November 1998 at 19:41:57 +0100, Bernd Walter wrote:
> On Wed, Nov 11, 1998 at 06:35:46PM +1030, Greg Lehey wrote:
>> On Wednesday, 11 November 1998 at 8:51:52 +0100, Bernd Walter wrote:
>>> On Wed, Nov 11, 1998 at 01:45:46PM +1030, Greg Lehey wrote:
>>>> On Wednesday, 11 November 1998 at
4:06:54 +0100, Bernd Walter wrote:
>>>>> On Wed, Nov 11, 1998 at 10:30:28AM +1030, Greg Lehey wrote:
>>>>>> On Monday, 9 November 1998 at 22:38:04 -0800, Mike Smith wrote:
>>>>> [...]
>>>>> One point is that it doesn't aggregate transactions to the lower
>>>>> drivers.  When using stripes of one sector, it does no more than
>>>>> single-sector transactions to the disks, so at least with the old
>>>>> SCSI driver there's no linear performance increase.  That's the
>>>>> same with ccd.
>>>>
>>>> Correct, at least as far as Vinum goes.  The rationale for this is
>>>> that, with significant extra code, Vinum could aggregate transfers
>>>> *from a single user request* in this manner.  But any request that
>>>> gets this far (in other words, spans more than a complete stripe)
>>>> is going to convert one user request into n disk requests.
>>>> There's no good reason to do this, and the significant extra code
>>>> would just chop off the tip of the iceberg.  The solution is in
>>>> the hands of the user: don't use small stripe sizes.  I recommend
>>>> a stripe size of between 256 and 512 kB.
>>>
>>> That's good for a random-access performance increase, but for
>>> linear access a smaller stripe size is the only way to get the
>>> maximum performance of all the disks together.
>>
>> No, the kind of stripe size you're thinking about will almost
>> always degrade performance.  If you're accessing large quantities
>> of data in a linear fashion, you'll be reading 60 kB at a time.  If
>> each of these reads requires accessing more than one disk, you'll
>> kill performance.  Try it: I have.
>
> With aggregation?

No, with less than a full stripe transferred.

> Say you read the volume linearly without any other activity on the
> disks.  If you have a stripe size of 60 kB and read in 60 kB chunks,
> each read will touch only one disk, assuming all transactions are
> stripe aligned.
> The only thing which will increase performance is the readahead
> ability of the fs driver and the disks themselves, at least if I
> haven't missed anything.

Right.

> If you use 512-byte stripes and read 60 kB chunks, the current
> situation is that each drive gets single-sector transactions, which
> is often slower than a single disk.

That would definitely be slower.

> What I expect is that with aggregation, such a 60 kB access on the
> volume is split into only one transaction per drive, so you can
> read from all the drives at the same time and get a bandwidth
> increase.

OK, so you want to have four 15 kB reads, and you expect a performance
improvement because of it.  Let's consider the hardware: a good modern
disk has a transfer rate of 10 MB/s and a rotational speed of
7200 rpm.  Let's look at the times involved:

                 rotational latency   transfer time   total
  1 disk/60 kB         4.2 ms             6 ms        10.2 ms
  4 disks/15 kB        7.8 ms           1.5 ms         9.3 ms

Huh?  Why the difference in rotational latency?  If you're reading
from one disk, on average you'll have a half track latency.  For two
disks, on average one is half a track off from the other, so you'll
have a latency of .75 of a track.  With three drives it's .875, and
with four drives it's .9375 of a track.  Still, in this case (the
largest possible block size, and only 4 disks), you win--barely.
Let's look at a more typical case: 16 kB.

                 rotational latency   transfer time   total
  1 disk/16 kB         4.2 ms           1.6 ms         5.8 ms
  4 disks/4 kB         7.8 ms            .4 ms         8.2 ms

Most transfers are 16 kB or less.  What really kills you is the lack
of spindle synchronization between the disks.  If they were
synchronized, that would be fine, but it's more complicated than it
looks: you'd need identical disks with an identical layout (subdisks
in the same place on each disk).  And it's almost impossible to find
spindle-synchronized disks nowadays.

Finally, aggregating involves a scatter/gather approach which, unless
I've missed something, is not supported at a hardware level.
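The figures in the two tables above can be reproduced with a short
calculation.  This is a sketch, not Vinum code: it assumes the
7200 rpm and 10 MB/s figures quoted above, takes 10 MB/s as
10,000 kB/s, and uses a rotational latency of (1 - 2^-n) of a
revolution when waiting on n unsynchronized disks, following the
halving argument above.

```python
# Sketch: reproduce the latency tables above from the stated assumptions
# (7200 rpm, 10 MB/s taken as 10,000 kB/s, latency of (1 - 2**-n)
# revolutions for n unsynchronized disks).

ROTATION_MS = 60_000 / 7200   # one revolution at 7200 rpm: ~8.33 ms
MS_PER_KB = 1000 / 10_000     # transfer time per kB at 10,000 kB/s

def total_time_ms(n_disks: int, kb_per_disk: float) -> float:
    rotational = ROTATION_MS * (1 - 2.0 ** -n_disks)
    transfer = kb_per_disk * MS_PER_KB
    return rotational + transfer

for label, n, kb in [("1 disk/60 kB", 1, 60), ("4 disks/15 kB", 4, 15),
                     ("1 disk/16 kB", 1, 16), ("4 disks/4 kB", 4, 4)]:
    print(f"{label}: {total_time_ms(n, kb):.1f} ms")
# 1 disk/60 kB: 10.2 ms    4 disks/15 kB: 9.3 ms
# 1 disk/16 kB: 5.8 ms     4 disks/4 kB: 8.2 ms
```

The crossover is visible directly: splitting a transfer only pays when
the saved transfer time exceeds the extra rotational latency, which a
16 kB request never manages.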
Each request to the driver specifies one buffer for the transfer, so
the scatter/gather would have to be done by allocating more memory,
performing the transfer there (for a read), and then copying the data
to the correct place.

I have thought about aggregating in the manner you describe, and to a
certain extent I feel it's a copout not to do so.  But I hope you now
see that it doesn't really make sense in this context.

Greg
--
See complete headers for address, home page and phone numbers
finger grog@lemis.com for PGP public key

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
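The stripe-to-disk mapping underlying the whole argument can be
sketched in a few lines.  This is hypothetical illustration code, not
Vinum's implementation: the round-robin layout, zero-based disk
numbering, and stripe-aligned subdisks are simplifying assumptions.
It shows why a 60 kB read stays on one disk with the recommended large
stripes but is chopped across all four drives with 512-byte stripes.

```python
# Hypothetical sketch (not Vinum's code): map a byte range on a striped
# volume to the set of disks it touches, assuming a simple round-robin
# layout of stripe units across n_disks drives.

def disks_touched(offset: int, length: int, stripe: int, n_disks: int):
    """Disk indices hit by a read of [offset, offset + length)."""
    first = offset // stripe                  # first stripe unit touched
    last = (offset + length - 1) // stripe    # last stripe unit touched
    return {unit % n_disks for unit in range(first, last + 1)}

KB = 1024
# 256 kB stripes: a stripe-aligned 60 kB read stays on one disk.
print(disks_touched(0, 60 * KB, 256 * KB, 4))     # {0}
# 512-byte stripes: the same read hits all four disks, as a string of
# single-sector transactions on each.
print(sorted(disks_touched(0, 60 * KB, 512, 4)))  # [0, 1, 2, 3]
```

With 512-byte stripes the 60 kB read becomes 120 stripe units, 30 per
drive, which is exactly the single-sector traffic the quoted text
complains about.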