From owner-freebsd-hackers Fri Nov 13 01:06:19 1998
From: Don Lewis
Message-Id: <199811130905.BAA07536@salsa.gv.tsc.tdk.com>
Date: Fri, 13 Nov 1998 01:05:26 -0800
In-Reply-To: Greg Lehey "Re: [Vinum] Stupid benchmark: newfsstone" (Nov 12, 6:45pm)
To: Greg Lehey, Bernd Walter, Mike Smith, hackers@FreeBSD.ORG
Subject: Re: [Vinum] Stupid benchmark: newfsstone

On Nov 12, 6:45pm, Greg Lehey wrote:
} Subject: Re: [Vinum] Stupid benchmark: newfsstone
}
}                  rotational    transfer time    total
}                  latency
} 1 disk/60 kB     4.2 ms        6 ms             10.2 ms
} 4 disks/15 kB    7.8 ms        1.5 ms           9.3 ms
}
} Huh?  Why the difference in rotational latency?  If you're reading
} from one disk, on average you'll have a half track latency.  For two,
} on average one is half a track off from the other, so you'll have a
} latency of .75 of a track.  With three drives, it's .875, and with
} four drives, it's .9375 of a track.

Things should not be quite so bleak in practice, since the drives that
complete their part of any given transaction the fastest don't have to
wait for the slower drives and can get started on the next transaction.
Assuming a 9 ms seek time, if you have N independent transactions, the
total time to complete them in the small-stripe parallel case will be

        (N-1)*4.2ms + 7.8ms + N*9ms + N*1.5ms

or 238.8 ms for 16 transactions.

If the transactions are small enough that the transfer time is a small
part of the total, you are better off using larger stripes so that each
transaction involves only one spindle and four transactions can be
serviced in parallel:

        ceiling(N/4) * (4.2ms + 9ms + 6ms)

or 76.8 ms for 16 transactions.  While your calculations show that a
15 kB stripe size wins slightly in terms of the latency of a single
60 kB transaction, it looks to me like it actually loses once you have
multiple transactions and add in the seek time.

For large transactions, the average transfer rate drops because of the
time wasted by head switching and track-to-track seeks.  At some point
the transfer time will dominate, and if you don't have many transactions
that can be done in parallel it makes sense to choose a stripe size
small enough that all the drives work in parallel on a given
transaction.  If you're doing 1 MB transactions, it will take more than
100 ms just for the transfer time in the single-drive case, but about
25 ms when four drives process the transaction in parallel.
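To make the arithmetic above easy to check, here is a quick throwaway
sketch (mine, not anything out of vinum) that just plugs in the figures
quoted in this thread: the 4.2 ms and 7.8 ms rotational latencies, a
9 ms average seek, and the 6 ms / 1.5 ms transfer times for 60 kB and
15 kB.

    /*
     * Throwaway sketch of the two cases above.  All constants are the
     * figures quoted in this thread, not measurements of any real drive.
     */
    #include <stdio.h>

    #define NDRIVES        4

    /* small stripes: every transaction touches all four spindles */
    static double
    small_stripe_ms(int ntrans)
    {
            return ((ntrans - 1) * 4.2 + 7.8 + ntrans * 9.0 + ntrans * 1.5);
    }

    /* large stripes: each transaction hits one spindle, four run in parallel */
    static double
    large_stripe_ms(int ntrans)
    {
            int rounds = (ntrans + NDRIVES - 1) / NDRIVES;  /* ceiling(N/4) */

            return (rounds * (4.2 + 9.0 + 6.0));
    }

    int
    main(void)
    {
            int n = 16;
            double kb_per_ms = 60.0 / 6.0;  /* 10 kB/ms, from the table above */

            printf("%d transactions: small stripes %.1f ms, large stripes %.1f ms\n",
                n, small_stripe_ms(n), large_stripe_ms(n));
            printf("1 MB transfer time: %.0f ms on one drive, %.0f ms across four\n",
                1024.0 / kb_per_ms, 1024.0 / kb_per_ms / NDRIVES);
            return (0);
    }

It prints 238.8 ms and 76.8 ms for the 16-transaction case, and 102 ms
versus 26 ms of transfer time for the 1 MB case, which are the figures
used above.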
Since the difference in transfer time is much greater than the
difference in rotational latency between these two cases, using a
stripe size of less than one quarter of the transaction size is the
winning strategy.  I suspect it would be optimal if the stripes
corresponded to disk tracks, but since modern drives have track lengths
that vary between cylinders, this is pretty impractical.

I suspect that most common system workloads fall into the small
transaction category, so the stripe size should be chosen large enough
that it is not common for a transaction to be split across multiple
drives.  This maximizes the number of transactions that can be
processed in parallel.  The only problem with very large stripe sizes
is that there may be periods of time when a particular part of a
filesystem gets heavy use, and all of that load would be directed to
one drive while the others sat idle.
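To put a rough number on "not common for a transaction to be split":
if you assume transaction start offsets are uniformly distributed
(a simplification, and again my own sketch rather than anything in
vinum), a transaction of len bytes on stripes of stripesize bytes
crosses a stripe boundary with probability of about (len - 1) /
stripesize.

    /*
     * Rule-of-thumb sketch: with uniformly distributed start offsets, a
     * transaction of len bytes on stripes of stripesize bytes straddles
     * a stripe boundary (and so ties up more than one spindle) with
     * probability about (len - 1) / stripesize, capped at 1.
     */
    #include <stdio.h>

    static double
    split_probability(long len, long stripesize)
    {
            double p = (double)(len - 1) / (double)stripesize;

            return (p > 1.0 ? 1.0 : p);
    }

    int
    main(void)
    {
            long sizes[] = { 8192, 16384, 65536 };  /* example transaction sizes */
            long stripesize = 256 * 1024;           /* 256 kB, picked arbitrarily */
            int i;

            for (i = 0; i < 3; i++)
                    printf("%3ld kB transaction on %ld kB stripes: %.1f%% split\n",
                        sizes[i] / 1024, stripesize / 1024,
                        100.0 * split_probability(sizes[i], stripesize));
            return (0);
    }

With a 256 kB stripe, 8 kB transactions land entirely on one drive
about 97% of the time, while 64 kB transactions split onto two drives
a quarter of the time, which is the sort of tradeoff described above.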