From owner-freebsd-hackers Sun Mar 21 14:52:15 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from allegro.lemis.com (allegro.lemis.com [192.109.197.134])
	by hub.freebsd.org (Postfix) with ESMTP id 2484814EF7;
	Sun, 21 Mar 1999 14:52:08 -0800 (PST)
	(envelope-from grog@freebie.lemis.com)
Received: from freebie.lemis.com (freebie.lemis.com [192.109.197.137])
	by allegro.lemis.com (8.9.1/8.9.0) with ESMTP id JAA13703;
	Mon, 22 Mar 1999 09:21:48 +1030 (CST)
Received: (from grog@localhost)
	by freebie.lemis.com (8.9.3/8.9.0) id JAA07220;
	Mon, 22 Mar 1999 09:21:47 +1030 (CST)
Message-ID: <19990322092147.T429@lemis.com>
Date: Mon, 22 Mar 1999 09:21:47 +1030
From: Greg Lehey
To: Nick Hilliard
Cc: tom@sdf.com, freebsd-scsi@FreeBSD.ORG, FreeBSD Hackers
Subject: Re: dpt raid-5 performance
References: <19990321084436.Z429@lemis.com> <199903211417.OAA28733@beckett.earlsfort.iol.ie>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.93.2i
In-Reply-To: <199903211417.OAA28733@beckett.earlsfort.iol.ie>; from Nick Hilliard on Sun, Mar 21, 1999 at 02:17:13PM +0000
WWW-Home-Page: http://www.lemis.com/~grog
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-41-739-7062
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

[copying -hackers, since this is of relevance beyond -scsi]

On Sunday, 21 March 1999 at 14:17:13 +0000, Nick Hilliard wrote:
>> I haven't yet replied to Nick's message because I wanted to check
>> something here first, and I've been too busy so far.  But I'll come
>> back with some comparisons.
>
> I'm going to run some benchmarks over the next few days and see what they
> throw up.
>
> My instinct was that 512K was a "good" interleave size in some sense
> of the word, mainly because of the fact that it would cause so many
> fewer disk io ops in most circumstances -- in fact, all
> circumstances except where you're doing piles of tiny io ops.  The
> bonnie results seem to shatter this illusion.

I have found a similar tendency in my testing with vinum.  *However*,
I also looked at the number of I/O requests issued, and they vary
considerably; there could be a cache interaction problem here.
Another problem is that bonnie doesn't measure raw disk throughput,
which is what we're really trying to measure, so the figures don't
make complete sense.

Here are the preliminary results.  They were all done with a 1.6 GB
volume with one plex spread over four ancient CDC drives (thus the
poor overall performance; the comparisons should be valid, however).
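(In case anybody wants to set up something similar: a striped test
plex like these can be described to vinum with a configuration file
along the following lines.  The drive names, device names and subdisk
lengths here are invented for the example; they're not the ones I
actually used.

  drive c0 device /dev/da0s1h
  drive c1 device /dev/da1s1h
  drive c2 device /dev/da2s1h
  drive c3 device /dev/da3s1h
  volume test
    plex org striped 64k
      sd length 400m drive c0
      sd length 400m drive c1
      sd length 400m drive c2
      sd length 400m drive c3

For the RAID-5 runs, the plex line becomes something like
'plex org raid5 512k' instead.)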
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU  Writes  Reads Mblock Mstripe
ufs       100   582 13.8   479  3.0   559  5.2  1121 24.6  1124  5.2  45.4  2.6
s1k       100   156 15.4   150 12.3   108  3.7   230  6.7   230  2.7  36.3  3.4  311848 328587 619009  138783
s8k       100  1492 44.8  1478 18.4   723  8.1  1466 34.0  1467  8.2 115.4  8.0   38913  41065  56152    9337
s64k      100  1723 48.6  1581 18.6  1021 11.8  1792 39.5  1827 11.1 115.3  8.8   17238   8231   1294     333
s256k     100  1717 47.2  1629 19.0   937 11.2  1469 32.2  1467  8.7  95.9  7.8   16982   9272   2001     494
s512k     100  1772 48.8  1621 18.0   732  8.3  1256 27.4  1254  7.4 115.4  8.8   16157   7564    155      37
r512k     100   379 14.9   385  8.9   360  4.5  1122 24.7  1258  7.4  80.9  6.7   38339  46453    521     793
s4m/16    100  1572 52.8  1336 18.6   612  6.1  1139 25.2  1142  5.6  97.9  7.1   20434   8028     17       7
s4m/17    100  1431 44.8  1234 16.9   613  6.1  1145 25.4  1147  5.6  97.3  7.0   19922   8101    113      31

Sorry for the format; I'll probably remove some of the bonnie columns
when I'm done.  The "Machine" column indicates the type and stripe
size of the plex (r: RAID-5, s: striped, ufs: a straight UFS partition
for comparison purposes).  The additional columns at the end are the
writes and reads at plex level, the number of multiblock transfers
(combined read and write), and the number of multistripe transfers
(combined read and write).  A multiblock transfer is one which
requires two separate I/Os to satisfy, and a multistripe transfer is
one which requires accessing two different stripes.  They're the main
cause of degraded performance with small stripes.
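To make that distinction concrete, here's a small sketch of the
arithmetic involved.  This isn't vinum's code, just an illustration
(the names classify() and struct xferclass are made up for it): it
decides whether a transfer crosses a stripe-unit boundary (multiblock)
and whether its start and end fall in different stripes (multistripe).

/*
 * Sketch only -- not vinum's code.  Classify a transfer of `len'
 * bytes starting at plex offset `offset' on a striped plex with
 * `ndisks' subdisks and a stripe unit of `unit' bytes.
 */
#include <stdio.h>

struct xferclass {
	int	multiblock;		/* crosses a stripe-unit boundary */
	int	multistripe;		/* start and end in different stripes */
};

static struct xferclass
classify(unsigned long long offset, unsigned long long len,
    unsigned long long unit, int ndisks)
{
	unsigned long long last = offset + len - 1;
	unsigned long long stripe = unit * ndisks;	/* one complete stripe */
	struct xferclass c;

	c.multiblock = (offset / unit) != (last / unit);
	c.multistripe = (offset / stripe) != (last / stripe);
	return c;
}

int
main(void)
{
	/* a bonnie-style 8 kB transfer at an arbitrary offset, 4 subdisks */
	struct xferclass a = classify(123 * 512, 8192, 1024, 4);	/* 1 kB units */
	struct xferclass b = classify(123 * 512, 8192, 512 * 1024, 4);	/* 512 kB units */

	printf("1 kB units:   multiblock %d, multistripe %d\n",
	    a.multiblock, a.multistripe);
	printf("512 kB units: multiblock %d, multistripe %d\n",
	    b.multiblock, b.multistripe);
	return 0;
}

With 1 kB stripes an 8 kB transfer always crosses unit and stripe
boundaries; with 512 kB stripes it hardly ever does, which matches the
pattern in the Mblock and Mstripe columns above.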
I tried two different approaches with the 4 MB stripes: with a default
newfs, I got 16 cylinders per cylinder group and cylinder groups of
32 MB, which placed all the superblocks on the first disk.  The second
time I tried 17 cylinders per cylinder group, which put successive
superblocks on different disks.

Some of the things that seem to come out of these results are:

- Performance with 1 kB stripes is terrible.  Performance with 8 kB
  stripes is much better, but a further increase in stripe size still
  helps.

- Block read and random seek performance increase dramatically up to
  a stripe size of about 64 kB, after which they drop off again.

- Block write performance increases up to a stripe size of 512 kB,
  after which it drops off again.

- Peak write performance is about 3.5 times that of a straight UFS
  file system.  This is due to the buffer cache: the writes are
  asynchronous as far as the process is concerned, and can thus
  overlap.  I'm quite happy with this particular figure, since it's
  relatively close to the theoretical maximum of a 4x performance
  improvement.

- Peak read performance is about 1.6 times that of a straight UFS
  file system.

- RAID-5 read performance is comparable to striped read performance.
  RAID-5 write performance is about 24% of striped write performance.
  Note that there is a *decrease* in CPU time for RAID-5 writes: the
  reason for the performance decrease is that there are many more I/O
  operations (compare the Reads and Writes columns).

The trouble with these results is that they don't make sense.
Although we can see some clear trends, there are also obvious
anomalies:

- On a striped volume, the mapping of reads and writes is identical.
  Why should reads peak at 64 kB and writes at 512 kB?

- The number of multiblock and multistripe transfers for s4m/17 is 8
  times that for s4m/16, yet the number of writes for s4m/17 is lower
  than for s4m/16.  The number of writes should be the number of raw
  writes to the device (the volume) plus the number of multiblock and
  multistripe transfers; in other words, s4m/17 should have *more*
  transfers, not fewer.  There's obviously something else going on
  here, and I suspect the cache.

- Random seek performance is pretty constant for s8k, s64k and s512k.
  Since bonnie performs 8 kB transfers, this seems reasonable.  On the
  other hand, the performance was much worse for s256k, which I ran
  last.  Again, I suspect that there are other issues here which are
  clouding the results.

In addition, bonnie does not simulate true file system activity well.
The character I/O benchmarks are not relevant to what we're looking
at, and the block I/O benchmarks all use 8 kB transfers.  Real file
system traffic involves transfers of between 1 and 120 sectors of 512
bytes, with an average apparently of the order of 8 kB.  In real life,
the performance benefits of large stripes will be greater.  I'm
currently thinking of writing a program which will simulate this
behaviour and give more plausible measurements.

To add to this theory, I've just re-run the 64 kB test under what look
to me like identical conditions.  Here are the results.  The first
line is a copy of the one I did yesterday (above); the second line
shows the new results:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU  Writes  Reads Mblock Mstripe
s64k      100  1723 48.6  1581 18.6  1021 11.8  1792 39.5  1827 11.1 115.3  8.8   17238   8231   1294     333
s64k      100  1711 48.4  1633 18.9   983 11.5  1778 39.6  1815 11.3  95.8  7.8   16495   8029   7952    1986

In other words, there are significant differences in the way vinum
was accessed in each case, and in particular we can assume that the
differences in random seek performance are, well, random.

Getting back to your results which started this thread, however,
there are some significant differences (the first line is your DPT,
the second is my r512k run from above):

        -------Sequential Output-------- ---Sequential Input-- --Random--
        -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
     MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
    256   541  7.5   491  1.7   458  2.3  4293 59.1  4335 16.2 193.6  6.8
    100   379 14.9   385  8.9   360  4.5  1122 24.7  1258  7.4  80.9  6.7

Comparing the block write and block read figures, vinum gets about 30%
of the read performance on writes.  Your DPT write results show only
11% of the read performance, and are in fact only slightly faster than
vinum with the ancient disks, so I can't see that this could be due to
the faster disks.  So yes, I suspect there is something wrong here.
It's possible that the DPT doesn't DTRT with large stripes: vinum only
accesses the part of a stripe which is needed for a transfer.  It's
possible that the DPT accesses the complete 512 kB block on each
transfer, in which case, of course, it would be detrimental to use a
stripe size in excess of about 64 kB, and you might even get better
performance with 32 kB.  If this is the case, however, it's a bug in
the DPT's implementation, not a general principle.

Greg
--
See complete headers for address, home page and phone numbers
finger grog@lemis.com for PGP public key

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message