From owner-freebsd-hackers Sun Mar 21 14:52:15 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from allegro.lemis.com (allegro.lemis.com [192.109.197.134])
	by hub.freebsd.org (Postfix) with ESMTP id 2484814EF7;
	Sun, 21 Mar 1999 14:52:08 -0800 (PST)
	(envelope-from grog@freebie.lemis.com)
Received: from freebie.lemis.com (freebie.lemis.com [192.109.197.137])
	by allegro.lemis.com (8.9.1/8.9.0) with ESMTP id JAA13703;
	Mon, 22 Mar 1999 09:21:48 +1030 (CST)
Received: (from grog@localhost)
	by freebie.lemis.com (8.9.3/8.9.0) id JAA07220;
	Mon, 22 Mar 1999 09:21:47 +1030 (CST)
Message-ID: <19990322092147.T429@lemis.com>
Date: Mon, 22 Mar 1999 09:21:47 +1030
From: Greg Lehey
To: Nick Hilliard
Cc: tom@sdf.com, freebsd-scsi@FreeBSD.ORG, FreeBSD Hackers
Subject: Re: dpt raid-5 performance
References: <19990321084436.Z429@lemis.com> <199903211417.OAA28733@beckett.earlsfort.iol.ie>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.93.2i
In-Reply-To: <199903211417.OAA28733@beckett.earlsfort.iol.ie>; from Nick Hilliard on Sun, Mar 21, 1999 at 02:17:13PM +0000
WWW-Home-Page: http://www.lemis.com/~grog
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-41-739-7062
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

[copying -hackers, since this is of relevance beyond -scsi]

On Sunday, 21 March 1999 at 14:17:13 +0000, Nick Hilliard wrote:
>> I haven't yet replied to Nick's message because I wanted to check
>> something here first, and I've been too busy so far.  But I'll come
>> back with some comparisons.
>
> I'm going to run some benchmarks over the next few days and see what they
> throw up.
>
> My instinct was that 512K was a "good" interleave size in some sense
> of the word, mainly because of the fact that it would cause so many
> fewer disk io ops in most circumstances -- in fact, all
> circumstances except where you're doing piles of tiny io ops.  The
> bonnie results seem to shatter this illusion.

I have found a similar tendency in my testing with vinum.  *However*,
I also looked at the number of I/O requests issued, and they vary
considerably; there could be a cache interaction problem here.
Another problem is that bonnie doesn't measure raw disk throughput,
which is what we're really trying to measure, so the figures don't
make complete sense.

Here are the preliminary results.  They were all done with a 1.6 GB
volume with one plex spread over four ancient CDC drives (thus the
poor overall performance; the comparisons should be valid, however).
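(In case anybody wants to set up something similar: a striped test
plex like these can be described to vinum with a configuration file
along the following lines.  The drive names, device names and subdisk
lengths here are invented for the example; they're not the ones I
actually used.

  drive c0 device /dev/da0s1h
  drive c1 device /dev/da1s1h
  drive c2 device /dev/da2s1h
  drive c3 device /dev/da3s1h
  volume test
    plex org striped 64k
      sd length 400m drive c0
      sd length 400m drive c1
      sd length 400m drive c2
      sd length 400m drive c3

For the RAID-5 runs, the plex line becomes something like
'plex org raid5 512k' instead.)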
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU  Writes  Reads Mblock Mstripe
ufs       100   582 13.8   479  3.0   559  5.2  1121 24.6  1124  5.2  45.4  2.6
s1k       100   156 15.4   150 12.3   108  3.7   230  6.7   230  2.7  36.3  3.4  311848 328587 619009  138783
s8k       100  1492 44.8  1478 18.4   723  8.1  1466 34.0  1467  8.2 115.4  8.0   38913  41065  56152    9337
s64k      100  1723 48.6  1581 18.6  1021 11.8  1792 39.5  1827 11.1 115.3  8.8   17238   8231   1294     333
s256k     100  1717 47.2  1629 19.0   937 11.2  1469 32.2  1467  8.7  95.9  7.8   16982   9272   2001     494
s512k     100  1772 48.8  1621 18.0   732  8.3  1256 27.4  1254  7.4 115.4  8.8   16157   7564    155      37
r512k     100   379 14.9   385  8.9   360  4.5  1122 24.7  1258  7.4  80.9  6.7   38339  46453    521     793
s4m/16    100  1572 52.8  1336 18.6   612  6.1  1139 25.2  1142  5.6  97.9  7.1   20434   8028     17       7
s4m/17    100  1431 44.8  1234 16.9   613  6.1  1145 25.4  1147  5.6  97.3  7.0   19922   8101    113      31

Sorry for the format; I'll probably remove some of the bonnie columns
when I'm done.  The "Machine" column indicates the type and stripe
size of the plex (r: RAID-5, s: striped, ufs: a straight UFS partition
for comparison purposes).  The additional columns at the end are the
writes and reads at plex level, the number of multiblock transfers
(combined read and write), and the number of multistripe transfers
(combined read and write).  A multiblock transfer is one which
requires two separate I/Os to satisfy, and a multistripe transfer is
one which requires accessing two different stripes.  They're the main
cause of degraded performance with small stripes.
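To make that distinction concrete, here's a small sketch of the
arithmetic involved.  This isn't vinum's code, just an illustration
(the names classify() and struct xferclass are made up for it): it
decides whether a transfer crosses a stripe-unit boundary (multiblock)
and whether its start and end fall in different stripes (multistripe).

/*
 * Sketch only -- not vinum's code.  Classify a transfer of `len'
 * bytes starting at plex offset `offset' on a striped plex with
 * `ndisks' subdisks and a stripe unit of `unit' bytes.
 */
#include <stdio.h>

struct xferclass {
	int	multiblock;		/* crosses a stripe-unit boundary */
	int	multistripe;		/* start and end in different stripes */
};

static struct xferclass
classify(unsigned long long offset, unsigned long long len,
    unsigned long long unit, int ndisks)
{
	unsigned long long last = offset + len - 1;
	unsigned long long stripe = unit * ndisks;	/* one complete stripe */
	struct xferclass c;

	c.multiblock = (offset / unit) != (last / unit);
	c.multistripe = (offset / stripe) != (last / stripe);
	return c;
}

int
main(void)
{
	/* a bonnie-style 8 kB transfer at an arbitrary offset, 4 subdisks */
	struct xferclass a = classify(123 * 512, 8192, 1024, 4);	/* 1 kB units */
	struct xferclass b = classify(123 * 512, 8192, 512 * 1024, 4);	/* 512 kB units */

	printf("1 kB units:   multiblock %d, multistripe %d\n",
	    a.multiblock, a.multistripe);
	printf("512 kB units: multiblock %d, multistripe %d\n",
	    b.multiblock, b.multistripe);
	return 0;
}

With 1 kB stripes an 8 kB transfer always crosses unit and stripe
boundaries; with 512 kB stripes it hardly ever does, which matches the
pattern in the Mblock and Mstripe columns above.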
I tried two different approaches with the 4 MB stripes: with a default
newfs, I got 16 cylinders per cylinder group and cylinder groups of
32 MB, which placed all the superblocks on the first disk.  The second
time I tried 17 cylinders per cylinder group, which put successive
superblocks on different disks.

Some of the things that seem to come out of these results are:

- Performance with 1 kB stripes is terrible.  Performance with 8 kB
  stripes is much better, but a further increase in stripe size still
  helps.

- Block read and random seek performance increase dramatically up to
  a stripe size of about 64 kB, after which they drop off again.

- Block write performance increases up to a stripe size of 512 kB,
  after which it drops off again.

- Peak write performance is about 3.5 times that of a straight UFS
  file system.  This is due to the buffer cache: the writes are
  asynchronous as far as the process is concerned, and can thus
  overlap.  I'm quite happy with this particular figure, since it's
  relatively close to the theoretical maximum of a 4x performance
  improvement.

- Peak read performance is about 1.6 times that of a straight UFS
  file system.

- RAID-5 read performance is comparable to striped read performance.
  RAID-5 write performance is about 24% of striped write performance.
  Note that there is a *decrease* in CPU time for RAID-5 writes: the
  reason for the performance decrease is that there are many more I/O
  operations (compare the Reads and Writes columns).

The trouble with these results is that they don't make sense.
Although we can see some clear trends, there are also obvious
anomalies:

- On a striped volume, the mapping of reads and writes is identical.
  Why should reads peak at 64 kB and writes at 512 kB?

- The number of multiblock and multistripe transfers for s4m/17 is 8
  times that for s4m/16, yet the number of writes for s4m/17 is lower
  than for s4m/16.  The number of writes should be the number of raw
  writes to the device (the volume) plus the number of multiblock and
  multistripe transfers; in other words, s4m/17 should have *more*
  transfers, not fewer.  There's obviously something else going on
  here, and I suspect the cache.

- Random seek performance is pretty constant for s8k, s64k and s512k.
  Since bonnie performs 8 kB transfers, this seems reasonable.  On the
  other hand, the performance was much worse for s256k, which I ran
  last.  Again, I suspect that there are other issues here which are
  clouding the results.

In addition, bonnie does not simulate true file system activity well.
The character I/O benchmarks are not relevant to what we're looking
at, and the block I/O benchmarks all use 8 kB transfers.  Real file
system traffic involves transfers of between 1 and 120 sectors of 512
bytes, with an average apparently of the order of 8 kB.  In real life,
the performance benefits of large stripes will be greater.  I'm
currently thinking of writing a program which will simulate this
behaviour and give more plausible measurements.

To add to this theory, I've just re-run the 64 kB test under what look
to me like identical conditions.  Here are the results.  The first
line is a copy of the one I did yesterday (above); the second line
shows the new results:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU  Writes  Reads Mblock Mstripe
s64k      100  1723 48.6  1581 18.6  1021 11.8  1792 39.5  1827 11.1 115.3  8.8   17238   8231   1294     333
s64k      100  1711 48.4  1633 18.9   983 11.5  1778 39.6  1815 11.3  95.8  7.8   16495   8029   7952    1986

In other words, there are significant differences in the way vinum
was accessed in each case, and in particular we can assume that the
differences in random seek performance are, well, random.

Getting back to your results which started this thread, however,
there are some significant differences (the first line is your DPT,
the second is my r512k run from above):

        -------Sequential Output-------- ---Sequential Input-- --Random--
        -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
     MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
    256   541  7.5   491  1.7   458  2.3  4293 59.1  4335 16.2 193.6  6.8
    100   379 14.9   385  8.9   360  4.5  1122 24.7  1258  7.4  80.9  6.7

Comparing the block write and block read figures, vinum gets about 30%
of the read performance on writes.  Your DPT write results show only
11% of the read performance, and are in fact only slightly faster than
vinum with the ancient disks, so I can't see that this could be due to
the faster disks.  So yes, I suspect there is something wrong here.
It's possible that the DPT doesn't DTRT with large stripes: vinum only
accesses the part of a stripe which is needed for a transfer.  It's
possible that the DPT accesses the complete 512 kB block on each
transfer, in which case, of course, it would be detrimental to use a
stripe size in excess of about 64 kB, and you might even get better
performance with 32 kB.  If this is the case, however, it's a bug in
the DPT's implementation, not a general principle.

Greg
--
See complete headers for address, home page and phone numbers
finger grog@lemis.com for PGP public key

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message