From: Richard Wendland
Date: Wed, 22 Mar 2000 00:22:42 +0000 (GMT)
To: Paul Richards
Cc: Alfred Perlstein, Poul-Henning Kamp, Matthew Dillon,
    current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: FreeBSD random I/O performance issues
In-Reply-To: <38D6BBD7.DA4B950B@originative.co.uk> from Paul Richards
    at "Mar 21, 2000 00:01:27 am"
Message-Id: <200003220022.AAA28786@ns0.netcraft.com>

Paul Richards said in "Re: patches for test / review":
> Richard, do you want to post a summary of your tests?

Well I'd best post the working draft of my report on the issues I've
seen, as I'm not going to have time to work on it in the near future,
and it raises serious performance issues that are best looked at soon.

Note that none of these detailed results are from current, but Paul
Richards has checked that these issues are still present in current.

There are still issues to be explored, so this report isn't in a
complete state, and is not polished. It's grown in 3 stages:

 - initial Berkeley DB (random I/O) performance problem analysis
 - side-issue of ATA outperforming SCSI systems at my synthetic
   benchmark
 - interesting dramatic performance changes from changing seek
   multiple and I/O block size one byte from 8192

Note I've cc'd freebsd-fs, as this raises issues in the filesystem area.
I've also changed the subject, since I think there are broader issues
here than the clustering algorithm, and this email is rather large to
drop into an ongoing discussion.

The benchmark program source code is available and easy to run; the
bottom of the report has links.

I don't have an explanation for the behaviour I have been measuring,
but I hope these quite extensive results will enable someone to
explain it and perhaps suggest improvements.

	Richard.


Folks,

I appear to have found a serious performance problem with random
access file I/O in FreeBSD, and have a simple C benchmark program
which reproducibly demonstrates it.

Because the benchmark demonstrates very poor non-async performance,
this touches on the age-old sync/async filesystem argument, and on
FreeBSD vs Linux debates.

I originally observed this problem with perl DB_File (Berkeley DB),
and with the help of truss have synthesised this benchmark as a much
simplified model of heavy Berkeley DB update behaviour. Quite probably
other database-like software will have similar performance issues.

This issue appears to be related to the traditional BSD behaviour of
immediately scheduling full disc block writes. I think this benchmark
must be showing up a related bug. But it is conceivable that this is
intended noasync behaviour, in which case the implications need to be
thought through.

The program does simple random I/O within a 64KB file, which I would
hope is fully cached, so that hardly any real I/O is done. Other than
mtime, this program makes no file meta-data or directory changes, and
the file remains the same size.

The file is used as eight 8KB blocks, visited in the order
0,5,2,7,4,1,6,3,0,... In total 10,000 lseek/read/lseek/write block
updates are done, much like updating 10,000 non-localised Berkeley DB
file records.

Using a tiny 64KB file is just to simplify and make a point. My
original perl performance problems were with multi-megabyte files, but
still small enough to be fully cached.
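
The access pattern described above can be sketched in C roughly as
follows. This is an illustration only, not the seekreadwrite.c linked
at the end of the report. The names block_for, run_updates, NBLOCKS,
BLKSIZE and NUPDATES are made up for this sketch, and the "stride 5
modulo 8" rule is an assumption that happens to reproduce the quoted
block order.

```c
/* Sketch of the benchmark's access pattern; NOT the original seekreadwrite.c. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NBLOCKS  8       /* the file is used as 8 blocks ...          */
#define BLKSIZE  8192    /* ... of 8KB each, 64KB in total            */
#define NUPDATES 10000   /* number of block updates quoted in report  */

/* Block visited on update i. Assumption: a stride of 5 modulo 8, which
 * yields the order 0,5,2,7,4,1,6,3,0,... given in the report. */
static int block_for(long i)
{
    return (int)((i * 5) % NBLOCKS);
}

/* Create a 64KB file at `path`, then do `nupdates` lseek/read/lseek/write
 * block updates on it. Returns 0 on success, -1 on any error. */
static int run_updates(const char *path, long nupdates)
{
    static char buf[BLKSIZE];
    int fd, b;
    long i;

    assert(nupdates >= 0);
    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* Fill the file so every later read returns a full block. */
    memset(buf, 0, sizeof buf);
    for (b = 0; b < NBLOCKS; b++) {
        if (write(fd, buf, BLKSIZE) != BLKSIZE) {
            close(fd);
            return -1;
        }
    }

    for (i = 0; i < nupdates; i++) {
        off_t off = (off_t)block_for(i) * BLKSIZE;

        if (lseek(fd, off, SEEK_SET) < 0 ||
            read(fd, buf, BLKSIZE) != BLKSIZE) {
            close(fd);
            return -1;
        }
        buf[0]++;   /* modify the block, like a record update */
        if (lseek(fd, off, SEEK_SET) < 0 ||
            write(fd, buf, BLKSIZE) != BLKSIZE) {
            close(fd);
            return -1;
        }
    }
    return close(fd);
}
```

Calling run_updates("testfile", NUPDATES) from a small main() and
running the result under csh 'time' should give elapsed time and I/O
counts in the same form as the figures in this report, though the
numbers themselves depend on the system and mount options.
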

I ran this on a large range of lightly loaded or idle machines, which
gave reproducible results. Results and a summary of the machines,
which unless otherwise noted use SCSI 7200 RPM discs and Adaptec
controllers, are given in descending performance order below.

OS                                    Elapse secs, system

FreeBSD 3.2-RELEASE, async mount          <1  (cheap ATA C433, 5400 RPM)
Linux 2.2.13                              <1  (Dell 1300, PIII 450MHz)
Linux 2.0.36                               3  (old ATA P200, 5400 RPM)
Linux 2.0.36, sync [meta-data] mount       3  (old ATA P200, 5400 RPM)
SunOS 5.5.1 (Solaris 2.5.1)                7  (old SS4/110, 5400 RPM)
FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=5   15  (PII 450MHz, 512MB, 10k RPM)
FreeBSD 2.2.7-RELEASE+CAM                 21  (PII 400MHz, 512MB)
FreeBSD 2.1.6.1-RELEASE                   32  (old P100, 64MB)
FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=2   39  (PII 400MHz, 512MB)
FreeBSD 3.4-STABLE, vinum stripe+mirr=4   41  (dual PIII 500MHz, 1GB)
FreeBSD 3.4-STABLE                        41  (dual PIII 500MHz, 1GB)
FreeBSD 2.1.6.1-RELEASE, ccd stripe=2     52  (old P100, 64MB)
FreeBSD 3.3-RELEASE, ccd stripe=2         53  (Dell 1300, PIII 450MHz)
FreeBSD 3.2-RELEASE                       55  (cheap ATA C433, 5400 RPM)
FreeBSD 3.2-RELEASE, noatime mount        55  (cheap ATA C433, 5400 RPM)
FreeBSD 3.2-RELEASE, noclusterr mount     55  (cheap ATA C433, 5400 RPM)
FreeBSD 3.2-RELEASE, noclusterw mount     58  (cheap ATA C433, 5400 RPM)
FreeBSD 3.3-RELEASE                       63  (Dell 1300, PIII 450MHz)
FreeBSD 3.3-RELEASE, softupdates          63  (Dell 1300, PIII 450MHz)
FreeBSD 3.2-RELEASE, sync mount          105  (cheap ATA C433, 5400 RPM)

I also have a range of results from an ATA (IDE) cheap deskside Dell
system running FreeBSD 3.3-RELEASE, with a range of wd(4) flags. This
system exhibits much better performance than the SCSI systems above
at this benchmark, perhaps related to better DMA ability. ATA being
faster than SCSI on this benchmark is a bit of a side-issue to the
thrust of this report, but the performance numbers may give hints for
diagnosing the problem.

Dell Dimension XPS T450 440BX
IBM-DPTA-372730 (Deskstar 34GXP, 7200RPM, 2MB buffer)
default mount options

wd(4) flags                             Elapse secs

0x0000                                  19
0x00ff, multi-sector transfer mode      17
0x8000, 32bit transfers                 13
0x2000, bus-mastering DMA                4
0xa0ff, BM-DMA+32bit+multi-sector        4

Note that Linux performs about the same for [meta-data] sync & async
mounts, which is as I'd expect for this program. But FreeBSD
performance is hugely affected by async, sync or default (meta-data
sync) filesystem mounts, with noclusterw unsurprisingly making it
somewhat worse.

One interesting observation is that for mounts that are not sync,
async or noclusterw, ~8750 I/O operations are done, which is 7/8ths
of the 10,000 writes. If I change the program to use 16 blocks there
are ~9375 I/O operations, which is 15/16ths of the 10,000 writes.
Guessing, this is as if writes are forced for all blocks but one.

With async filesystem mounts very little I/O occurs, and with
noclusterw there are ~10,000 operations, matching the number of
writes. With sync it's ~20,000 operations, matching the total of
reads & writes. This demonstrates another aspect of the bug: sync
behaviour should cause 10,000 operations, so the reads aren't being
cached.

A quick softupdates test suggests this makes no difference, as would
be expected.

Looking at mount output on FreeBSD 3, the substantial part of the I/O
is async in all cases other than sync mounts, as expected.

Another aspect of this issue is the effect of changing the seek
blocksize, and write blocksize, by 1 byte each way from 8192, thus
doing block unaligned I/O. In some cases this changes the amount of
I/O recorded by getrusage to zero, and drops elapse time from half a
minute or so to less than 1 second. Thanks to Paul Richards for
noticing this.

I've not spent much time researching this, so can only present my
small set of measurements. To do these tests you have to recompile my
test program each time, e.g.

	gcc -O4 -DBLOCKSIZE=8191 -DWRITESIZE=8193 seekreadwrite.c

Sorry it's that crude.
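
As a rough illustration of the two measurements discussed above, the
following C sketch shows how compile-time BLOCKSIZE and WRITESIZE
macros might be given defaults, how the block I/O counts that csh
'time' prints as "N+Mio" can be read with getrusage, and the
arithmetic behind the 7/8ths and 15/16ths figures. It is not taken
from seekreadwrite.c. The names io_count, io_snapshot, block_offset
and guessed_forced_writes are hypothetical, and reading BLOCKSIZE as
the seek multiple and WRITESIZE as the write length is an assumption
based on the description in this report.

```c
/* Illustrative helpers only; not part of the original seekreadwrite.c. */
#include <assert.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

#ifndef BLOCKSIZE
#define BLOCKSIZE 8192   /* assumed: seek multiple, set with -DBLOCKSIZE=n */
#endif
#ifndef WRITESIZE
#define WRITESIZE 8192   /* assumed: write length, set with -DWRITESIZE=n  */
#endif

/* Block input/output operation counts as reported by getrusage. */
struct io_count {
    long in;    /* ru_inblock: block input operations  */
    long out;   /* ru_oublock: block output operations */
};

/* Record this process's block I/O counts so far. Returns 0 on success. */
static int io_snapshot(struct io_count *c)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    c->in = ru.ru_inblock;
    c->out = ru.ru_oublock;
    return 0;
}

/* File offset of block `blk` when seeking in multiples of BLOCKSIZE. */
static off_t block_offset(int blk)
{
    return (off_t)blk * BLOCKSIZE;
}

/* The report's guess: writes are forced for all blocks but one, so the
 * expected I/O count is nwrites * (nblocks - 1) / nblocks. */
static long guessed_forced_writes(long nwrites, long nblocks)
{
    assert(nblocks > 0);
    return nwrites * (nblocks - 1) / nblocks;
}
```

Taking io_snapshot before and after the update loop and subtracting
gives the I/O done by the loop itself. With the figures in this
report, guessed_forced_writes(10000, 8) is 8750 and
guessed_forced_writes(10000, 16) is 9375.
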

These results are from a FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=2
(PII 400MHz, 512MB) system, though exactly the same pattern is
apparent with 3.4-STABLE. "****" indicates sub-second "zero I/O"
results.

BLOCKSIZE WRITESIZE  csh 'time' output

  8191      8191     0.0u 1.5s 0:34.10 4.6% 5+186k 0+7500io 0pf+0w
  8191      8192     0.0u 1.3s 0:31.52 4.5% 5+178k 0+7500io 0pf+0w
  8191      8193     0.0u 1.4s 0:32.63 4.4% 5+189k 0+7500io 0pf+0w
  8192      8191     0.0u 0.7s 0:01.97 37.5% 8+178k 0+0io 0pf+0w     ****
  8192      8192     0.0u 1.3s 0:39.30 3.4% 7+196k 0+8750io 0pf+0w
  8192      8193     0.0u 1.3s 0:40.09 3.4% 5+187k 0+8750io 0pf+0w
  8193      8191     0.0u 1.4s 0:46.22 3.2% 5+192k 0+8750io 0pf+0w
  8193      8192     0.0u 1.6s 0:40.48 4.0% 5+182k 0+8750io 0pf+0w
  8193      8193     0.0u 1.5s 0:40.57 3.8% 5+175k 0+8750io 0pf+0w

  8191      4095     0.0u 1.2s 0:33.79 3.6% 5+193k 0+7500io 0pf+0w
  8191      4096     0.0u 1.2s 0:34.00 3.8% 5+190k 0+7500io 0pf+0w
  8191      4097     0.0u 1.1s 0:33.58 3.6% 4+165k 0+7500io 0pf+0w
  8192      4095     0.0u 0.5s 0:00.76 75.0% 5+189k 0+0io 0pf+0w     ****
  8192      4096     0.0u 0.5s 0:00.58 100.0% 5+183k 0+0io 0pf+0w    ****
  8192      4097     0.0u 0.5s 0:00.74 78.3% 5+181k 0+0io 0pf+0w     ****
  8193      4095     0.0u 0.6s 0:01.00 67.0% 5+177k 0+0io 0pf+0w     ****
  8193      4096     0.0u 0.6s 0:01.05 63.8% 5+179k 0+0io 0pf+0w     ****
  8193      4097     0.0u 0.6s 0:01.02 66.6% 5+183k 0+0io 0pf+0w     ****

Any views gratefully received. A fix would be much better :-)

Test program source, including compile & run instructions, is
available at:

	http://www.netcraft.com/freebsd/random-IO/seekreadwrite.c

Detailed notes on the test system configurations are at:

	http://www.netcraft.com/freebsd/random-IO/results-notes.txt

Thanks,
	Richard
-
Richard Wendland				richard@netcraft.com