Date: Sun, 6 Feb 2005 14:43:46 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Jeremie Le Hen <jeremie@le-hen.org>
Cc: performance@FreeBSD.org
Subject: Re: Some initial postmark numbers from a dual-PIII+ATA, 4.x and 6.x
Message-ID: <Pine.NEB.3.96L.1050206132936.55544C-100000@fledge.watson.org>
In-Reply-To: <20050206132642.GP163@obiwan.tataz.chchile.org>
On Sun, 6 Feb 2005, Jeremie Le Hen wrote:

> Hi Robert,
>
> > This would seem to place it closer to 4.x than 5.x -- possibly a
> > property of a lack of preemption.  Again, the differences here are so
> > small it's a bit difficult to reason using them.
>
> Thanks for the result.  I'm rather doubtful now: I thought it was an
> established fact that RELENG_5 has worse performance than RELENG_4 for
> the moment, partly due to a lack of micro-optimizations.  There have
> indeed been numerous reports of weak performance on 5.x.  Seeing your
> results, it appears that RELENG_4, RELENG_5 and CURRENT are in fact
> very close.  What should we think, then?

You should think that benchmark results are a property of several factors:

- Workload
- Software baseline
- Hardware configuration
- Software configuration
- Experimental method
- Effectiveness of documentation

Let's evaluate each:

- The workload was postmark in a relatively stock configuration.  I
  selected a smaller number of transactions than some other reporters,
  based on the fact that my hardware is quite a bit slower and I wanted to
  try to get coverage of a number of versions.  I selected a 90-ish second
  run.  The postmark benchmark is basically about effective caching,
  efficient I/O processing, and how the file system manages meta-data.

- Software baseline: I chose to run with 4.x, 5.x, and 6.x kernels, all
  configured for "production" use, i.e., no debugging features enabled.  I
  also used a statically compiled 4.x postmark binary for all tests on all
  versions, to try to avoid the effects of compiler changes, etc.  I was
  primarily interested in evaluating the performance of the kernel as a
  variable.

- Hardware configuration: I'm using somewhat dated PIII MP hardware with a
  relatively weak I/O path.  It was the hardware on hand that could most
  easily be preempted for testing.  The hardware has a pretty good CPU:I/O
  performance ratio, meaning that with many interesting workloads the work
  will be I/O-bound, not CPU-bound.  It becomes a question of feeding the
  CPUs and keeping the available I/O path used effectively.

- Software configuration: I network booted the kernel, and used one of two
  user spaces on disk -- a 4.x world and a 6.x world.  However, I used a
  single shared UFS1 partition for the postmark target.  My hope was that
  static linking would eliminate issues involving library changes, and
  that using the same file system partition would help reduce disk
  location effects (note that disk performance varies substantially based
  on the location of data on the platter -- if you lay out a disk into
  several partitions, they will have quite different performance
  properties, often differing by more than the effect you're trying to
  measure).  However, as a result I used UFS1 for both tests, which is not
  the default install configuration for FreeBSD 5.x and 6.x.

- Experimental method: I attempted to control additional variables as much
  as possible.  However, I used a small number of runs per configuration:
  two.  I selected that number to illustrate whether there were caching
  effects in play between multiple runs without reboots.  The numbers
  suggest slight caching effects, but not huge ones.  Two runs aren't
  enough to give a sampling distribution that can be analyzed -- on the
  other hand, they were relatively long runs resulting in "mean results",
  meaning that we benefited from a sampling effect and a smoothing effect
  by virtue of the experiment design.  (A rough sketch of how one might
  script a larger sample follows.)
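If someone wanted to collect that larger sample, a small harness along the
following lines would be a reasonable starting point.  This is only a
rough sketch added for illustration: the postmark invocation and the
"postmark.cfg" file name are placeholders rather than my actual setup, and
it assumes postmark will read its commands from a file named on the
command line.

    #!/usr/bin/env python
    # Rough harness sketch: run the benchmark several times for one kernel
    # configuration and report the mean and spread of wall-clock duration.
    # The command below is a placeholder, not the configuration used in
    # the tests described above.
    import statistics
    import subprocess
    import time

    RUNS = 10                               # a larger sample than two
    COMMAND = ["postmark", "postmark.cfg"]  # hypothetical invocation

    durations = []
    for i in range(RUNS):
        start = time.time()
        subprocess.run(COMMAND, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.time() - start)
        print("run %d: %.1f seconds" % (i + 1, durations[-1]))

    print("mean %.1f s, stddev %.2f s over %d runs"
          % (statistics.mean(durations), statistics.stdev(durations), RUNS))

Repeating that across kernels, rebooting between configurations to control
caching, would give distributions that could actually be compared.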
To run this experiment properly, you'd want to distinguish the
caching/non-caching cases better, control the time between runs better,
and have larger samples.  In order to try to explain the results I got, I
waved my hands at CPU cost, and will go into that some more below.  I did
not test the CPU load during the experiment in a rigorous or reproducible
way.

- Effectiveness of documentation: my experiment was documented, although
  not in great detail.  I neglected to document the version of postmark
  (1.5c), the partition layout details, and the complete configuration
  details.  I've included more here.

In my original results post, I demonstrated that, subject to the
conditions of the tests (documented above and previously), FreeBSD 5.x/6.x
performance was in line with 4.x performance, or perhaps marginally
faster.  This surprised me too: I expected to see a 5%-10% performance
drop on UP based on increased overhead, and hoped for a moderate,
measurable SMP performance gain relative to 4.x.

On getting the results I did, I reran a couple of sample cases --
specifically, 4.x and 6.x kernels on SMP with some informal measurement of
system time.  I concluded that the systems were basically idle throughout
the tests, which was a likely result of the I/O path being the performance
bottleneck.  It's likely that the slight performance improvement between
4.x and 6.x relates to preemption and the ability to turn around I/Os in
the ATA driver faster, or maybe some minor pipelining effect in GEOM or
such.  It would be interesting to know what it is that makes 6.x faster,
but it may be hard to find out given the amount of change in the system.

I also informally concluded that 6.x was seeing a higher percentage of
system time than 4.x.  This result needs to be investigated properly in an
experiment of its own, since it was based on informal watching of %system
in systat, combined with a subjective observation that the numbers
appeared bigger.  An experiment involving the use of time(1) would be a
good place to start (a rough sketch of one way to do that appears below).
What's interesting about this informal observation (not a formal
experimental conclusion!) is that it might explain the differing postmark
results from some of the other reporters.  The system I tested on has
decent CPU oomph, but relatively slow ATA drive technology -- not a RAID,
not UDMA100, etc.  So if a bit more CPU was burned to get slightly more
efficient use of the I/O channel, that was immediately visible as a
positive factor.  On systems with much stronger I/O capabilities, perhaps
to the point of being CPU-bound, that can hurt rather than help, as there
are fewer resources available to support the critical path.

Another point that may have helped my configuration is that it ran on a
PIII, where the relative costs of synchronization primitives are much
lower.  A few months ago, I ran a set of micro-benchmarks illustrating
that on the P4 architecture, synchronization primitives are many times
more expensive relative to regular operations than on previous
architectures.  It could be that the instruction blend came out "net
worse" on 5.x/6.x systems on P4-based hardware.

Another point in favor of the configuration I was running is that the ATA
driver is MPSAFE.  This means its interrupt handler is able to preempt
most running code, and that it can execute effectively in parallel with
other parts of the kernel (including the file system).  Several of the
reported results were on the twe storage adapter, which does not have that
property.
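Following up on the time(1) suggestion above, here is a rough sketch of
how one might capture user vs. system CPU time for a benchmark run.
Again, this is illustrative only: the postmark invocation is a placeholder
and this was not part of the runs reported here.

    #!/usr/bin/env python
    # Sketch: report real/user/sys time for one benchmark run, in the same
    # spirit as time(1).  The command is a placeholder, not the actual
    # configuration used in the tests above.
    import resource
    import subprocess
    import time

    COMMAND = ["postmark", "postmark.cfg"]  # hypothetical invocation

    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    wall_start = time.time()
    subprocess.run(COMMAND, check=True, stdout=subprocess.DEVNULL)
    wall = time.time() - wall_start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    user = after.ru_utime - before.ru_utime
    sys_time = after.ru_stime - before.ru_stime
    print("%.1f real  %.1f user  %.1f sys  (%.1f%% system)"
          % (wall, user, sys_time,
             100.0 * sys_time / wall if wall else 0.0))

Comparing the %system figure across 4.x and 6.x kernels over a reasonable
number of runs would turn the systat impression into something measurable.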
Last night, Scott Long mailed me patches to fix dumping on twe, and also
to make it MPSAFE.  I hope to run some stability testing on those, and
then hopefully we can get the patches into the hands of people doing
performance testing with twe and see if they help.  FWIW, similar changes
on amr and ips have resulted in substantial I/O improvements, primarily by
increasing transactions-per-second throughput through reduced latency in
processing the I/O transactions.  It's easy to imagine this having a
direct effect on a benchmark that is really a measure of meta-data
transaction throughput.

Finally, my slightly hazy recollection of earlier posts is that postmark
generally showed fairly consistent performance between FreeBSD revisions
(excepting the NFS async breakage), but that Linux seemed to tromp all
over us on meta-data operations.  There was some hypothesizing by Matt and
Poul-Henning that this was a result of having what Poul-Henning refers to
as a "Lemming Syncer" -- i.e., a design issue in the way we stream data to
disk.

Robert N M Watson