Date: Mon, 21 Jun 2004 17:15:09 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Mikhail Teterin <Mikhail.Teterin@Murex.com>
Cc: current@freebsd.org
Subject: Re: read vs. mmap (or io vs. page faults)
Message-ID: <200406220015.i5M0F9br036789@apollo.backplane.com>
References: <Pine.BSF.4.21.0406201716191.23541-100000@InterJet.elischer.org> <200406211057.31103@aldan> <200406211952.i5LJqWSl035702@apollo.backplane.com> <200406211810.03629@misha-mx.virtual-estates.net>
:The mmap interface is supposed to be more efficient -- theoretically --
:because it requires one less buffer-copying, and because it (together
:with the possible madvise()) provides the kernel with more information
:thus enabling it to make better (at least -- no worse) decisions.

Well, I think you forgot my earlier explanation regarding buffer
copying.  Buffer copying is a very cheap operation if it occurs within
the L1 or L2 cache, and that is precisely what is happening when you
read() into a fixed buffer in a loop in a C program... your buffer is
fixed in memory and is almost guaranteed to be in the L1/L2 cache, which
means that the extra copy operation is very fast on a modern processor.
It's something like 12-16 GBytes/sec to the L1 cache on an Athlon 64,
for example, and 3 GBytes/sec uncached to main memory.

Consider the cpu time cost, then, of the local copy on a 2GB file...
the cpu time cost on an AMD64 is about 2/12 of one second.  This is the
number mmap would have to beat.  As you can see by your timing results,
even on your fastest box, processing a file around that size is only
going to incur 1-2 seconds of real time overhead to do the extra buffer
copy.  2 seconds is a hard number to beat.

This is something you can calculate yourself.  Time a dd from /dev/zero
to /dev/null.

    crater# dd if=/dev/zero of=/dev/null bs=32k count=8192
    268435456 bytes transferred in 0.244561 secs (1097620804 bytes/sec)

    amd64# dd if=/dev/zero of=/dev/null bs=32k count=8192
    268435456 bytes transferred in 0.066994 secs (4006846790 bytes/sec)

    amd64# dd if=/dev/zero of=/dev/null bs=16m count=32
    536870912 bytes transferred in 0.431774 secs (1243407512 bytes/sec)

Try it for different buffer sizes (16K through 16MB) and you will get a
feel for how the L1 and L2 caches affect copying bandwidth.  These
numbers are reasonably close to the raw memory bandwidth available to
the cpu (and will be different depending on whether the buffer fits in
the L1 or L2 caches, or doesn't fit at all).

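The copy-cost estimate above is simple enough to check by hand, but a
short sketch may help when plugging in numbers from other machines.  The
function names are illustrative only, "2GB" is assumed to mean 2 * 10^9
bytes, and the bandwidth figures are the ones quoted in the message
(12-16 GBytes/sec to L1, 3 GBytes/sec uncached).

```python
GB = 10**9  # assumption: decimal gigabytes


def copy_seconds(nbytes, bytes_per_sec):
    """Estimated cpu time to copy nbytes at a given copy bandwidth."""
    return nbytes / bytes_per_sec


def dd_rate(nbytes, seconds):
    """Bytes per second, computed the way dd reports it."""
    return nbytes / seconds


if __name__ == "__main__":
    size = 2 * GB
    # Bandwidth figures taken from the message above.
    for label, bw in (("L1 (low)", 12 * GB),
                      ("L1 (high)", 16 * GB),
                      ("uncached", 3 * GB)):
        print(f"{label:10s} {copy_seconds(size, bw):.3f} s")
    # Cross-check one of the dd lines quoted above.
    print(f"{dd_rate(268435456, 0.066994):.0f} bytes/sec")
```

The dd cross-check will not match the printed rate to the last digit,
since dd's own elapsed time carries more precision than the six decimal
places it prints.
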
The mmap interface is not supposed to be more efficient, per se.  Why
would it be?  There are overheads involved with mapping the page table
entries and taking faults to map more.  Even if you pre-mapped
everything, there are still overheads involved in populating the page
table and performing invlpg operations on the TLB to reload the entry,
and for large data sets there is overhead involved with removing page
table entries and invalidating the pte.  On a modern cpu, where an L1
cache copy is a two cycle streaming operation, the several hundred (or
more) cycles it takes to process a page fault or even just populate the
page table is equivalent to a lot of copied bytes.

This immediately puts mmap() at a disadvantage on a modern cpu, but of
course it also depends on what the data processing loop itself is doing.
If the data processing loop is sensitive to the L1 cache then processing
larger chunks of data is going to make it more efficient, and mmap() can
certainly provide that where read() might require buffers too large to
fit comfortably in the L1/L2 cache.  On the other hand, if the
processing loop is relatively insensitive to the L1 cache (i.e. it's
small), then you can afford to process the data in smaller chunks, like
16K, without any significant penalty.

mmap() is not designed to streamline large demand-page reads of data
sets much larger than main memory.  mmap() works best for data that is
already cached in the kernel, and even then it still has a fairly large
hurdle to overcome vs a streaming read().  This is a HARDWARE
limitation.  Drastic action would have to be taken in software to get
rid of this overhead (we'd have to use 4MB page table entries, which
come with their own problems).

The overhead required to manage a large mmap'd data set can skyrocket.
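
To make the two access patterns under discussion concrete, here is a
minimal sketch of both: a read() loop into one fixed 16K buffer, and a
read-only mapping of the whole file.  It is written in Python rather
than C for brevity, so interpreter overhead is mixed into any timings.
The function names and the choice of md5 as the processing step (taken
from the test quoted in this thread) are illustrative assumptions.

```python
import hashlib
import mmap


def md5_read(path, bufsize=16 * 1024):
    """Digest a file with a read() loop into one fixed, reused buffer."""
    h = hashlib.md5()
    buf = bytearray(bufsize)           # fixed buffer, reused every pass
    view = memoryview(buf)
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)        # one read() call per iteration
            if not n:
                break
            h.update(view[:n])         # process only the bytes just read
    return h.hexdigest()


def md5_mmap(path):
    """Digest the same file through a read-only memory mapping."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        if f.seek(0, 2) == 0:          # an empty file cannot be mapped
            return h.hexdigest()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            h.update(m)                # pages are faulted in as touched
    return h.hexdigest()
```

Both functions return the same digest for the same file, so they can be
timed against each other on identical input.
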
FreeBSD (and DragonFly) have heuristics that attempt to detect
sequential operations like this with mmap'd data and to depress the page
priority behind the read (so: read-ahead and depress-behind), and this
works, but it only mitigates the additional overhead somewhat; it
doesn't get rid of it.

For linear processing of large data sets you almost universally want to
use a read() loop.  There's no good reason to use mmap().

:=:	read: 10.619u 23.814s 1:17.67 44.3% 62+274k 11255+0io 0pf+0w
:=
:Well, now we are venturing into the domain of humans' subjective
:perception... I'd say, 12% is plenty, actually. This is what some people
:achieve by rewriting stuff in assembler -- and are proud, when it works
::-)

Nobody is going to stare at their screen for one minute and 17 seconds
and really care that something might take one minute and 27 seconds
instead of one minute and 17 seconds.  That's subjective truth.

The type of test you want to do is this:

    [start timing]
    [read all data into memory]
    [stop timing] -> print timing results

    [start timing]
    [process all data]
    [stop timing] -> print timing results

Now you have something practical you can look at... you can look at the
I/O bandwidth required to bring the data into memory without the
complications of whatever processing you are doing on the data being
mixed in.

*THEN* you can say something more definitive about the kernel overhead
required to get the data into memory first, because you can definitely
say what 'bandwidth', or data rate, has been achieved in getting the
data from the disk or kernel caches into your program's memory space
(faulted in and everything, ready to access).

You could then compare that to the times required to do it in a mixed
environment (read-processing loop).  If *THOSE* numbers are hugely
different then you can say something definitive about the relative
efficiency of the mixed mode processing versus just doing pure I/O, for
both read() and mmap() independently.

:...
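
The two-phase test outlined above could be sketched as follows.  This is
one possible arrangement under stated assumptions, not the harness used
by anyone in the thread: the function name is hypothetical, md5 stands
in for "process all data", and the whole file must fit in memory.

```python
import hashlib
import os
import time


def timed_phases(path, bufsize=16 * 1024):
    """Time 'read all data into memory' and 'process all data' apart."""
    size = os.path.getsize(path)
    data = bytearray(size)             # allocated before timing starts
    view = memoryview(data)

    t0 = time.perf_counter()           # [start timing]
    got = 0
    with open(path, "rb", buffering=0) as f:
        while got < size:              # [read all data into memory]
            n = f.readinto(view[got:got + bufsize])
            if not n:
                break
            got += n
    read_secs = time.perf_counter() - t0    # [stop timing]

    t0 = time.perf_counter()           # [start timing]
    digest = hashlib.md5(view[:got]).hexdigest()   # [process all data]
    proc_secs = time.perf_counter() - t0    # [stop timing]

    return got, read_secs, proc_secs, digest
```

Dividing the byte count by read_secs gives the data rate for the pure
I/O phase, which can then be compared against a mixed read-processing
loop over the same file.
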
:Put it into perspective -- 10-15% is usually the difference between
:the latest processor and the previous one. People are willing to pay
:hundreds of dollars premium...

15% is nothing anyone cares about except perhaps gamers.  I certainly
couldn't care less about 15%.  50%, on the other hand, is something that
I would care about.

But upgrading isn't just a function of raw cpu speed, it's also a
function of general improvements in hardware and hardware interfaces...
usb, usb2, firewire, sata, and so forth.

:...
:Besides, the differences can be higher. Here is from md5-ing a
:2097272832-bytes file over NFS (on a Gigabit network, no jumbo frames).
:The machine runs a FreeBSD-current on a single P4 2GHz:
:
:	mmap1: 17.115u 16.106s 2:20.84 23.5% 5+166k 0+0io 253421pf+0w
:	read1: 19.468u 12.179s 1:27.80 36.0% 4+163k 0+0io 0pf+0w
:	mmap2: 17.214u 13.265s 2:13.75 22.7% 5+165k 1+0io 204842pf+0w
:	read2: 19.142u 11.576s 1:20.22 38.2% 4+162k 0+0io 4pf+0w
:
:mmap is 87% slower (or read is 38% faster)! According to `systat -if',
:mmap was reading at about 13Mb/s, while read was consistently above
:20Mb/s.
:
:If this mmap-associated penalty is removed, the applications can save
:some memory by not using the BUFSIZ (or bigger) buffers, and the
:systems can save the time and effort of shuffling the memory from
:kernel buffers into user space (and flushing the instruction and data
:caches). The difference can be big -- on a CPU bound machine the sum
:of user time and system time is much smaller with mmap. For example,
:on this Solaris box running on Sparc-900MHz md5-ing a 16061698048-byte
:file (FreeBSD behaves similarly on the P2 400MHz reported earlier):
:
:	mmap: 215.290u 48.990s 7:18.81 60.2% 0+0k 0+0io 0pf+0w
:	read: 184.240u 142.350s 5:46.31 94.3% 0+0k 0+0io 0pf+0w
:	(264.28 vs. 326.59 CPU seconds)
:
:but read manages to saturate the CPU better -- 94% vs. 60% -- and win
:the "wall clock" race repeatedly...
:
:Yours,
:
:	-mi

I think this points to inefficiencies in NFS's getpages() interface
over its read() interface.  The read() interface (for NFS) definitely
has better read-ahead characteristics.  The NFS getpages() interface in
FreeBSD is about as primitive as it is possible to make it and still
work, and it's only marginally better in DragonFly (we get rid of some
KVM allocations and deallocations).

In fact, I don't even think the NFS getpages interface uses the IODs
like the read interface does.  I think it might actually be a
synchronous interface.

It would be nice if someone were to improve the NFS getpages interface.
I might do it myself, if I can find the time down the road.

					-Matt
					Matthew Dillon
					<dillon@backplane.com>