Date: Sat, 23 Dec 2006 20:38:04 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Mark Kirkwood
Cc: freebsd-performance@freebsd.org
Subject: Re: Cached file read performance

On Sat, 23 Dec 2006, Mark Kirkwood wrote:

> Bruce Evans wrote:
>>
>> None was attached.
>>
>
> (meaning the c prog yes?) I notice that it is stripped out from the web
> archive... so here's a link:
>
> http://homepages.paradise.net.nz/markir/download/freebsd/readtest.c

>> However, I
>> couldn't see much difference between block sizes of 16, 32 and 64K for
>> a small (32MB) md-malloced file system with a simple test program.
>> All versions got nearly 1/4 of bandwidth of main memory (800MB/S +-10%
>> on an AthlonXP with ~PC3200 memory).

Now I see the problem with a normal file system.  The main difference in
my quick test was probably that 32MB is too small to show the problem.
32MB fits in the buffer cache, but slightly larger files only fit in the
VMIO cache, and the main problem is the interaction of these caches.
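For reference, since the program was stripped from the archive: the test
boils down to reading a file with a fixed block size, either sequentially
or at random block-aligned offsets, and reporting the throughput.  The
sketch below is only a rough reconstruction of that idea, not Mark's
readtest.c -- the argument handling and output format are made up:

%%%
/*
 * Read-test sketch: read a file with a fixed block size, either
 * sequentially or at random block-aligned offsets, and report MB/s.
 * Illustration only, not the actual readtest.c.
 */
#include <sys/types.h>
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        struct timeval start, end;
        char *buf;
        off_t nblocks, off;
        size_t bsize;
        long i;
        int fd, random_order;
        double secs;

        if (argc != 4)
                errx(1, "usage: readtest file blocksize seq|random");
        bsize = (size_t)atol(argv[2]);
        random_order = strcmp(argv[3], "random") == 0;

        if ((fd = open(argv[1], O_RDONLY)) == -1)
                err(1, "open");
        nblocks = lseek(fd, 0, SEEK_END) / bsize;
        if ((buf = malloc(bsize)) == NULL)
                err(1, "malloc");

        gettimeofday(&start, NULL);
        for (i = 0; i < nblocks; i++) {
                /* Same number of reads either way; only the order differs. */
                off = (random_order ? random() % nblocks : i) * (off_t)bsize;
                if (pread(fd, buf, bsize, off) == -1)
                        err(1, "pread");
        }
        gettimeofday(&end, NULL);
        secs = (end.tv_sec - start.tv_sec) +
            (end.tv_usec - start.tv_usec) / 1e6;
        printf("%.0f MB/s\n",
            nblocks * (double)bsize / secs / (1024 * 1024));
        return (0);
}
%%%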
This behaviour is easy to understand using kernel profiling.  Part of a
profile for the random case (reading 400MB with a block size of 4K --
smaller block sizes make larger differences):

%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.70 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 22.5      0.608     0.608   102466     5933     5933  copyout [13]
 21.2      1.180     0.573        0  100.00%           mcount [14]
 10.2      1.457     0.277        0  100.00%           mexitcount [17]
 10.2      1.733     0.276   450823      612      612  buf_splay [18]
  9.7      1.995     0.262   348917      751      751  vm_page_splay [20]
  5.7      2.148     0.153        0  100.00%           cputime [22]
  2.0      2.202     0.054   348017      154      179  vm_page_unwire [26]
  1.8      2.252     0.050    87127      573     3487  getnewbuf [16]
  1.7      2.298     0.047        0  100.00%           user [29]
  1.3      2.332     0.034    87132      388      388  pmap_qremove [31]
  1.1      2.363     0.031    87127      351     4025  allocbuf [15]
  1.0      2.388     0.026   348505       74      117  vm_page_wire [30]
%%%

Sequential case:

%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.35 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 39.3      0.530     0.530   102443     5178     5178  copyout [11]
 23.7      0.850     0.320        0  100.00%           mcount [12]
 11.5      1.004     0.154        0  100.00%           mexitcount [13]
  6.3      1.090     0.085        0  100.00%           cputime [16]
  3.0      1.130     0.040   102816      389      389  vm_page_splay [19]
  1.6      1.151     0.021   409846       52       59  _lockmgr [22]
  1.3      1.168     0.017        0  100.00%           user [23]
  0.9      1.180     0.012   102617      117      117  buf_splay [26]
...
  0.7      1.200     0.009    25603      356     1553  getnewbuf [20]
...
  0.6      1.208     0.009    25603      337     2197  allocbuf [17]
...
  0.6      1.224     0.008   101915       78       96  vm_page_unwire [29]
...
  0.5      1.239     0.007    25608      274      274  pmap_qremove [32]
...
  0.2      1.316     0.002   102409       20       35  vm_page_wire [44]
%%%

It is a buffer-cache/vm problem like I suspected.  The file system block
size is 16K, so with a read size of 4K, random reads allocate a new
buffer about 16K/4K = 4 times more often than sequential reads.
Allocation involves vm stuff which is very expensive (it takes about
1.25 times as long as the actual copying).  I believe it was even more
expensive before it used splay trees.

More details from separate runs:

Random:
%%%
-----------------------------------------------
                0.00    0.00       5/102404     breadn [237]
                0.01    0.84  102399/102404     cluster_read [10]
[11]    31.3    0.01    0.84  102404            getblk [11]
                0.03    0.32   87126/87126          allocbuf [15]
                0.05    0.25   87126/87126          getnewbuf [16]
                0.01    0.12  189530/189530         gbincore [23]
                0.00    0.06   87126/87126          bgetvp [27]
                0.00    0.00   15278/409852         _lockmgr [32]
                0.00    0.00   15278/15278          bremfree [144]
-----------------------------------------------
%%%

Sequential:
%%%
-----------------------------------------------
                0.00    0.00       6/102404     breadn [371]
                0.01    0.12  102398/102404     cluster_read [14]
[15]     9.5    0.01    0.12  102404            getblk [15]
                0.01    0.05   25603/25605          allocbuf [17]
                0.01    0.03   25603/25603          getnewbuf [18]
                0.00    0.01  128007/128007         gbincore [31]
                0.00    0.00   76801/409846         _lockmgr [22]
                0.00    0.00   25603/25603          bgetvp [39]
                0.00    0.00   76801/76801          bremfree [66]
-----------------------------------------------
%%%

getblk() is called the same number of times for each.  In the sequential
case, it uses a previously allocated buffer (almost always one allocated
just before) with a probability of almost exactly 0.75, but in the
random case it uses a previously allocated buffer with a probability of
about 0.13.  The second probability is only larger than epsilon because
there is a buffer pool with a size of a few thousand.  Sometimes you get
a hit in this pool, but for large working data sets you mostly don't;
then the buffer must be constituted from vm (or the disk).
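The 0.75 and 0.13 figures follow directly from the access pattern and
the pool size, and a quick user-level simulation shows the same shape.
In the sketch below the pool size of 2048 and the crude FIFO-ish
replacement are only guesses (the kernel's pool and its replacement
policy are more elaborate); it gives exactly 0.75 for the sequential
case and a bit under 0.1 for the random case:

%%%
/*
 * Rough simulation of buffer reuse in getblk() for 4K reads of a 400MB
 * file on a file system with 16K blocks.  Sequential reads reuse the
 * buffer they just used 3 times out of 4; random reads only get a hit
 * when the block happens to still be in the (assumed) pool.
 */
#include <stdio.h>
#include <stdlib.h>

#define FILE_SIZE       (400L * 1024 * 1024)
#define READ_SIZE       (4L * 1024)
#define FS_BSIZE        (16L * 1024)
#define NBUF            2048            /* "a few thousand" buffers */

static long pool[NBUF];

static double
simulate(int random_order)
{
        long nreads = FILE_SIZE / READ_SIZE;
        long nblocks = FILE_SIZE / FS_BSIZE;
        long i, j, blk, hits = 0;

        for (i = 0; i < NBUF; i++)
                pool[i] = -1;
        for (i = 0; i < nreads; i++) {
                blk = random_order ? random() % nblocks :
                    i * READ_SIZE / FS_BSIZE;
                for (j = 0; j < NBUF; j++)
                        if (pool[j] == blk)
                                break;
                if (j < NBUF)
                        hits++;                 /* buffer still pooled */
                else
                        pool[i % NBUF] = blk;   /* crude FIFO replacement */
        }
        return ((double)hits / nreads);
}

int
main(void)
{
        printf("sequential reuse probability: %.2f\n", simulate(0));
        printf("random reuse probability:     %.2f\n", simulate(1));
        return (0);
}
%%%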
This problem is (now) fairly small because most working data sets aren't
large compared with the buffer pool.  It was much larger 10 years ago,
when the size of the buffer pool was only a few hundred.  It was larger
still more than about 12 years ago in FreeBSD, before the buffer cache
was merged with vm.  Then there was only the buffer pool, with nothing
between it and the disk, and it was too small.

Linux might not have this problem because it is still using a simple and
better buffer cache.  At least 10-15 years ago, its buffer cache had a
fixed block size of 1K, where FreeBSD's buffer cache had a variable
block size with the usual size equal to the ffs file system block size
of 4K or 8K.  With a block size of 1K, at least 4 times as many buffers
are needed to compete on storage with a block size of 4K, and the buffer
allocation routines need to be at least 4 times as efficient to compete
on efficiency.  Linux actually had a much larger multiple than 4 for the
storage.  I'm not sure about the efficiency factor, but it wasn't too
bad (any in-memory buffer management is better than waiting for the
disk, and the small fixed size of 1K is much easier to manage than
larger, variable sizes).

The FreeBSD buffer management was and is especially unsuited to file
systems with small block sizes like msdosfs floppies (512-blocks) and
the original version of Linux's extfs (1K-blocks).  With a buffer cache
(pool) size of 256, you could manage a whole 128KB of 512-blocks and got
enormous thrashing when accessing a 1200KB floppy.  With vm backing and
a buffer cache size of a few thousand, the thrashing only occurs in
memory, and a 1200KB floppy now barely fits in the buffer cache (pool).
Also, no one uses 1200KB floppies.  More practically, this problem makes
msdosfs on hard disks (normally 4K-blocks) and ext2fs on hard disks (1K
or 4K blocks) slower than they should be under FreeBSD.  vm backing and
clustering mask only some of the slowness.

The problem becomes smaller as the read block size approaches the file
system block size and vanishes when the sizes are identical.  Then there
is apparently a different (smaller) problem:

Read size 16K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 49.1      0.565     0.565    25643    22037    22037  copyout [11]
 12.6      0.710     0.145        0  100.00%           mcount [14]
  8.8      0.811     0.101    87831     1153     1153  vm_page_splay [17]
  7.0      0.892     0.081   112906      715      715  buf_splay [19]
  6.1      0.962     0.070        0  100.00%           mexitcount [20]
  3.4      1.000     0.039        0  100.00%           cputime [22]
  1.2      1.013     0.013    86883      153      181  vm_page_unwire [28]
  1.1      1.027     0.013        0  100.00%           user [29]
  1.1      1.040     0.013    21852      595     3725  getnewbuf [18]
%%%

Read size 16K, sequential:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 57.1      0.550     0.550    25643    21464    21464  copyout [11]
 14.2      0.687     0.137        0  100.00%           mcount [12]
  6.9      0.754     0.066        0  100.00%           mexitcount [15]
  4.2      0.794     0.040   102830      391      391  vm_page_splay [19]
  3.8      0.830     0.037        0  100.00%           cputime [20]
  1.4      0.844     0.013   102588      130      130  buf_splay [22]
  1.3      0.856     0.012    25603      488     1920  getnewbuf [17]
  1.0      0.866     0.009    25606      368      368  pmap_qremove [24]
%%%

Now the splay routines are called almost the same number of times, but
take much longer in the random case.  buf_splay() seems to be unrelated
to vm -- it is called from gbincore() even if the buffer is already in
the buffer cache.
It seems quite slow for that -- almost 1 uS just to look up, compared
with 21 uS to copyout a 16K buffer.  Linux-sized buffers would take only
about 1.5 uS to copy out, and then 1 uS to look them up is clearly too
much.  Another benchmark shows gbincore() taking 501 nS per lookup for
the 64 in-buffer-cache buffers of a 1MB file -- this must be the best
case for it (all these times are for -current on an Athlon XP2700
overclocked to 2025MHz).  The generic hash function used in my compiler
takes 40 nS to hash a 16-byte string on this machine.

The merged vm/buffer cache is clearly implemented suboptimally.  Direct
access to VMIO pages might be better, but it isn't clear how to
implement it without getting the slowest parts of vm for all accesses.
The buffer cache is now essentially just a cache of vm mappings, with vm
mapping being so slow that it needs to be cached.  The last thing you
want to do is throw away this cache and have to do a slow mapping for
every access.  I think the correct method is to wait for larger virtual
address spaces (already here) and use sparse mappings more.

Bruce
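PS: for scale on the hash comparison, a throwaway timing loop like the
one below is enough to see that a hash step costs a few tens of
nanoseconds when compiled with optimization, far less than the ~1 uS
splay lookups.  The multiply-and-add hash here is only a generic
stand-in, not the actual function from the compiler, and the iteration
count is arbitrary:

%%%
/*
 * Time a simple multiply-and-add hash of a 16-byte string.  The hash is
 * a generic stand-in; the point is only the order of magnitude of a
 * hash step versus a splay-tree lookup.
 */
#include <sys/time.h>

#include <stdio.h>

#define NITER   10000000

static unsigned
hash16(const char *s)
{
        unsigned h = 0;
        int i;

        for (i = 0; i < 16; i++)
                h = h * 33 + (unsigned char)s[i];
        return (h);
}

int
main(void)
{
        struct timeval start, end;
        char key[16] = "0123456789abcde";
        unsigned sink = 0;
        long i;
        double ns;

        gettimeofday(&start, NULL);
        for (i = 0; i < NITER; i++) {
                key[0] = 'a' + (i & 15);        /* defeat constant folding */
                sink += hash16(key);
        }
        gettimeofday(&end, NULL);
        ns = ((end.tv_sec - start.tv_sec) * 1e9 +
            (end.tv_usec - start.tv_usec) * 1e3) / NITER;
        printf("%.1f ns/hash (sink %u)\n", ns, sink);
        return (0);
}
%%%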