Date:        Sat, 23 Dec 2006 21:27:38 +1100 (EST)
From:        Bruce Evans <bde@zeta.org.au>
To:          Mark Kirkwood <markir@paradise.net.nz>
Cc:          freebsd-performance@freebsd.org
Subject:     Re: Cached file read performance
Message-ID:  <20061223205324.B1533@epsplex.bde.org>
In-Reply-To: <20061223175413.W1116@epsplex.bde.org>
References:  <458B3651.8090601@paradise.net.nz>
             <20061222171431.L18486@delplex.bde.org>
             <458C7FB1.9020002@paradise.net.nz>
             <20061223175413.W1116@epsplex.bde.org>
On Sat, 23 Dec 2006, I wrote:

> The problem becomes smaller as the read block size approaches the file
> system block size and vanishes when the sizes are identical.  Then
> there is apparently a different (smaller) problem:
>
> Read size 16K, random:
> %%%
> granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ns/call  ns/call  name
>  49.1      0.565     0.565    25643    22037    22037  copyout [11]
>  12.6      0.710     0.145        0  100.00%           mcount [14]
>   8.8      0.811     0.101    87831     1153     1153  vm_page_splay [17]
>   7.0      0.892     0.081   112906      715      715  buf_splay [19]
>   6.1      0.962     0.070        0  100.00%           mexitcount [20]
>   3.4      1.000     0.039        0  100.00%           cputime [22]
>   1.2      1.013     0.013    86883      153      181  vm_page_unwire [28]
>   1.1      1.027     0.013        0  100.00%           user [29]
>   1.1      1.040     0.013    21852      595     3725  getnewbuf [18]
> %%%
>
> Read size 16K, sequential:
> %%%
> granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ns/call  ns/call  name
>  57.1      0.550     0.550    25643    21464    21464  copyout [11]
>  14.2      0.687     0.137        0  100.00%           mcount [12]
>   6.9      0.754     0.066        0  100.00%           mexitcount [15]
>   4.2      0.794     0.040   102830      391      391  vm_page_splay [19]
>   3.8      0.830     0.037        0  100.00%           cputime [20]
>   1.4      0.844     0.013   102588      130      130  buf_splay [22]
>   1.3      0.856     0.012    25603      488     1920  getnewbuf [17]
>   1.0      0.866     0.009    25606      368      368  pmap_qremove [24]
> %%%
>
> Now the splay routines are called almost the same number of times, but
> take much longer in the random case.  buf_splay() seems to be unrelated
> to vm -- it is called from gbincore() even if the buffer is already
> in the buffer cache.  It seems quite slow for that -- almost 1 us just
> to look up compared with 21 us to copyout a 16K buffer.  Linux-sized
> buffers would take only 1.5 us to copy out, and then 1 us to look them
> up is clearly too much.  Another benchmark shows gbincore() taking
> 501 ns each to look up 64 in-buffer-cache buffers for a 1MB file --
> this must be the best case for it (all these times are for -current on
> an Athlon XP2700 overclocked to 2025MHz).  The generic hash function
> used in my compiler takes 40 ns to hash a 16-byte string on this
> machine.

FreeBSD-~4.10 is faster.  The difference is especially noticeable when
the read size is the same as the fs block size (16K, as above).  Then I
get the following speeds:

~4.10, random:      580MB/s
~4.10, sequential:  580MB/s
~5.2,  random:      575MB/s
~5.2,  sequential:  466MB/s

All with kernel profiling not configured, and no INVARIANTS etc.  ~5.2
is quite different from -current, but it has buf_splay() and
vm_page_splay(), and behaves similarly in this benchmark.

With profiling ~4.10, read size 16K, sequential (the numbers in
parentheses are for the random case):
%%%
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 51.1      0.547     0.547    25643    21323    21323  generic_copyout [9]
 17.3      0.732     0.185        0  100.00%           mcount [10]
  7.9      0.817     0.085        0  100.00%           mexitcount [13]
  5.0      0.870     0.053        0  100.00%           cputime [16]
  1.9      0.891     0.020    51207      395      395  gbincore [20]       (424 for random)
  1.4      0.906     0.015   102418      150      253  vm_page_wire [18]   (322)
  1.3      0.920     0.014   231218       62       62  splvm [23]
  1.3      0.934     0.014    25603      541     2092  allocbuf [15]       (2642)
  1.0      0.945     0.010   566947       18       18  splx <cycle 1> [25]
  1.0      0.955     0.010   102122      100      181  vm_page_unwire [21]
  0.9      0.964     0.009    25606      370      370  pmap_qremove [27]
  0.9      0.973     0.009    25603      359     2127  getnewbuf [14]      (2261)
%%%

There is little difference for the sequential case, but the old
gbincore() and buffer allocation routines are much faster for the
random case.
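(The workload being compared is essentially block-aligned reads of an
already-cached file, done sequentially or at random offsets.  A minimal
userland sketch of that kind of microbenchmark is below; the pass
count, option handling and reporting are arbitrary choices here, not
the exact program used for the numbers above.)

%%%
/*
 * readbench.c -- sketch of a cached-read microbenchmark: block-aligned
 * pread()s of a file that should already fit in the buffer cache,
 * either sequentially or at random offsets.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define	PASSES	4		/* arbitrary; more passes = steadier numbers */

int
main(int argc, char **argv)
{
	struct stat sb;
	struct timeval t0, t1;
	double mb, secs;
	off_t i, nblocks, nreads, off;
	ssize_t n;
	size_t bsize;
	char *buf;
	int fd, randomize;

	if (argc != 4) {
		fprintf(stderr, "usage: %s file bsize seq|random\n", argv[0]);
		exit(1);
	}
	bsize = (size_t)atol(argv[2]);
	randomize = (strcmp(argv[3], "random") == 0);
	fd = open(argv[1], O_RDONLY);
	buf = malloc(bsize);
	if (fd == -1 || buf == NULL || bsize == 0 ||
	    fstat(fd, &sb) != 0 || sb.st_size < (off_t)bsize) {
		perror(argv[1]);
		exit(1);
	}
	nblocks = sb.st_size / bsize;
	nreads = nblocks * PASSES;
	gettimeofday(&t0, NULL);
	for (i = 0; i < nreads; i++) {
		/* Same set of blocks in both modes; only the order differs. */
		off = (randomize ? random() % nblocks : i % nblocks) *
		    (off_t)bsize;
		n = pread(fd, buf, bsize, off);
		if (n != (ssize_t)bsize) {
			perror("pread");
			exit(1);
		}
	}
	gettimeofday(&t1, NULL);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	mb = (double)nreads * bsize / (1024 * 1024);
	printf("%s %zu-byte reads: %.0f MB/s\n", argv[3], bsize, mb / secs);
	return (0);
}
%%%

Read the file once first (or run the program twice and take the second
result) so that it is fully cached, and make sure it fits in RAM, e.g.
"./readbench file 16384 seq" vs. "./readbench file 16384 random".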
With profiling ~4.10, read size 4K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.63 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 27.3      0.720     0.720        0  100.00%           mcount [8]
 22.5      1.312     0.592   102436     5784     5784  generic_copyout [10]
 12.6      1.643     0.331        0  100.00%           mexitcount [13]
  7.9      1.850     0.207        0  100.00%           cputime [15]
  2.9      1.926     0.076   189410      402      402  gbincore [20]
  2.3      1.988     0.061   348029      176      292  vm_page_wire [18]
  2.2      2.045     0.058    87010      662     2500  allocbuf [14]
  2.0      2.099     0.053   783280       68       68  splvm [22]
  1.6      2.142     0.043        0   99.33%           user [24]
  1.6      2.184     0.042  2041759       20       20  splx <cycle 3> [26]
  1.3      2.217     0.034   347298       97      186  vm_page_unwire [21]
  1.2      2.249     0.032    86895      370      370  pmap_qremove [28]
  1.1      2.279     0.029    87006      337     2144  getnewbuf [16]
  0.9      2.303     0.024    86891      280     1617  vfs_vmio_release [17]
%%%

Now the result is little different from -current -- the random case is
almost as slow as in -current according to the total time, although
this may be an artifact of profiling (allocbuf takes 2500 ns total in
~4.10 vs 4025 ns in -current).

Bruce