Date:      Sat, 23 Dec 2006 21:27:38 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Mark Kirkwood <markir@paradise.net.nz>
Cc:        freebsd-performance@freebsd.org
Subject:   Re: Cached file read performance
Message-ID:  <20061223205324.B1533@epsplex.bde.org>
In-Reply-To: <20061223175413.W1116@epsplex.bde.org>
References:  <458B3651.8090601@paradise.net.nz> <20061222171431.L18486@delplex.bde.org> <458C7FB1.9020002@paradise.net.nz> <20061223175413.W1116@epsplex.bde.org>

On Sat, 23 Dec 2006, I wrote:

> The problem becomes smaller as the read block size approaches the file
> system block size and vanishes when the sizes are identical.  Then
> there is apparently a different (smaller) problem:
>
> Read size 16K, random:
> %%%
> granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds
>
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ns/call  ns/call  name
> 49.1      0.565    0.565    25643    22037    22037  copyout [11]
> 12.6      0.710    0.145        0  100.00%           mcount [14]
>  8.8      0.811    0.101    87831     1153     1153  vm_page_splay [17]
>  7.0      0.892    0.081   112906      715      715  buf_splay [19]
>  6.1      0.962    0.070        0  100.00%           mexitcount [20]
>  3.4      1.000    0.039        0  100.00%           cputime [22]
>  1.2      1.013    0.013    86883      153      181  vm_page_unwire [28]
>  1.1      1.027    0.013        0  100.00%           user [29]
>  1.1      1.040    0.013    21852      595     3725  getnewbuf [18]
> %%%
>
> Read size 16K, sequential:
> %%%
> granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds
>
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ns/call  ns/call  name
> 57.1      0.550    0.550    25643    21464    21464  copyout [11]
> 14.2      0.687    0.137        0  100.00%           mcount [12]
>  6.9      0.754    0.066        0  100.00%           mexitcount [15]
>  4.2      0.794    0.040   102830      391      391  vm_page_splay [19]
>  3.8      0.830    0.037        0  100.00%           cputime [20]
>  1.4      0.844    0.013   102588      130      130  buf_splay [22]
>  1.3      0.856    0.012    25603      488     1920  getnewbuf [17]
>  1.0      0.866    0.009    25606      368      368  pmap_qremove [24]
> %%%
>
> Now the splay routines are called almost the same number of times, but
> take much longer in the random case.  buf_splay() seems to be unrelated
> to vm -- it is called from gbincore() even if the buffer is already
> in the buffer cache.  It seems quite slow for that -- almost 1 uS just
> to look up, compared with 21 uS to copyout a 16K buffer.  Linux-sized
> buffers would take only 1.5 uS to copy out, so 1 uS just to look them
> up is clearly too much.  Another benchmark shows gbincore() taking
> 501 nS each to look up 64 in-buffer-cache buffers for a 1MB file -- this
> must be the
> best case for it (all these times are for -current on an Athlon XP2700
> overclocked to 2025MHz).  The generic hash function used in my compiler
> takes 40 nS to hash a 16-byte string on this machine.
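
To see why the access pattern matters so much for a splay, here is a
userland sketch (toy code with invented names, not the kernel's
buf_splay()/vm_page_splay(), though those are top-down splays of much
the same shape): every lookup rotates its target to the root, so a
lookup near the previous one touches only a few nodes near the top,
while a random lookup walks and restructures a long search path almost
every time.
%%%
/*
 * splaybench.c: userland sketch (all names invented): a top-down splay,
 * roughly the shape of buf_splay(), looked up sequentially then randomly.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { long key; struct node *left, *right; };

/* Bring `key' (or its nearest neighbour) to the root, restructuring
 * the whole search path on the way down. */
static struct node *
splay(long key, struct node *t)
{
        struct node N, *l, *r, *y;
        if (t == NULL)
                return (NULL);
        N.left = N.right = NULL;
        l = r = &N;
        for (;;) {
                if (key < t->key) {
                        if (t->left == NULL)
                                break;
                        if (key < t->left->key) {
                                y = t->left;            /* rotate right */
                                t->left = y->right;
                                y->right = t;
                                t = y;
                                if (t->left == NULL)
                                        break;
                        }
                        r->left = t;                    /* link right */
                        r = t;
                        t = t->left;
                } else if (key > t->key) {
                        if (t->right == NULL)
                                break;
                        if (key > t->right->key) {
                                y = t->right;           /* rotate left */
                                t->right = y->left;
                                y->left = t;
                                t = y;
                                if (t->right == NULL)
                                        break;
                        }
                        l->right = t;                   /* link left */
                        l = t;
                        t = t->right;
                } else
                        break;
        }
        l->right = t->left;                             /* reassemble */
        r->left = t->right;
        t->left = N.right;
        t->right = N.left;
        return (t);
}

/* Build a balanced tree over the keys [lo, hi) just to have something
 * to look up. */
static struct node *
build(long lo, long hi)
{
        struct node *n;
        if (lo >= hi)
                return (NULL);
        n = malloc(sizeof(*n));
        n->key = (lo + hi) / 2;
        n->left = build(lo, n->key);
        n->right = build(n->key + 1, hi);
        return (n);
}

/* Time n lookups, sequential or uniformly random, in ns per lookup. */
static double
lookups(struct node **rootp, long n, int randomize)
{
        struct timespec t0, t1;
        long i;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < n; i++)
                *rootp = splay(randomize ? random() % n : i, *rootp);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / n);
}

int
main(void)
{
        struct node *root;
        long n = 64 * 1024;             /* pretend 64K cached buffers */
        root = build(0, n);
        printf("sequential: %.0f ns/lookup\n", lookups(&root, n, 0));
        printf("random:     %.0f ns/lookup\n", lookups(&root, n, 1));
        return (0);
}
%%%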

FreeBSD-~4.10 is faster.  The difference is especially noticeable when
the read size is the same as the fs block size (16K, as above).  Then
I get the following speeds:

~4.10, random:     580MB/s
~4.10, sequential: 580MB/s
~5.2, random:      575MB/s
~5.2, sequential:  466MB/s

All with kernel profiling not configured, and no INVARIANTS etc.

~5.2 is quite different from -current, but it has buf_splay() and
vm_page_splay(), and behaves similarly in this benchmark.
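
(For reference, MB/s figures like the above come from timing re-reads of
a file that fits in RAM, so everything is served from the buffer/VM
cache.  The following is only a sketch of that kind of benchmark, not
the actual program used in this thread; the usage and sizes are made up.)
%%%
/*
 * readbench.c: a sketch of this kind of cached-read benchmark (not the
 * actual program used in this thread).  Run it twice on a file that fits
 * in RAM, so that the second run measures the cache and not the disk.
 */
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        struct timeval t0, t1;
        double secs;
        off_t blk, filesize, i, nblocks;
        ssize_t n;
        size_t bsize;
        char *buf;
        int fd, randomize;

        if (argc != 4) {
                fprintf(stderr, "usage: readbench file bsize seq|random\n");
                exit(1);
        }
        bsize = atoi(argv[2]);                  /* e.g. 16384 or 4096 */
        randomize = (strcmp(argv[3], "random") == 0);
        if ((fd = open(argv[1], O_RDONLY)) == -1)
                err(1, "%s", argv[1]);
        filesize = lseek(fd, 0, SEEK_END);
        nblocks = filesize / bsize;
        if ((buf = malloc(bsize)) == NULL)
                err(1, "malloc");

        gettimeofday(&t0, NULL);
        for (i = 0; i < nblocks; i++) {
                blk = randomize ? random() % nblocks : i;
                n = pread(fd, buf, bsize, blk * (off_t)bsize);
                if (n != (ssize_t)bsize)
                        errx(1, "short read");
        }
        gettimeofday(&t1, NULL);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.0f MB/s\n", (double)nblocks * bsize / secs / (1024 * 1024));
        return (0);
}
%%%
Run it as e.g. `readbench file 16384 random' after one pass to warm the
cache.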

With profiling ~4.10, read size 16K, sequential (some random-case values
shown in parentheses):
%%%
   %   cumulative   self              self     total
  time   seconds   seconds    calls  ns/call  ns/call  name
  51.1      0.547    0.547    25643    21323    21323  generic_copyout [9]
  17.3      0.732    0.185        0  100.00%           mcount [10]
   7.9      0.817    0.085        0  100.00%           mexitcount [13]
   5.0      0.870    0.053        0  100.00%           cputime [16]
   1.9      0.891    0.020    51207      395      395  gbincore [20]
 						(424 for random)
   1.4      0.906    0.015   102418      150      253  vm_page_wire [18]
 						(322)
   1.3      0.920    0.014   231218       62       62  splvm [23]
   1.3      0.934    0.014    25603      541     2092  allocbuf [15]
 					       (2642)
   1.0      0.945    0.010   566947       18       18  splx <cycle 1> [25]
   1.0      0.955    0.010   102122      100      181  vm_page_unwire [21]
   0.9      0.964    0.009    25606      370      370  pmap_qremove [27]
   0.9      0.973    0.009    25603      359     2127  getnewbuf [14]
 					       (2261)
%%%

There is little difference from -current for the sequential case, but the
old gbincore() and buffer allocation routines are much faster for the
random case.
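
(For contrast, a lookup that hashes (vnode, logical block number) into a
bucket and walks a short chain -- roughly what the pre-splay buffer
cache did -- costs about the same whatever the access pattern is.  The
sketch below uses invented names and is not the ~4.10 source; it only
shows the shape of such a lookup.)
%%%
/*
 * hashlookup.c: illustrative sketch only (invented names, not the old
 * ~4.10 source): a buffer lookup that hashes (vnode, logical block
 * number) into a bucket and walks a short chain.
 */
#include <sys/types.h>
#include <sys/queue.h>

#include <stddef.h>
#include <stdint.h>

#define XB_NBUCKETS     4096                    /* must be a power of 2 */

struct vnode;                                   /* opaque here */

struct xbuf {
        LIST_ENTRY(xbuf) xb_hash;               /* chain within a bucket */
        struct vnode    *xb_vp;
        daddr_t          xb_lblkno;
        /* ... flags, data pointers, ... */
};

static LIST_HEAD(xbufhashhead, xbuf) xbufhashtbl[XB_NBUCKETS];

static struct xbufhashhead *
xbufhash(struct vnode *vp, daddr_t lblkno)
{
        return (&xbufhashtbl[((uintptr_t)vp / 64 + (unsigned long)lblkno) &
            (XB_NBUCKETS - 1)]);
}

/* Return the cached buffer for (vp, lblkno), or NULL.  The cost is one
 * hash plus a short chain walk, independent of the previous lookup. */
struct xbuf *
xbuf_incore(struct vnode *vp, daddr_t lblkno)
{
        struct xbuf *bp;

        LIST_FOREACH(bp, xbufhash(vp, lblkno), xb_hash)
                if (bp->xb_vp == vp && bp->xb_lblkno == lblkno)
                        return (bp);
        return (NULL);
}

int
main(void)
{
        struct vnode *vp = (struct vnode *)0x1000;      /* placeholder */
        struct xbuf b = { .xb_vp = vp, .xb_lblkno = 42 };

        LIST_INSERT_HEAD(xbufhash(vp, 42), &b, xb_hash);
        return (xbuf_incore(vp, 42) == &b ? 0 : 1);
}
%%%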

With profiling ~4.10, read size 4K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.63 seconds

   %   cumulative   self              self     total
  time   seconds   seconds    calls  ns/call  ns/call  name
  27.3      0.720    0.720        0  100.00%           mcount [8]
  22.5      1.312    0.592   102436     5784     5784  generic_copyout [10]
  12.6      1.643    0.331        0  100.00%           mexitcount [13]
   7.9      1.850    0.207        0  100.00%           cputime [15]
   2.9      1.926    0.076   189410      402      402  gbincore [20]
   2.3      1.988    0.061   348029      176      292  vm_page_wire [18]
   2.2      2.045    0.058    87010      662     2500  allocbuf [14]
   2.0      2.099    0.053   783280       68       68  splvm [22]
   1.6      2.142    0.043        0   99.33%           user [24]
   1.6      2.184    0.042  2041759       20       20  splx <cycle 3> [26]
   1.3      2.217    0.034   347298       97      186  vm_page_unwire [21]
   1.2      2.249    0.032    86895      370      370  pmap_qremove [28]
   1.1      2.279    0.029    87006      337     2144  getnewbuf [16]
   0.9      2.303    0.024    86891      280     1617  vfs_vmio_release [17]
%%%

Now the result is little different from -current -- going by the total
time, the random case is almost as slow as in -current, although this
may be an artifact of profiling (allocbuf() takes 2500 nS total per call
in ~4.10 vs 4025 nS in -current).
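
(A back-of-envelope check on how much of that profiled total is the
profiling machinery itself, using the self seconds from the 4K random
profile above:)
%%%
/* Self seconds of the profiling scaffolding in the 2.63s run above. */
#include <stdio.h>

int
main(void)
{
        double total = 2.63;                    /* profiled run time, seconds */
        double overhead = 0.720 + 0.331 + 0.207; /* mcount + mexitcount + cputime */

        printf("%.2fs of %.2fs (%.0f%%) is profiling overhead\n",
            overhead, total, 100 * overhead / total);
        return (0);
}
%%%
That is roughly half of the profiled run, which is part of why the total
times here are only rough guides.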

Bruce


