Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 23 Feb 2006 14:55:34 +0000 (GMT)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        current@FreeBSD.org
Subject:   The sixty second pmc howto
Message-ID:  <20060223143856.O9642@fledge.watson.org>

next in thread | raw e-mail | index | archive | help

For those of you who haven't been using pmc for kernel profiling, you 
definitely ought to consider it.  Joseph Koshy's tools are very easy to use, 
and the results are quite informative.  The output mode to convert pmc samples 
to something gmon understands is particularly neat.

In thirty seconds, here's how to get it working:

(1) Compile your kernel with device hwpmc, options HWPMC_HOOKS.

(2) Run "pmcstat -S instructions -O /tmp/sample.out" to start sampling of
     instruction retirement events, saving the results to /tmp/sample.out.

(3) Exercise your kernel code.

(4) Exit pmcstat -- I hit ctrl-c, but there are probably more mature ways of
     setting this up.

(5) Run "pmcstat -R /tmp/sample.out -k /zoo/tiger-2/boot/kernel/kernel -g" to
     convert the results to gmon output format so it can be processed by gprof.
     Obviously, you need to set the right path to your boot kernel -- by
     default, pmcstat uses /boot/kernel/kernel for kernel sample results.

(6) View the results using gprof, "gprof /zoo/tiger-2/boot/kernel/kernel
     p4-instr-retired/gmon.out".  Again, update the path as needed.

Since there is no call graph information in the sample, the first few pages of 
gprof output will be of limited utility, but the summary table by function is 
the bit I found most useful:

   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  13.6    7251.00  7251.00        0  100.00%           _mtx_lock_sleep [1]
   3.7    9213.00  1962.00        0  100.00%           _mtx_lock_spin [2]
   3.3   10956.00  1743.00        0  100.00%           bus_dmamap_load_mbuf_sg 
[3]
   2.7   12370.00  1414.00        0  100.00%           tcp_input [4]
   2.6   13781.00  1411.00        0  100.00%           tcp_output [5]
   2.6   15172.00  1391.00        0  100.00%           spinlock_exit [6]
   2.5   16496.00  1324.00        0  100.00%           uma_zalloc_arg [7]
   2.0   17555.00  1059.00        0  100.00%           spinlock_enter [8]
   1.9   18553.00   998.00        0  100.00%           uma_zfree_arg [9]
   1.6   19409.00   856.00        0  100.00%           em_rxeof [10]
   1.6   20260.00   851.00        0  100.00%           sleepq_signal [11]
   1.5   21071.00   811.00        0  100.00%           em_get_buf [12]
   1.4   21821.00   750.00        0  100.00%           rn_match [13]
   1.3   22531.00   710.00        0  100.00%           ether_demux [14]
   1.3   23222.00   691.00        0  100.00%           ip_output [15]
...

In this sample, there's a mutex contention problem, which shows up clearly. 
Without the call graph, there's more work yet to do to identify the source of 
contention, but the fact that pmc exhibits a relatively low overhead, not to 
mention higher accuracy, than the existing kernel profiling is great.  And, 
this isn't limited to instruction retirement.  You can also profile by cache 
line miss, mis-predicted branches, etc.  Here's an excerpt from the resource 
stall sample on the same workload:

   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
   3.9    3225.00  3225.00        0  100.00%           m_freem [1]
   3.7    6253.00  3028.00        0  100.00%           ether_input [2]
   3.5    9116.00  2863.00        0  100.00%           cpu_switch [3]
   2.9   11470.00  2354.00        0  100.00%           mb_ctor_pack [4]
   2.7   13681.00  2211.00        0  100.00%           uma_zalloc_arg [5]
   2.7   15882.00  2201.00        0  100.00%           em_rxeof [6]
   2.6   18045.00  2163.00        0  100.00%           em_intr_fast [7]
   2.6   20148.00  2103.00        0  100.00% 
intr_event_schedule_thread
  [8]
   2.2   21941.00  1793.00        0  100.00%           if_handoff [9]
   2.2   23727.00  1786.00        0  100.00%           sleepq_signal [10]
   1.8   25222.00  1495.00        0  100.00%           _mtx_lock_sleep [11]
   1.7   26612.00  1390.00        0  100.00%           tcp_output [12]
   1.7   27997.00  1385.00        0  100.00%           bus_dmamap_load_mbuf_sg 
[13]
   1.6   29309.00  1312.00        0  100.00%           uma_zfree_arg [14]

So if you're doing kernel performance work, and not already using pmc, you 
probably should be.

Robert N M Watson



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20060223143856.O9642>