Date: Sat, 2 Feb 2008 09:53:31 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin <mav@freebsd.org>
Cc: freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Robert Watson <rwatson@freebsd.org>
Subject: Re: Memory allocation performance
Message-ID: <20080202080617.U56760@delplex.bde.org>
In-Reply-To: <47A37E14.7050801@FreeBSD.org>
References: <47A25412.3010301@FreeBSD.org> <47A25A0D.2080508@elischer.org> <47A2C2A2.5040109@FreeBSD.org> <20080201185435.X88034@fledge.watson.org> <47A37E14.7050801@FreeBSD.org>
On Fri, 1 Feb 2008, Alexander Motin wrote:

> Robert Watson wrote:
>> It would be very helpful if you could try doing some analysis with hwpmc --
>> "high resolution profiling" is of increasingly limited utility with modern

You mean "of increasingly greater utility with modern CPUs".  Low
resolution kernel profiling stopped giving enough resolution in about
1990, and has become of increasingly limited utility since then, but
high resolution kernel profiling uses the TSC or possibly a performance
counter, so it has kept up with CPU speedups.  Cache effects and out of
order execution are larger now, but they affect all types of profiling
and are still not too bad with high resolution kernel profiling.  High
resolution kernel profiling doesn't really work with SMP, but that is
not Alexander's problem since he profiled under UP.

>> CPUs, where even a high frequency timer won't run very often.  It's also
>> quite subject to cycle events that align with other timers in the system.

No, it isn't affected by either of these.  The TSC is incremented on
every CPU cycle and the performance counters are incremented on every
event.  It is completely unaffected by other timers.

> I have tried hwpmc but still not completely friendly with it.  Whole
> picture is somewhat alike to kgmon's, but it looks very noisy.  Is there
> some "know how" about how to use it better?

hwpmc doesn't work for me either.  I can't see how it could work as
well as high resolution kernel profiling for events at the single
function level, since it is statistics-based.  With the statistics
clock interrupt rate fairly limited, it just cannot get enough
resolution over short runs.  Also, it works poorly for me (with a
current kernel and ~5.2 userland except for some utilities like pmc*).
Generation of profiles stopped working for me, and it often fails with
allocation errors.

> I have tried it for measuring number of instructions.
> But I am in doubt that instructions is a correct counter for
> performance measurement as different instructions may have very
> different execution times depending on many reasons, like cache misses
> and current memory traffic.

Cycle counts are more useful, but high resolution kernel profiling can
do this too, with some fixes:

- update perfmon for newer CPUs.  It is broken even for Athlons (takes
  a 2 line fix, or more lines with proper #defines and if()s).
- select the performance counter to be used for profiling using
  sysctl machdep.cputime_clock=$((5 + N)), where N is the number of the
  performance counter for the instruction count (or any).  I use hwpmc
  mainly to determine N :-).  It may also be necessary to change the
  kernel variable cpu_clock_pmc_conf.  Configuration of this is
  unfinished.
- use high resolution kernel profiling normally.

Note that switching to a perfmon counter is only permitted if !SMP
(since it is too unsupported under SMP to do more than panic if
permitted under SMP), and that the switch loses the calibration of
profiling.  Profiling normally compensates for overheads of the
profiling itself, and the compensation would work almost perfectly for
event counters, unlike for time-related counters, since the extra
events for profiling aren't much affected by caches.

> I have tried to use tsc to count CPU cycles, but got the error:
>
> # pmcstat -n 10000 -S "tsc" -O sample.out
> pmcstat: ERROR: Cannot allocate system-mode pmc with specification
> "tsc": Operation not supported
>
> What have I missed?

This might be just because the TSC really is not supported.  Many
things require an APIC for hwpmc to support them.  I get allocation
errors like this for operations that work a few times before failing.

> I am now using Pentium4 Prescott CPU with HTT enabled in BIOS, but
> kernel built without SMP to simplify profiling.  What counters can you
> recommend me to use on it for regular time profiling?

Try them all :-).
From userland first with an overall count, since looking at the
details in gprof output takes too long (and doesn't work for me with
hwpmc anyway).  I use scripts like the following to try them all from
userland:

runpm:
%%%
c="ttcp -n100000 -l5 -u -t epsplex"
ctr=0
while test $ctr -lt 256
do
	ctr1=$(printf "0x%02x\n" $ctr)
	case $ctr1 in
	0x00) src=k8-fp-dispatched-fpu-ops;;
	0x01) src=k8-fp-cycles-with-no-fpu-ops-retired;;
	0x02) src=k8-fp-dispatched-fpu-fast-flag-ops;;
	0x05) src=k8-fp-unknown-$ctr1;;
	0x09) src=k8-fp-unknown-$ctr1;;
	0x0d) src=k8-fp-unknown-$ctr1;;
	0x11) src=k8-fp-unknown-$ctr1;;
	0x15) src=k8-fp-unknown-$ctr1;;
	0x19) src=k8-fp-unknown-$ctr1;;
	0x1d) src=k8-fp-unknown-$ctr1;;
	0x20) src=k8-ls-segment-register-load;;		# XXX
	0x21) src=kx-ls-microarchitectural-resync-by-self-mod-code;;
	0x22) src=k8-ls-microarchitectural-resync-by-snoop;;
	0x23) src=kx-ls-buffer2-full;;
	0x24) src=k8-ls-locked-operation;;		# XXX
	0x25) src=k8-ls-microarchitectural-late-cancel;;
	0x26) src=kx-ls-retired-cflush-instructions;;
	0x27) src=kx-ls-retired-cpuid-instructions;;
	0x2a) src=kx-ls-unknown-$ctr1;;
	0x2b) src=kx-ls-unknown-$ctr1;;
	0x2e) src=k7-ls-unknown-$ctr1;;
	0x2f) src=k7-ls-unknown-$ctr1;;
	0x32) src=kx-ls-unknown-$ctr1;;
	0x33) src=kx-ls-unknown-$ctr1;;
	0x36) src=k7-ls-unknown-$ctr1;;
	0x37) src=k7-ls-unknown-$ctr1;;
	0x3a) src=kx-ls-unknown-$ctr1;;
	0x3b) src=kx-ls-unknown-$ctr1;;
	0x3e) src=k7-ls-unknown-$ctr1;;
	0x3f) src=k7-ls-unknown-$ctr1;;
	0x40) src=kx-dc-accesses;;
	0x41) src=kx-dc-misses;;
	0x42) src=kx-dc-refills-from-l2;;		# XXX
	0x43) src=kx-dc-refills-from-system;;		# XXX
	0x44) src=kx-dc-writebacks;;			# XXX
	0x45) src=kx-dc-l1-dtlb-miss-and-l2-dtlb-hits;;
	0x46) src=kx-dc-l1-and-l2-dtlb-misses;;
	0x47) src=kx-dc-misaligned-references;;
	0x48) src=kx-dc-microarchitectural-late-cancel-of-an-access;;
	0x49) src=kx-dc-microarchitectural-early-cancel-of-an-access;;
	0x4a) src=k8-dc-one-bit-ecc-error;;
	0x4b) src=k8-dc-dispatched-prefetch-instructions;;
	0x4c) src=k8-dc-dcache-accesses-by-locks;;
	0x4d) src=k7-dc-unknown-$ctr1;;
	0x4e) src=k7-dc-unknown-$ctr1;;
	0x4f) src=k7-dc-unknown-$ctr1;;
	0x50) src=kx-dc-unknown-$ctr1;;
	0x51) src=kx-dc-unknown-$ctr1;;
	0x55) src=kx-dc-unknown-$ctr1;;
	0x56) src=kx-dc-unknown-$ctr1;;
	0x57) src=kx-dc-unknown-$ctr1;;
	0x58) src=kx-dc-unknown-$ctr1;;
	0x59) src=kx-dc-unknown-$ctr1;;
	0x5d) src=k7-dc-unknown-$ctr1;;
	0x5e) src=k7-dc-unknown-$ctr1;;
	0x5f) src=k7-dc-unknown-$ctr1;;
	0x64) src=kx-bu-unknown-$ctr1;;
	0x68) src=kx-bu-unknown-$ctr1;;
	0x69) src=kx-bu-unknown-$ctr1;;
	0x76) src=kx-bu-cpu-clk-unhalted;;
	0x79) src=k8-bu-unknown-$ctr1;;
	0x7d) src=k8-bu-internal-l2-request;;		# XXX
	0x7e) src=k8-bu-fill-request-l2-miss;;		# XXX
	0x7f) src=k8-bu-fill-into-l2;;			# XXX
	0x80) src=kx-ic-fetches;;
	0x81) src=kx-ic-misses;;
	0x82) src=kx-ic-refill-from-l2;;
	0x83) src=kx-ic-refill-from-system;;
	0x84) src=kx-ic-l1-itlb-misses;;
	0x85) src=kx-ic-l1-l2-itlb-misses;;
	0x86) src=k8-ic-microarchitectural-resync-by-snoop;;
	0x87) src=kx-ic-instruction-fetch-stall;;
	0x88) src=kx-ic-return-stack-hit;;
	0x89) src=kx-ic-return-stack-overflow;;
	0xc0) src=kx-fr-retired-instructions;;
	0xc1) src=kx-fr-retired-ops;;
	0xc2) src=kx-fr-retired-branches;;
	0xc3) src=kx-fr-retired-branches-mispredicted;;
	0xc4) src=kx-fr-retired-taken-branches;;
	0xc5) src=kx-fr-retired-taken-branches-mispredicted;;
	0xc6) src=kx-fr-retired-far-control-transfers;;
	0xc7) src=kx-fr-retired-resync-branches;;
	0xc8) src=kx-fr-retired-near-returns;;
	0xc9) src=kx-fr-retired-near-returns-mispredicted;;
	0xca) src=kx-fr-retired-taken-branches-mispred-by-addr-miscompare;;
	0xcb) src=k8-fr-retired-fpu-instructions;;
	0xcc) src=k8-fr-retired-fastpath-double-op-instructions;;
	0xcd) src=kx-fr-interrupts-masked-cycles;;
	0xce) src=kx-fr-interrupts-masked-while-pending-cycles;;
	0xcf) src=kx-fr-hardware-interrupts;;
	0xd0) src=kx-fr-decoder-empty;;
	0xd1) src=kx-fr-dispatch-stalls;;
	0xd2) src=kx-fr-dispatch-stall-from-branch-abort-to-retire;;
	0xd3) src=kx-fr-dispatch-stall-for-serialization;;
	0xd4) src=kx-fr-dispatch-stall-for-segment-load;;
	0xd5) src=kx-fr-dispatch-stall-when-reorder-buffer-is-full;;
	0xd6) src=kx-fr-dispatch-stall-when-reservation-stations-are-full;;
	0xd7) src=kx-fr-dispatch-stall-when-fpu-is-full;;
	0xd8) src=kx-fr-dispatch-stall-when-ls-is-full;;
	0xd9) src=kx-fr-dispatch-stall-when-waiting-for-all-to-be-quiet;;
	0xda) src=kx-fr-dispatch-stall-when-far-xfer-or-resync-br-pending;;
	0xdb) src=k8-fr-fpu-exceptions;;
	0xdc) src=k8-fr-number-of-breakpoints-for-dr0;;
	0xdd) src=k8-fr-number-of-breakpoints-for-dr1;;
	0xde) src=k8-fr-number-of-breakpoints-for-dr2;;
	0xdf) src=k8-fr-number-of-breakpoints-for-dr3;;
	0xe0) src=k8-nb-memory-controller-page-access-event;;
	0xe1) src=k8-nb-memory-controller-page-table-overflow;;
	0xe2) src=k8-nb-memory-controller-dram-slots-missed;;
	0xe3) src=k8-nb-memory-controller-turnaround;;
	0xe4) src=k8-nb-memory-controller-bypass-saturation;;	# XXX
	0xe5) src=k8-nb-sized-commands;;			# XXX
	0xec) src=k8-nb-probe-result;;				# XXX
	0xf6) src=k8-nb-ht-bus0-bandwidth;;
	0xf7) src=k8-nb-ht-bus1-bandwidth;;
	0xf8) src=k8-nb-ht-bus2-bandwidth;;
	0xfc) src=k8-nb-unknown-$ctr1;;
	*) src=very-unknown-$ctr1;;
	esac
	case $src in
	k8-*) ctr=$(($ctr + 1)); continue;;
	*unknown-*) ctr=$(($ctr + 1)); continue;;
	esac
	echo "# s/$src "
	perfmon -c "$c" -ou -l 1 $ctr |
	    egrep -v '(^total: |^mean: |^clocks \(at)' | sed -e 's/1: //'
	ctr=$(($ctr + 1))
done
%%%

runpmc:
%%%
for i in \
	k8-fp-dispatched-fpu-ops \
	k8-fp-cycles-with-no-fpu-ops-retired \
	k8-fp-dispatched-fpu-fast-flag-ops \
	k8-ls-segment-register-load \
	k8-ls-microarchitectural-resync-by-self-modifying-code \
	k8-ls-microarchitectural-resync-by-snoop \
	k8-ls-buffer2-full \
	k8-ls-locked-operation \
	k8-ls-microarchitectural-late-cancel \
	k8-ls-retired-cflush-instructions \
	k8-ls-retired-cpuid-instructions \
	k8-dc-access \
	k8-dc-miss \
	k8-dc-refill-from-l2 \
	k8-dc-refill-from-system \
	k8-dc-copyback \
	k8-dc-l1-dtlb-miss-and-l2-dtlb-hit \
	k8-dc-l1-dtlb-miss-and-l2-dtlb-miss \
	k8-dc-misaligned-data-reference \
	k8-dc-microarchitectural-late-cancel-of-an-access \
	k8-dc-microarchitectural-early-cancel-of-an-access \
	k8-dc-one-bit-ecc-error \
	k8-dc-dispatched-prefetch-instructions \
	k8-dc-dcache-accesses-by-locks \
	k8-bu-cpu-clk-unhalted \
	k8-bu-internal-l2-request \
	k8-bu-fill-request-l2-miss \
	k8-bu-fill-into-l2 \
	k8-ic-fetch \
	k8-ic-miss \
	k8-ic-refill-from-l2 \
	k8-ic-refill-from-system \
	k8-ic-l1-itlb-miss-and-l2-itlb-hit \
	k8-ic-l1-itlb-miss-and-l2-itlb-miss \
	k8-ic-microarchitectural-resync-by-snoop \
	k8-ic-instruction-fetch-stall \
	k8-ic-return-stack-hit \
	k8-ic-return-stack-overflow \
	k8-fr-retired-x86-instructions \
	k8-fr-retired-uops \
	k8-fr-retired-branches \
	k8-fr-retired-branches-mispredicted \
	k8-fr-retired-taken-branches \
	k8-fr-retired-taken-branches-mispredicted \
	k8-fr-retired-far-control-transfers \
	k8-fr-retired-resyncs \
	k8-fr-retired-near-returns \
	k8-fr-retired-near-returns-mispredicted \
	k8-fr-retired-taken-branches-mispredicted-by-addr-miscompare \
	k8-fr-retired-fpu-instructions \
	k8-fr-retired-fastpath-double-op-instructions \
	k8-fr-interrupts-masked-cycles \
	k8-fr-interrupts-masked-while-pending-cycles \
	k8-fr-taken-hardware-interrupts \
	k8-fr-decoder-empty \
	k8-fr-dispatch-stalls \
	k8-fr-dispatch-stall-from-branch-abort-to-retire \
	k8-fr-dispatch-stall-for-serialization \
	k8-fr-dispatch-stall-for-segment-load \
	k8-fr-dispatch-stall-when-reorder-buffer-is-full \
	k8-fr-dispatch-stall-when-reservation-stations-are-full \
	k8-fr-dispatch-stall-when-fpu-is-full \
	k8-fr-dispatch-stall-when-ls-is-full \
	k8-fr-dispatch-stall-when-waiting-for-all-to-be-quiet \
	k8-fr-dispatch-stall-when-far-xfer-or-resync-branch-pending \
	k8-fr-fpu-exceptions \
	k8-fr-number-of-breakpoints-for-dr0 \
	k8-fr-number-of-breakpoints-for-dr1 \
	k8-fr-number-of-breakpoints-for-dr2 \
	k8-fr-number-of-breakpoints-for-dr3 \
	k8-nb-memory-controller-page-table-overflow \
	k8-nb-memory-controller-dram-slots-missed \
	k8-nb-memory-controller-bypass-saturation \
	k8-nb-sized-commands \
	k8-nb-probe-result
do
	pmcstat -s $i sleep 1 2>&1 | sed -e 's/^ *//' -e 's/  */ /' \
	    -e 's/ *$//' -e 's/\/00\/k8-/\/k8-/'
done
%%%

runpmc7:
%%%
for i in \
	k7-dc-accesses \
	k7-dc-misses \
	k7-dc-refills-from-l2 \
	k7-dc-refills-from-system \
	k7-dc-writebacks \
	k7-l1-dtlb-miss-and-l2-dtlb-hits \
	k7-l1-and-l2-dtlb-misses \
	k7-misaligned-references \
	k7-ic-fetches \
	k7-ic-misses \
	k7-l1-itlb-misses \
	k7-l1-l2-itlb-misses \
	k7-retired-instructions \
	k7-retired-ops \
	k7-retired-branches \
	k7-retired-branches-mispredicted \
	k7-retired-taken-branches \
	k7-retired-taken-branches-mispredicted \
	k7-retired-far-control-transfers \
	k7-retired-resync-branches \
	k7-interrupts-masked-cycles \
	k7-interrupts-masked-while-pending-cycles \
	k7-hardware-interrupts
do
	pmcstat -s $i sleep 1 2>&1 | sed -e 's/^ *//' -e 's/  */ /' \
	    -e 's/ *$//' -e 's/k7/kx/'
done
%%%

"runpm" tries up to all 256 performance counters, with names like the
hwpmc ones.  k7 means AthlonXP only; k8 means Athlon64 only; kx means
both, but many kx's don't really work or are not documented for both.
A few like kx-fr-retired-near-returns-mispredicted (?) are not
documented for AXP but seem to work and are useful.  runpmc tries the
documented A64 counters.  runpmc7 tries the documented AXP counters.
hwpmc is less useful than perfmon here since it doesn't support using
the undocumented counters.

There is a pmc* option that prints all the counters in the above
lists.  I checked that they are almost precisely the documented (in
Athlon optimization manuals) ones.  Names are unfortunately
inconsistent between k7 and k8 in some cases, following
inconsistencies in the documentation.  I don't know anything about
Pentium counters except what is in source code.

gprof output for the mumble perfmon counter (kx-dc-misses?)
while sending 100000 tiny packets using ttcp -t looks like this (after
fixing the calibration):
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.81 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 11.0      0.308     0.308   100083        3       24  syscall [3]
 10.8      0.613     0.305   200012        2        2  rn_match [16]
  4.4      0.738     0.125   100019        1        1  _bus_dmamap_load_buffer [25]
  4.3      0.859     0.121   300107        0        0  generic_copyin [27]
  4.0      0.973     0.114   100006        1        9  ip_output [10]
  3.8      1.079     0.106   100006        1        4  ether_output [12]
  3.7      1.182     0.103   100007        1        1  fgetsock [30]
  3.6      1.284     0.102   100006        1        2  bus_dmamap_load_mbuf [21]
  3.6      1.385     0.101   200012        1        3  rtalloc_ign [11]
  3.6      1.486     0.101   100083        1       25  Xint0x80_syscall [2]
  3.6      1.587     0.101   200012        1        1  in_clsroute [32]
  3.6      1.688     0.101   100006        1       20  sendto [4]
  3.6      1.789     0.101   100008        1        1  in_pcblookup_hash [33]
  3.6      1.890     0.101   100006        1       16  kern_sendit [6]
  3.6      1.990     0.100   200012        1        2  in_matroute [15]
  3.6      2.091     0.100   100748        1        1  doreti [35]
  3.6      2.191     0.100   100007        1        2  getsockaddr [22]
%%%

I would like to be able to do this with hwpmc but don't see how it
can.  Only (non-statistical) counting at every function call and
return can give precise counts like the above.  However,
non-statistical counting is better for some things.

Back to the original problem.  Uma allocation overhead shows up in TSC
profiles of ttcp, but is just one of too many things that take a while.
There are many function calls, each taking about 1%:

% granularity: each sample hit covers 16 byte(s) for 0.00% of 0.86 seconds
%
%   %   cumulative   self               self     total
%  time   seconds   seconds    calls  ns/call  ns/call  name
%  44.9      0.388     0.388        0  100.00%          mcount [1]
%  20.9      0.569     0.180        0  100.00%          mexitcount [7]
%   8.0      0.638     0.069        0  100.00%          cputime [14]
%   1.8      0.654     0.016        0  100.00%          user [25]
%   1.6      0.668     0.014   100006      143     1051  udp_output [12]
%   1.5      0.681     0.013   100006      133      704  ip_output [13]
%   1.3      0.692     0.011   300120       37       37  copyin [30]
%   1.2      0.702     0.010   100006      100     1360  sosend_dgram [10]
%   0.9      0.710     0.008   200012       39       39  rn_match [38]
%   0.9      0.718     0.007   300034       25       25  memcpy [39]
%   0.8      0.725     0.007   200103       36       58  uma_zalloc_arg [29]
%   0.8      0.732     0.007   100090       68     1977  syscall [4]

All the times seem reasonable.  Without profiling, sendto() and
overheads take about 1700 nsec in -current and about 1600 nsec in my
version of 5.2.  (This is for -current.  The 100 nsec difference is
very hard to understand in detail.)  With high resolution kernel
profiling, sendto() and overheads take about 8600 nsec here.
Profiling has subtracted its own overheads, and the result of 1977
nsec for syscall is consistent with syscall() taking a bit less than
1700 nsec when not looked at.  (Profiling only subtracts its best-case
overheads.  Its runtime overheads must be larger due to cache effects,
and if these are very large then we cannot trust the compensation.
Since it compensated from 8600 down to about 1977, it has clearly done
the compensation almost right.  The compensation is delicate when
there are a lot of functions taking ~20 nsec, since the profiling
overhead per function call is 82 nsec.)
%   0.8      0.738     0.007   200012       33       84  rtalloc1 [24]
%   0.8      0.745     0.006   100006       65      139  bge_encap [26]
%   0.7      0.751     0.006   100006       62      201  bge_start_locked [20]
%   0.6      0.757     0.006   200075       28       28  bzero [44]
%   0.6      0.761     0.005   100006       48      467  ether_output [15]
%   0.6      0.766     0.005   100006       48      192  m_uiotombuf [22]
%   0.5      0.771     0.005   200100       23       45  uma_zfree_arg [33]
%   0.5      0.775     0.005   100006       46       46  bus_dmamap_load_mbuf_sg [45]
%   0.5      0.780     0.004   100028       45      132  malloc [27]
%   0.5      0.784     0.004   200012       20      104  rtalloc_ign [19]

This is hard to optimize.  uma has shown up as taking 58 nsec for
uma_zalloc_arg() (including what it calls) and 45 nsec for
uma_zfree_arg().  This is on a 2.2GHz A64.  Anything larger than that
might be a profiling error.  But the allocations here are tiny --
maybe large allocations cause cache misses.

I was able to optimize away most of the allocation overheads in
sendto() by allocating the sockaddr on the stack, but this made little
difference overall.  (It reduces dynamic allocations per packet from 2
to 1.  Both allocations use malloc(), so they are a bit slower than
pure uma.  BTW, switching from mbuf-based allocation to malloc() in
getsockaddr() etc. long ago cost 10 usec on a P1/133.  A loss of 10000
nsec makes the overhead of 200 nsec for malloc now look tiny.)
Remember I said that differences of 100 nsec are hard to understand?
It is also not easy to understand why eliminating potentially 100 nsec
of malloc() overhead makes almost no difference overall.  The 100 nsec
gets distributed differently, or maybe the profiling really was wrong
for the malloc() part.

Reads of the TSC are executed possibly-out-of-order on some CPUs.
This doesn't seem to have much effect on the accuracy of high
resolution (TSC) kernel profiling on at least Athlons.  rdtsc takes
only 12 cycles on AXP-A64.  I think it takes much longer on Pentiums.
On Phenom it takes ~42 cycles (pessimized to share it across CPUs :-().
With it taking much longer than the functions that it profiles, the
compensation might become too fragile.

Bruce