Date: Mon, 24 May 2004 15:05:21 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Nate Lawson <nate@root.org>
Cc: cvs-all@freebsd.org
Subject: Re: cvs commit: src/sys/conf files.amd64
Message-ID: <20040524141152.V23484@gamplex.bde.org>
In-Reply-To: <20040523205250.T66525@root.org>
References: <20040523183858.7C3DA16A557@hub.freebsd.org> <20040523205250.T66525@root.org>
On Sun, 23 May 2004, Nate Lawson wrote:

> On Sun, 23 May 2004, Bruce Evans wrote:
> > The SMP case hasn't been tested.  The high resolution subcase of this
> > uses the i8254, and as on i386's, the locking for this is deficient
> > and the i8254 is too inefficient.  The acpi timer is also too
> > inefficient.
>
> The ACPI timer is significantly better than it used to be.  It is
> currently just a bus_space_read, which maps directly to inl.  The fact
> that IO ports are slow is inescapable.

Actually, its speed hasn't changed, since it is the hardware speed that
I mean.  It is about 200 times as slow as rdtsc() on my amd64 test
system.  Here is sample profiling output which shows the problem
indirectly; the output was produced by:

%%%
sync
kgmon -rB
time ping -fq -c100000 localhost
kgmon -hp
gprof -u `sysctl -n kern.bootfile` gmon.out >zp
%%%

%%%
...
granularity: each sample hit covers 16 byte(s) for 0.00% of 5.40 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 40.4      2.181     2.181                            mcount [1]
 15.3      3.009     0.828   814149     1017     1017 acpi_timer_read [7]
 14.7      3.802     0.793                            mexitcount [9]
  5.1      4.076     0.275                            cputime [22]
  2.3      4.202     0.125                            user [42]
  1.9      4.307     0.105   408446      258      258 cpu_switch [44]
  0.9      4.358     0.051       76   671400   671400 acpi_cpu_c1 [51]
  0.9      4.406     0.048   200007      242     1569 ip_output [21]
  0.8      4.450     0.043   600161       72     1831 syscall [4]
  0.7      4.490     0.040  1958847       20       20 critical_enter [52]
  0.6      4.522     0.033  1300188       25       25 bzero [56]
  0.5      4.552     0.030   600108       49       49 copyin [58]
  0.5      4.580     0.029   800106       36       36 copyout [59]
  0.5      4.608     0.027  1958847       14       14 critical_exit [61]
  0.5      4.632     0.025   408453       61     1495 mi_switch [11]
  0.4      4.656     0.024   200007      119     2043 ip_input [17]
...
%%%

This was made using the TSC clock for high resolution profiling
(sysctl machdep.cputime_clock=2).  It shows that acpi_timer_read()
takes about 40% of the non-profiling time for "ping -fq localhost"
(mcounting overheads = 40.4 + 14.7 + 5.1 = 60.2%, acpi_timer_read =
15.3%, other = 24.6%, and 15.3 / (15.3 + 24.6) is about 38%).
It also shows that there are 8 timecounter calls per packet; 4 are for
4 context switches and 4 are for timestamps made by ping(8) (one to
put in the packet and 3 to keep track of the time).  The system is
configured with acpi, so it chooses an ACPI timecounter (ACPI-fast) as
the highest quality, but in fact the ACPI timecounter has much lower
quality than the TSC (it is about 100 times slower and 100 times less
precise, and no more stable, since this is a desktop system which
never suspends or throttles the CPU if I can help it).  Switching to
the TSC timecounter gives for the above:

%%%
...
granularity: each sample hit covers 16 byte(s) for 0.00% of 4.49 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 47.3      2.123     2.123                            mcount [1]
 17.2      2.895     0.772                            mexitcount [2]
  6.0      3.162     0.267                            cputime [13]
  2.8      3.289     0.126                            user [23]
  2.4      3.394     0.106   406425      260      260 cpu_switch [27]
  1.1      3.445     0.051   200006      255     1129 ip_output [15]
  0.9      3.487     0.041   600153       69     1155 syscall [5]
  0.9      3.525     0.039  1301899       30       30 bzero [41]
  0.8      3.561     0.036  1950822       18       18 critical_enter [44]
  0.7      3.594     0.033  1950822       17       17 critical_exit [46]
  0.7      3.626     0.032   200003      160      160 fgetsock [47]
  0.7      3.658     0.032   800100       39       39 copyout [48]
  0.6      3.687     0.029   600099       48       48 copyin [51]
  0.6      3.714     0.027   406429       67      474 mi_switch [16]
  0.5      3.738     0.024   300048       80       80 mb_free [58]
  0.5      3.761     0.023   200006      114     1353 ip_input [12]
  0.5      3.783     0.022   100004      221      743 soreceive [36]
  0.5      3.804     0.021   100018      214      678 kern_select [38]
  0.4      3.823     0.018   700032       26       26 in_cksumdata [65]
  0.4      3.841     0.018   200000       91     1196 icmp_input [14]
...
  0.2      4.184     0.009   811079       11       11 tsc_get_timecount [93]
...
%%%

This says among other things that tsc_get_timecount() takes 11 ns.  It
really does take about 11 ns when not profiled (5.5 ns for rdtsc at
2GHz + function call overheads), so I'm confident that the adjustments
for mcounting overheads are accurate to within 10-100 ns.
So even with 8 tsc_get_timecount() calls per packet, the timecounter
part of the overhead is small.  Also, cputime() is basically just
rdtsc(), so using it doesn't take too long (6%).  If cputime() were
based on the ACPI timer, then it would want to take 600% of the CPU.
It couldn't do that of course, so running the profile would take about
50 times as long as without high resolution profiling, instead of only
about 4-5 times as long.  Low resolution profiling takes about 3 times
as long.

Bruce