From owner-freebsd-performance@FreeBSD.ORG Mon Jun 2 19:05:42 2008 Return-Path: Delivered-To: freebsd-performance@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B283D1065670 for ; Mon, 2 Jun 2008 19:05:42 +0000 (UTC) (envelope-from sean@chittenden.org) Received: from davie.textdrive.com (davie.textdrive.com [207.7.108.101]) by mx1.freebsd.org (Postfix) with ESMTP id AD97E8FC0C for ; Mon, 2 Jun 2008 19:05:42 +0000 (UTC) (envelope-from sean@chittenden.org) Received: from [192.168.255.187] (dsl093-167-029.sfo1.dsl.speakeasy.net [66.93.167.29]) by davie.textdrive.com (Postfix) with ESMTP id 8950DC2E09; Mon, 2 Jun 2008 19:05:41 +0000 (GMT) Message-Id: From: Sean Chittenden To: Bruce Evans In-Reply-To: <20080602182214.I2764@delplex.bde.org> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v924) Date: Mon, 2 Jun 2008 12:05:39 -0700 References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> <20080602182214.I2764@delplex.bde.org> X-Mailer: Apple Mail (2.924) Cc: freebsd-performance@FreeBSD.org Subject: Re: Micro-benchmark for various time syscalls... X-BeenThere: freebsd-performance@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Performance/tuning List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 02 Jun 2008 19:05:42 -0000 >> I wrote a small micro-benchmark utility[1] to test various time >> syscalls and the results were a bit surprising to me. The results >> were from a UP machine and I believe that the difference between >> gettimeofday(2) and clock_gettime(CLOCK_REALTIME_FAST) would've >> been bigger on an SMP system and performance would've degraded >> further with each additional core. > > I wouldn't expect SMP to make much difference between CLOCK_REALTIME > and > CLOCK_REALTIME_FAST. The only difference is that the former calls > nanotime() where the latter calls getnanotime(). nanotime() always > does > more, but it doesn't have any extra SMP overheads in most cases (in > rare > cases like i386 using the i8254 timecounter, it needs to lock > accesses to > the timecounter hardware). gettimeofday() always does more than > CLOCK_REALTIME, but again no more for SMP. You may be right, I can only speculate. Going off of phk@'s rhetorical questions regarding gettimeofday(2) working across cores/ threads, I assumed there would be some kind of synchronization. http://lists.freebsd.org/mailman/htdig/freebsd-current/2005-October/057280.html >> clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for >> most authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- >> 10ms of CLOCK_REALTIME's value[2]). In fact, I'd assume that >> CLOCK_REALTIME_FAST is just as accurate as Linux's gettimeofday(2) >> (a statement I can't back up, but believe is likely to be correct) >> and therefore there isn't much harm (if any) in seeing >> clock_gettime(2) + CLOCK_REALTIME_FAST receive more widespread use >> vs. gettimeofday(2). FYI. -sc > > The existence of most of CLOCK_* is a bug. I wouldn't use > CLOCK_REALTIME_FAST > for anything (if only because it doesn't exist in most kernels that I > run. I think that's debatable, actually. I modified my little micro- benchmark program to test the realtime values returned from each execution and found that CLOCK_REALTIME_FAST likely updates itself sufficiently frequently for most applications (not all, but most). My test ensures that time doesn't go backwards and tally's the number of times that the values are identical. It'd be nice of CLOCK_REALTIME_FAST incremented by a small and reasonable fudge factor every time it's invoked that way the values aren't identical. On my machine, I can make 100K gettimeofday(2) calls compared to 3M CLOCK_REALTIME_FAST calls, which is a significantly large delta when you're aiming for software that's handling around ~40-50Kpps and want to include time information periodically (see above comment about a fudge factor being included after every call *grin* ). http://sean.chittenden.org/pubfiles/freebsd/bench_clock_realtime.c % ./bench_clock_realtime 9079882 | sort -rnk1 clock realtime micro-benchmark. 9079882 syscall iterations. Avg. us/call Elapsed Name 9.317078 84.597968 gettimeofday(2) 8.960372 81.359120 time(3) 8.776467 79.689287 clock_gettime(2/CLOCK_REALTIME) 0.332357 3.017763 clock_gettime(2/CLOCK_REALTIME_FAST) 0.311705 2.830246 clock_gettime(2/CLOCK_SECOND) Value from time(3): 1212427374 Last value from gettimeofday(2): 1212427293.590511 Equal: 0 Last value from clock_gettime(2/CLOCK_SECOND): 1212427460.000000000 Equal: 9079878 Last value from clock_gettime(2/CLOCK_REALTIME_FAST): 1212427457.656410126 Equal: 9078198 Last value from clock_gettime(2/CLOCK_REALTIME): 1212427454.639076390 Equal: 0 % irb >> tot = 9079882 => 9079882 >> eq = 9078198 => 9078198 >> tot - eq => 1684 >> time = 3.017763 => 3.017763 >> (tot - eq) / time => 558.029242190324 >> tot / time => 3008812.15655437 # number of CLOCK_REALTIME_FAST calls per second >> tot / 84.597968 => 107329.788346689 # number of gettimeofday(2) calls per second > I switched from using gettimeofday() to CLOCK_REALTIME many years > ago when syscalls started taking less than 1 usec and still > occasionally > have problems from this running old kernels, because old i386 kernels > don't support CLOCK_REALTIME and old amd64 kernels have a broken > CLOCK_REALTIME in 32-bit mode). Entirely possible that's why things are more expensive on my test machine. % sysctl hw.model hw.model: AMD Athlon(tm) 64 Processor 3500+ % uname -a FreeBSD dev2.office.chittenden.org 7.0-RELEASE FreeBSD 7.0-RELEASE #0: Sun Feb 24 10:35:36 UTC 2008 root@driscoll.cse.buffalo.edu:/usr/ obj/usr/src/sys/GENERIC amd64 >> PS Is there a reason that time(3) can't be implemented in terms of >> clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to >> time_t's whole second resolution. > > I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the low > accuracy timers provided by the get*time() family are accurate enough > to give the time in seconds. Unfortunately, they are still broken -- > they are all incoherent relative to nanotime() and some are incoherent > relative to each other. CLOCK_SECOND can lag the time in seconds > given > by up to tc_tick/HZ seconds. This is because CLOCK_SECOND returns the > time in seconds at the last tc_windup(), so it misses seeing rollovers > of the second in the interval between the rollover and the next > tc_windup(), while nanotime() doesn't miss seeing these rollovers so > it gives incoherent times, with nanotime()/CLOCK_REALTIME being > correct > and time_second/CLOCK_SECOND broken. Interesting. Incoherent, but accurate enough? We're talking about a <10ms window of incoherency, right? >> % ./bench_time 9079882 | sort -rnk1 >> Timing micro-benchmark. 9079882 syscall iterations. >> Avg. us/call Elapsed Name >> 9.322484 84.647053 gettimeofday(2) >> 8.955324 81.313291 time(3) >> 8.648315 78.525684 clock_gettime(2/CLOCK_REALTIME) >> 8.598495 78.073325 clock_gettime(2/CLOCK_MONOTONIC) >> 0.674194 6.121600 clock_gettime(2/CLOCK_PROF) >> 0.648083 5.884515 clock_gettime(2/CLOCK_VIRTUAL) >> 0.330556 3.001412 clock_gettime(2/CLOCK_REALTIME_FAST) >> 0.306514 2.783111 clock_gettime(2/CLOCK_SECOND) >> 0.262788 2.386085 clock_gettime(2/CLOCK_MONOTONIC_FAST) > > These are very slow. Are they on a 486? :-) I get about 262 ns for > CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems. > The syscall overhead is about 200 nsec (170 nsec for a simpler syscall > and maybe 30 nsec extra for copyin/out for clock_gettime()) and > reading > the TSC timecounter adds another 60 nsec, including a whole 6 nsec for > the hardware part of the read (perhaps more like 30 nsec than 60 for > the > whoe read). The TSC doesn't work on all machines (never for SMP), but > this will hopefully change. (Phenom is supposed to have TSCs that are > coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+ > to implement this :-(. Core2 already has a 40+ cycles rdtsc, but > AFAIK > it doesn't have coherent TSCs.) Other timecounters are much slower > than > the TSC, but I haven't seen one take 8000 nsec since 486 days. *shrug* elapsed / number of calls. Not doing anything fancy here. > Some of my benchmark results: Can I run this same test/see how this was written? > This system has a fairly fast ACPI and i8254 timecounters. 1500-800 > nsec is more typical for ACPI-fast, and 4000-5000 is more typical > for i8254. ACPI-fast should be named ACPI-not-very-slow. ACPI-safe > is very slow, perhaps slower than i8254. i8254 could be made about > twice as fast if anyone cared. Hrm. % sysctl -a | grep -i acpi_timer machdep.acpi_timer_freq: 3579545 dev.acpi_timer.0.%desc: 24-bit timer at 3.579545MHz dev.acpi_timer.0.%driver: acpi_timer dev.acpi_timer.0.%location: unknown dev.acpi_timer.0.%pnpinfo: unknown dev.acpi_timer.0.%parent: acpi0 % sysctl -a | grep -i tsc kern.timecounter.choice: TSC(800) ACPI-safe(850) i8254(0) dummy(-1000000) kern.timecounter.tc.TSC.mask: 4294967295 kern.timecounter.tc.TSC.counter: 2749242907 kern.timecounter.tc.TSC.frequency: 2222000000 kern.timecounter.tc.TSC.quality: 800 kern.timecounter.smp_tsc: 0 machdep.tsc_freq: 2222000000 > Other implementation bugs (all in clock_getres()): > - all of the clock ids that use getnanotime() claim a resolution of 1 > nsec, but that us bogus. The actual resolution is more like > tc_tick/HZ. > The extra resolution in a struct timespec is only used to return > garbage related to the incoherency of the clocks. (If it could be > arranged that tc_windup() always ran on a tc_tick/HZ boundary, then > the clocks would be coherent and the times would always be a multiple > of tc_tick/HZ, with no garbage in low bits.) > - CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is > bogus. The actual resolution is more like 1/stathz, or perhaps 1 > microsecond. hz is irrelevant here since statclock ticks are used. > statclock ticks only have a resolution of 1/stathz, but if 1 nsec is > correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since > caclru() calculates the time to a resolution of 1 usec; it is just > very inaccurate at that resolution. > "Resolution" is a poor term for the functionality needed here. I > think > a hint about the accuracy is more important. In simple > implementations > using interrupts and ticks, the accuracy would be about the the same > as > the resolution, but FreeBSD is more complicated. Is there any reason that the garbage resolution can't be zero'ed out to indicate confidence of the kernel in the precision of the information? -sc -- Sean Chittenden sean@chittenden.org http://sean.chittenden.org/