Date: Sun, 29 Jul 2012 14:58:36 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: amd64@FreeBSD.org, Jung-uk Kim <jkim@FreeBSD.org>, Andriy Gapon <avg@FreeBSD.org>
Subject: Re: Use fences for kernel tsc reads.
Message-ID: <20120729123231.K1193@besplex.bde.org>
In-Reply-To: <20120728160202.GI2676@deviant.kiev.zoral.com.ua>
References: <20120728160202.GI2676@deviant.kiev.zoral.com.ua>
On Sat, 28 Jul 2012, Konstantin Belousov wrote:

> This was discussed on a somewhat unwieldy thread on svn-src@ as a
> followup to the commit r238755 which uncovered the problem in the
> first place.
>
> Below is the commit candidate.  It changes the fix in r238755 to use
> CPUID instead of rmb(), since rmb() translates to a locked operation
> on i386, and experimentation showed that Nehalem's RDTSC can pass
> LOCK.

At least remove TSC-low instead of building another layer of
pessimizations and source bloat on top of it.  I might not mind even
more source code bloat for TSC-low and fences if it came with runtime
tests to avoid using them unless necessary.  At least provide sysctls
to avoid using them.  When the kernel selects TSC-low, the ordinary
TSC becomes unavailable.

> ...
> Handling of usermode will follow later.

I hesitate to mention that this doesn't pessimize all uses of rdtsc:
- x86/isa/clock.c uses the raw rdtsc32() for DELAY().
- x86/x86/mca.c uses the raw rdtsc() for mca_check_status().
- x86/x86/tsc.c uses the raw rdtsc() for:
  - calibrating the TSC frequency.  The calibration is otherwise
    sloppy and would only be slightly affected by additional
    sloppiness or its removal.
  - the CPU ticker.  A function pointer is used and the address of
    the static inline rdtsc() is taken, so it becomes not so raw --
    a normal static function.

  I don't like the CPU ticker.  It gives huge inaccuracies in some
  cases.  Most uses of it go through calcru(), which ameliorates some
  of the bugs.  It enforces monotonicity of rusage times for other
  reasons.  When the inaccuracies or overflow cause the time to go
  backwards, calcru() complains about some cases.  This is about the
  only place in the kernel that checks for the time going backwards.

  Most uses of the CPU ticker are for setting switchtime on context
  switches.  These uses are far apart, near heavy locking, and on a
  single CPU, so the TSC as read by them can probably never go
  backwards.  (The same magic that allows switching to be per-CPU
  should allow the TSC to be per-CPU.)
- {amd64,i386}/include/cpu.h uses the CPU ticker for get_cyclecount().

  I don't like get_cyclecount().  It is a bad API resulting from
  previous mistakes in this area.  It is basically the TSC spelled
  differently, so that the TSC could be used when the TSC timecounter
  was not properly exported or the hardware doesn't have it.  Not all
  x86 hardware has it, and non-x86 hardware might name it differently
  even if it has it.  When the hardware doesn't have it, a weak
  replacement like the current timecounter is used.  But this makes
  it too slow to use, since all places that use it do so because they
  don't want slowness (otherwise they would just use a timecounter
  directly).  The selection used to be ifdefed in i386/include/cpu.h.
  Now, it just uses the CPU ticker on i386 (it still uses a raw
  rdtsc() on amd64), roughly as sketched below.  It is actually
  documented in section 9, but the much more important CPU ticker
  isn't.  Its comment in at least the i386 machine/cpu.h still says
  that it returns "bogo-time" for random harvesting purposes, but the
  CPU ticker time is not so bogus (it must be good enough for thread
  accounting, using deltas on the same CPU), and of course this
  mistake is now used for more than random harvesting.
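  Something like this, from my reading of the current headers (a
  sketch from memory, so details may be off):

	/* amd64/include/cpu.h: effectively the raw TSC. */
	static __inline uint64_t
	get_cyclecount(void)
	{

		return (rdtsc());
	}

	/* i386/include/cpu.h: through the CPU ticker's function pointer. */
	static __inline uint64_t
	get_cyclecount(void)
	{

		return (cpu_ticks());
	}

  The i386 version pays for a non-inline call and an indirect jump on
  every use, which matters for the harvesting and ktr uses below.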
  On other arches, the implementation of get_cyclecount() is
  differently bogus:
  - on amd64, it is an inline function that calls the inline rdtsc()
  - on arm, it is an inline function that always uses binuptime() and
    never goes through the CPU ticker
  - on ia64, it is #defined as a raw counter API spelled without `()'
    so that it looks like a variable, but I guess it is a function
  - on i386, it is an inline function that calls the non-inline
    cpu_ticks(), so it is slower than the old ifdefed version.
    cpu_ticks() goes through at least a function pointer to reach the
    non-inline rdtsc().
  - on mips, it is #defined as a raw counter API spelled with `()'
  - on powerpc, it is a dedicated inline function with inline asm
  - on sparc64, it is an inline function that calls a raw counter API

  That is a lot of differently spelled bloat to support an API that
  should never have existed.  All arches could have just the same
  slowness as i386, with no MD code, by translating it to cpu_ticks()
  as on i386.  Or callers of it could abuse cpu_ticks() directly.

  get_cyclecount() has escaped from random harvesting to the
  following places:
  - dev/de/if_devar.h: used for timestamps under a performance-testing
    option that seems not to be the default (so hardly anyone except
    its developer ever used it).  Old (10 or 100 Mbps here?) NIC
    drivers don't need very efficient timestamps, and it would be
    nice if they weren't in unscaled bogo-time.  I use microtime() in
    not-so-old (1 Gbps) NIC drivers.  With a TSC this works well for
    timestamping events a few usec apart.  It wouldn't work so well
    with an APIC timecounter taking about 1 usec to read.  But the
    APIC timecounter should be fast enough for this use at 100 Mbps.
    The i8254 timecounter is 5 times slower again, but fast enough at
    10 Mbps (original de?).
  - dev/random/harvest.c: the original use.  A cycle counter seems
    like an especially bad choice for randomness.  It has just a bit
    or two of randomness in its low bits.  The serialization bug
    makes it better for this.
  - dev/random/randomdev_soft.c: another original use.
  - kern/init_main.c: seeding for randomdev.
  - kern/kern_ktr.c: for timestamps that should at least be monotonic
    and unique, even if they are in unscaled bogo-time.
    get_cyclecount() is highly unsuitable for this, especially when
    it uses a timecounter (don't put any ktr calls in timecounter
    code, else you might get endless recursion).  Apparently,
    get_cyclecount() was used because it was the only portable API to
    reach the hardware counter.  For this use, get_cyclecount()
    should not be faked.  Serializing might be needed here.  TSC-low
    lossage is unwanted here -- it might break uniqueness.
  - netinet/sctp_os_bsd.h: for tracing.  Like ktr, except it doesn't
    need to worry about recursion.  Perhaps it can be more like NIC
    drivers should be and just use a timecounter.

  Anyway, the unserialized TSC serves for providing bogo-time for
  get_cyclecount() even better than it serves for providing time for
  the CPU ticker.
- i386/i386/machdep.c uses the raw rdtsc() for recalibrating the TSC
  frequency.  Similar sloppiness to the initial calibration.
- i386/i386/perfmon.c: TSC ticks are just an especially important
  event.  Serializing would break performance testing by changing the
  performance significantly.  But since this is only reached through
  a user API, the breakage from the increased overhead would be
  dominated by existing overheads, as for the clock_gettime()
  syscall.
- include/xen/xen-os.h: a home-made version of rdtsc() with a
  different spelling.  To change the spelling, it should use the
  standard inline and not repeat the asm.
- isa/prof_machdep.c: like perfmon, but now rdtsc() is called on
  every function call and return for high-resolution profiling, and
  there is no slow syscall to already slow this down.  Function call
  lengths down to about 10 nsec can be measured reliably on average.
  The overhead is already much more than this, but is compensated
  for, so lengths of 10 nsec might remain measurable; still, the
  overhead is already too much, and adding to it wouldn't help and
  might disturb the performance too much.  This use really wants
  serialization, and out-of-order timestamps might show up as
  negative function call times, but they never have.
  High-resolution kernel profiling never worked for SMP (except in my
  version, where it works except for excessive lock contention) and
  was broken by gcc-4 changing the default compiler options, so this
  problem can be ignored for now.

Please trim lots of the above if you reply.
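For reference, all of the above bottoms out in the raw cpufunc.h
inlines, which look roughly like this (again a sketch from memory):

	/*
	 * Nothing here orders the read against earlier or later
	 * instructions; that is the whole problem being discussed.
	 */
	static __inline uint64_t
	rdtsc(void)
	{
		uint32_t low, high;

		__asm __volatile("rdtsc" : "=a" (low), "=d" (high));
		return (low | ((uint64_t)high << 32));
	}

	static __inline uint32_t
	rdtsc32(void)
	{
		uint32_t rv;

		__asm __volatile("rdtsc" : "=a" (rv) : : "edx");
		return (rv);
	}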
> diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
> index c253a96..101cbb3 100644
> --- a/sys/x86/x86/tsc.c
> +++ b/sys/x86/x86/tsc.c
> ...
> @@ -328,15 +344,26 @@ init_TSC(void)
>
>  #ifdef SMP
>
> -/* rmb is required here because rdtsc is not a serializing instruction. */
> +/*
> + * RDTSC is not a serializing instruction, so we need to drain
> + * instruction stream before executing it. It could be fixed by use of
> + * RDTSCP, except the instruction is not available everywhere.

Sentence breaks are 2 spaces in KNF.  There was a recent thread about
this.  I used to police style in this file more, and it mostly uses 2
spaces.

> + *
> + * Use CPUID for draining. The timecounters use MFENCE for AMD CPUs,
> + * and LFENCE for others (Intel and VIA) when SSE2 is present, and
> + * nothing on older machines which also do not issue RDTSC
> + * prematurely. There, testing for SSE2 and vendor is too cumbersome,
> + * and we learn about TSC presence from CPUID.
> + */

RDTSC apparently doesn't need full serialization (else fences
wouldn't be enough, and the necessary serialization would be slower).
The necessary serialization is very complex and not fully described
above, but the above is not a good place to describe many details.
It already describes too much.

It is obvious that we shouldn't use large code or ifdefs to be
portable in places where we don't care at all about efficiency.  Only
delicate timing requirements would prevent us using the simple CPUID
in such code (see the sketch below).  For example, in code very much
like this, we might need to _not_ slow things down, since we might
want to see small differences between the TSCs on different CPUs.

> @@ -592,23 +628,55 @@ sysctl_machdep_tsc_freq(SYSCTL_HANDLER_ARGS)
>  SYSCTL_PROC(_machdep, OID_AUTO, tsc_freq, CTLTYPE_U64 | CTLFLAG_RW,
>      0, 0, sysctl_machdep_tsc_freq, "QU", "Time Stamp Counter frequency");
>
> -static u_int
> +static inline u_int
>  tsc_get_timecount(struct timecounter *tc __unused)
>  {
>
>  	return (rdtsc32());
>  }

Adding `inline' gives up to 5 (!) style bugs at different levels:
- `inline' is usually spelled `__inline'.  The former is now
  standard, but sys/cdefs.h never caught up with this and has better
  magic for the latter.  And `inline' doesn't match most existing
  spellings.
- the function is still forward-declared without `inline'.  The
  difference is confusing.
- forward declaration of inline functions is nonsense.  It sort of
  asks for early uses to be non-inline and late uses to be inline.
  But gcc now doesn't care much about the order, since
  -funit-at-a-time is the default.
- this function is never actually inline.  You added calls to the
  others but not to this one.
- if this function were inline, then inlining might happen
  automatically due to -finline-functions-called-once.  See below.
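(Before going on: the simple CPUID form mentioned above would be
something like this.  The function name is invented, and do_cpuid()
is the existing cpufunc.h inline:)

	static u_int
	tsc_get_timecount_cpuid(struct timecounter *tc __unused)
	{
		u_int regs[4];

		/*
		 * CPUID is fully serializing.  That is more ordering
		 * than the fences give, and much slower.
		 */
		do_cpuid(0, regs);
		return (rdtsc32());
	}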
>
> -static u_int
> +static inline u_int
>  tsc_get_timecount_low(struct timecounter *tc)
>  {
>  	uint32_t rv;
>
>  	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> -	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> +	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>  	return (rv);
>  }

- as above, except now the function can actually be inline.  It is
  still dereferenced, so it is not actually inline in all cases.  You
  added 2 calls to it, so it is inlined in 2 cases.  gcc now inlines
  functions called once, especially when you don't want this.  Here
  this is wanted.  I don't know if gcc considers a use to be "once"
  if there is a single inlineable use and also a non-inlineable use.
  I miscounted the number of uses at first.  There are now 2
  inlineable cases, and although inlining would be good, it is no
  longer "once", so gcc shouldn't do it without an explicit __inline.

>
> +static inline u_int
> +tsc_get_timecount_lfence(struct timecounter *tc __unused)
> +{
> +
> +	lfence();
> +	return (rdtsc32());
> +}

Here and for mfence, you call the raw function.  That's why
tsc_get_timecount() is never inline.  I first thought that the patch
was missing support for the plain TSC.

I also don't like the rdtsc32() API :-).  Since it is inline,
compilers should be able to ignore the high bits in rdtsc().  But
we're micro-optimizing this to get nice-looking code and save a whole
cycle or so.  I'm normally happy to sacrifice a cycle for cleaner
source code, but not 4 cycles :-).  If the extra code were before the
fence, then its overhead would be lost in the stalling for the fence.
Perhaps similarly without fences.  Since it is after, it is more
likely to have a full cost, with a latency of many more than 1 cycle
now not hidden by out-of-order execution.

> +
> +static inline u_int
> +tsc_get_timecount_low_lfence(struct timecounter *tc)
> +{
> +
> +	lfence();
> +	return (tsc_get_timecount_low(tc));
> +}
> +
> +static inline u_int
> +tsc_get_timecount_mfence(struct timecounter *tc __unused)
> +{
> +
> +	mfence();
> +	return (rdtsc32());
> +}
> +
> +static inline u_int
> +tsc_get_timecount_low_mfence(struct timecounter *tc)
> +{
> +
> +	mfence();
> +	return (tsc_get_timecount_low(tc));
> +}
> +
>  uint32_t
>  cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th)
>  {
>
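(For reference, I assume the new lfence() and mfence() are, or will
be, the obvious cpufunc.h inlines, roughly:)

	static __inline void
	lfence(void)
	{

		/*
		 * The "memory" clobber only constrains the compiler;
		 * the instruction itself constrains the CPU.
		 */
		__asm __volatile("lfence" : : : "memory");
	}

	static __inline void
	mfence(void)
	{

		__asm __volatile("mfence" : : : "memory");
	}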
I don't like the indirections for calling these timecounter methods,
but we already have them.  Linux would modify the instruction stream
according to a runtime probe to reduce it to an inline rdtsc in
binuptime() etc. if possible (otherwise modify it to add fences as
necessary).  Not inlining rdtsc like we do probably reduces
serialization problems.

Bruce