Date: Thu, 17 Mar 2011 23:42:22 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Jung-uk Kim <jkim@FreeBSD.org> Cc: src-committers@FreeBSD.org, Roman Divacky <rdivacky@FreeBSD.org>, Bruce Evans <brde@optusnet.com.au>, svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, Maxim Dounin <mdounin@mdounin.ru> Subject: Re: svn commit: r219679 - head/sys/i386/include Message-ID: <20110317213445.U1128@besplex.bde.org> In-Reply-To: <201103161634.08104.jkim@FreeBSD.org> References: <201103152145.p2FLjAlt060256@svn.freebsd.org> <201103161233.16347.jkim@FreeBSD.org> <20110316174553.GA6367@freebsd.org> <201103161634.08104.jkim@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 16 Mar 2011, Jung-uk Kim wrote: > On Wednesday 16 March 2011 01:45 pm, Roman Divacky wrote: >> On Wed, Mar 16, 2011 at 12:32:56PM -0400, Jung-uk Kim wrote: >>> On Tuesday 15 March 2011 08:45 pm, Maxim Dounin wrote: >>>> This isn't really different as long as GENERIC kernel used, as >>>> GENERIC defines I486_CPU. >>> >>> Fixed in r219698, sorry. >>> >>> Actually, I think we should remove i486 from GENERIC at some >>> point. It has too many limitations. For example, I really love >>> to implement atomic 64-bit mem read/write using cmpxchg8b (no >>> 0xf00f joke, please) but I cannot do that cleanly without >>> removing I486 support or checking cpu_class at run-time. :-( I'm not sure how f00f applies to cmpxchg8b, but we still maintain the f00f workaround (perhaps not well enough to keep it actually working) though there might be more i486's still running FreeBSD than there are i586's with the f00f bug, since i486's might be reimplemented into embedded systems but i586's with the f00f bug might have been replaced and reimplementations of i586's wouldn't implement the f00f bug. >> if we drop i486 I think it makes sense to require something that >> has at least SSE2, thus we can have the same expectations as on >> amd64. >> >> and we can use sse2 unconditionally (str*, mem* etc.) > > This is a proof-of-concept patch for sys/x86/isa/clock.c: > > http://people.freebsd.org/~jkim/clock.diff Please include patches in mail. It is hard to refer to lines in urls. > You see the complexity, just because I wanted to load 64-bit value > atomically... :-( I still think it should use a generation count. (It can't use a mutex due to deadlock and other problems.) % Index: sys/x86/isa/clock.c % =================================================================== % --- sys/x86/isa/clock.c (revision 219697) % +++ sys/x86/isa/clock.c (working copy) % @@ -62,6 +62,7 @@ __FBSDID("$FreeBSD$"); % % #include <machine/clock.h> % #include <machine/cpu.h> % +#include <machine/cputypes.h> % #include <machine/intr_machdep.h> % #include <machine/ppireg.h> % #include <machine/timerreg.h> % @@ -245,40 +246,85 @@ getit(void) % return ((high << 8) | low); % } % % -static __inline void % -delay_tsc(int n) % +#ifdef __i386__ % +static __inline uint64_t % +atomic_fetch_quad_i386(uint64_t *p) % { % - uint64_t start, end, now; % + uint64_t v; % % - sched_pin(); % - start = rdtsc(); % - end = start + (tsc_freq * n) / 1000000; This reminds me to point out that DELAY() still has the old i386 code to manually optimize the division corresponding to this, for the i8254 case with n < 256. Plain i386 is no longer supported, and even i486's aren't so slow, and newer CPUs have much faster divisions, and anyway, gcc has been doing similar optimizations automatically for more than 15 (?) years -- it will turn 32-bit division into a multiplication and shift if possible. It only does this if this if it is possible to do exactly, while the manual code doesn't need to be very precise so may be able to do it when gcc can't. However, the general case is a 64 by 32 bit division, and gcc still generates very slow code for this (a __udivdi3 call). This takes about 65 cycles on an Athlon64. Since this is below the resolution of DELAY() on an Athlon64 (even underclocked to not less than 33 MHz), it doesn't matter. It might be just enough to worry about on an i486. But the optimization belongs in gcc and __udivdi3. The i8254 code uses the general case for n >= 256 since even i386/16's can execute __udivdi3 in 256 uS. % - do { % - cpu_spinwait(); % - now = rdtsc(); % - } while (now < end || (now > start && end < start)); % - sched_unpin(); % + /* i486 does not support SMP. */ % + __asm __volatile( % + " pushfl ; " % + " cli ; " % + " movl (%1), %%eax ; " % + " movl 4(%1), %%edx ; " % + " popfl" % + : "=&A" (v) : "r" (p)); % + return (v); % } Asms in *.c are style bugs. intr_disable() and intr_restore() exist as asms in cpufunc.h so that no asm in *.c is needed for cli/sti. % % -static __inline void % -delay_timecounter(struct timecounter *tc, int n) % +static __inline uint64_t % +atomic_fetch_quad_i586(uint64_t *p) % { % - uint64_t end, now; % + uint64_t v; % + % + __asm __volatile( % + " movl %%ebx,%%eax ; " % + " movl %%ecx,%%edx ; " % + " " MPLOCKED " cmpxchg8b (%1)" % + : "=&A" (v) : "r" (p) : "cc"); % + return (v); % +} This should be in atomic.h, with the runtime feature test there. atomic.h still has the compile-time feature test CPU_DISABLE_CMPXCHG which gives a version of cmpset32 using cli/sti, though it no longer has the i386 ifdef for this. % + % +static __inline uint64_t % +_fetch_frequency(uint64_t *p) % +{ % + % + if (cpu_class == CPUCLASS_486) % + return (atomic_fetch_quad_i386(p)); % + return (atomic_fetch_quad_i586(p)); % +} cpu_class should never be used. This should test the cpu feature for cmpxchg8b. Athlon64 docs say that cmpxchg8b can give "Invalid opcode", if its bit is not set in cpu feature, so it is not clear if cmpxchg8b is available even on all Athlon64's. You have a problem if any CPU that supports SMP doesn't support cmpxchg8b. % +#else % +#define _fetch_frequency(p) (*(p)) % +#endif % + % +static __inline int % +_delay(int n) % +{ % + struct timecounter *tc; % + uint64_t end, freq, now; % u_int last, mask, u; % + int use_tsc; % % - mask = tc->tc_counter_mask; % - last = tc->tc_get_timecount(tc) & mask; % - end = tc->tc_frequency * n / 1000000; % + tc = timecounter; % + freq = _fetch_frequency(&tsc_freq); % + use_tsc = tsc_is_invariant && freq != 0; % + if (use_tsc) { % + mask = ~0u; % + sched_pin(); % + last = rdtsc(); % + } else { % + if (tc->tc_quality <= 0) % + return (0); % + freq = _fetch_frequency(&tc->tc_frequency); % + mask = tc->tc_counter_mask; % + last = tc->tc_get_timecount(tc); % + } % + last &= mask; % + end = freq * n / 1000000; % now = 0; % do { % cpu_spinwait(); % - u = tc->tc_get_timecount(tc) & mask; % + u = (use_tsc ? rdtsc() : tc->tc_get_timecount(tc)) & mask; % if (u < last) % now += mask - last + u + 1; % else % now += u - last; % last = u; % } while (now < end); % + if (use_tsc) % + sched_unpin(); % + return (1); % } % % /* This does seem simpler than a generation count, since it doesn't require modifying places that set the code, but I don't see how it can work without modifying other places. There seems to be nothing to prevent atomic_fetch_quad_i586 reading 2 identical half-initialized values. Apart from minor races involving the modification running slow than the cmpxchg8b can see any changes, the modification may be preempted while half done, giving a very large race window. When the kernel was not preemptable, the race was limited to interrupt handlers. Now it extends to all threads that don't block on the lock being used by the sysctl. But!, when the kernel was not preemptable, there was no CPU running faster than 4GHz, and tsc_freq was only 32 bits, so there was no problem with updating it atomically enough (we don't care about using stale values). There is probably still no 386 running faster than 4GHz, and only a few overclocked newer ones running fast than 4HGz. Thus the race is mostly in the future, and we don't need these complications yet for the non-i486 case or probably ever for the i486 case. The problem can be pushed to much higher than 4 GHz by scaling tsc_freq a little; make tsc_freq the actual frequency divided by 4, and use rdtsc() / 4 instead of rdtsc() to reach 16 GHz with a 32-bit tsc_freq. Only urandom would notice the lost 2 bits, and it wouldn't divide by 4. % @@ -289,7 +335,6 @@ getit(void) % void % DELAY(int n) % { % - struct timecounter *tc; % int delta, prev_tick, tick, ticks_left; % % #ifdef DELAYDEBUG % @@ -298,15 +343,8 @@ DELAY(int n) % static int state = 0; % #endif % % - if (tsc_freq != 0) { % - delay_tsc(n); % + if (_delay(n)) % return; % - } % - tc = timecounter; % - if (tc->tc_quality > 0) { % - delay_timecounter(tc, n); % - return; % - } % #ifdef DELAYDEBUG % if (state == 0) { % state = 1; Still have the delay before code to debug it. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20110317213445.U1128>