Date: Mon, 08 Nov 2004 21:28:16 -0500
From: Stephan Uphoff <ups@tree.com>
To: John Baldwin <jhb@FreeBSD.org>
Cc: Robert Watson <rwatson@FreeBSD.org>
Subject: Re: cvs commit: src/sys/i386/i386 pmap.c
Message-ID: <1099967296.24619.745.camel@palm.tree.com>
In-Reply-To: <200411081317.52153.jhb@FreeBSD.org>
References: <Pine.NEB.3.96L.1041105102349.90766E-100000@fledge.watson.org> <1099779852.8097.68.camel@palm.tree.com> <200411081317.52153.jhb@FreeBSD.org>
On Mon, 2004-11-08 at 13:17, John Baldwin wrote:
> On Saturday 06 November 2004 05:24 pm, Stephan Uphoff wrote:
> > On Fri, 2004-11-05 at 07:01, Robert Watson wrote:
> > > On Fri, 29 Oct 2004, Mike Silbersack wrote:
> > > > I think we really need some sort of light-weight critical_enter that
> > > > simply assures you that you won't get rescheduled to another CPU, but
> > > > gives no guarantees beyond that.
> > >
> > > <snip>
> > >
> > > > Er, wait - I guess I'm forgetting something, there exists the
> > > > potential for the interrupt that preempted whatever was calling
> > > > arc4random to also call arc4random, thereby breaking things...
> > >
> > > I've been looking at related issues for the last couple of days and
> > > must have missed this thread while at EuroBSDCon.  Alan Cox pointed me
> > > at it, so here I am. :-)
> > >
> > > Right now, the cost of acquiring and dropping an uncontended sleep
> > > mutex on a UP kernel is very low -- about 21 cycles on my PIII and 40
> > > on my P4, including some efficiency problems in my measurement which
> > > probably add non-trivial overhead.  Compare this with the SMP versions
> > > on the PIII (90 cycles) and P4 (260 cycles!).  Critical sections on
> > > the SMP PIII are about the same cost as the SMP mutex, but on the P4 a
> > > critical section is less than half the cost.  Getting to a model where
> > > critical sections are as cheap as UP sleep mutexes, or where we could
> > > use a similar combination of primitives (such as UP mutexes with
> > > pinning), would be very useful.  Otherwise, optimizing through use of
> > > critical sections will improve SMP but potentially damage performance
> > > on UP.  There has been a fair amount of discussion of such approaches,
> > > including the implementation briefly present in FreeBSD.  I know John
> > > Baldwin and Justin Gibbs both have theories and plans in this area.
> > >
> > > If we do create a UP mutex primitive for use on SMP, I would suggest
> > > we actually expand the contents of the UP mutex structure slightly to
> > > include a CPU number that can be asserted, along with pinning, when an
> > > operation is attempted and INVARIANTS is present.  One of the great
> > > strengths of the mutex/lock model is a strong assertion capability,
> > > both for documentation and testing purposes, so we should make sure
> > > that carries over into any new synchronization primitives.
> > >
> > > A small table of synchronization primitives is below; in each case,
> > > the count is in cycles and reflects the cost of acquiring and dropping
> > > the primitive (lock+unlock, enter+exit).  The P4 is a 3GHz box, and
> > > the PIII is an 800MHz box.  Note that the synchronization primitives
> > > requiring atomic operations are substantially pessimized on the P4 vs
> > > the PIII.
> > >
> > > A discussion with John Baldwin and Scott Long yesterday revealed that
> > > the UP spin mutex is currently pessimized from a critical section to a
> > > critical section plus mutex internals, due to a need for mtx_owned()
> > > on spin locks.  I'm not convinced that explains the entire performance
> > > irregularity I see for P4 spin mutexes on UP, however.  Note that 39
> > > (P4 UP sleep mutex) + 120 (P4 UP critical section) falls short of 274
> > > (P4 UP spin mutex) by a fair amount.  Figuring out what's going on
> > > there would be a good idea, although it could well be a property of my
> > > measurement environment.
> > > I'm currently using this to do measurements:
> > >
> > > //depot/user/rwatson/percpu/sys/test/test_synch_timing.c
> > >
> > > In all of the below, the standard deviation is very small if you're
> > > careful about not bumping into hard clock or other interrupts during
> > > testing, especially when it comes to spin mutexes and critical
> > > sections.
> > >
> > > Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> > > robert@fledge.watson.org      Principal Research Scientist, McAfee
> > >                               Research
> > >
> > >          sleep mutex      crit section     spin mutex
> > >          UP     SMP       UP     SMP       UP     SMP
> > > PIII     21     90        83     81        112    141
> > > P4       39     260       120    119       274    342
> >
> > Nice catch!
> > On UP, releasing a spin mutex involves an xchgl operation, while
> > releasing an uncontested sleep mutex uses cmpxchgl.
> > Since xchgl does an implicit LOCK (and cmpxchgl does NOT), this could
> > explain why the spin mutex needs a lot more cycles.
> > This should be easy to fix, since the xchgl is not needed on a UP
> > system.
> > Right now I am sick and don't trust my own code, so I won't write a
> > patch for the next few days ... hopefully someone else can get to it
> > first.
>
> I've tried changing the store_rel() to just do a simple store, since
> writes are ordered on x86, but benchmarks on SMP showed that it actually
> hurt.

Strange - any idea why?
Can you recommend a specific benchmark for this? I would like to try some
variations.

> However, it would probably be good to at least do that for UP.  The
> current patch to do it for all kernels is:
>
> --- //depot/vendor/freebsd/src/sys/i386/include/atomic.h	2004/03/12 21:50:47
> +++ //depot/projects/smpng/sys/i386/include/atomic.h	2004/08/02 15:16:35
> @@ -69,7 +69,7 @@
>  
>  int	atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src);
>  
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)				\
> +#define ATOMIC_STORE_LOAD(TYPE, LOP)					\
>  u_##TYPE	atomic_load_acq_##TYPE(volatile u_##TYPE *p);		\
>  void		atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)
>  
> @@ -175,12 +175,12 @@
>  #if defined(I386_CPU)
>  
>  /*
> - * We assume that a = b will do atomic loads and stores.
> - *
> - * XXX: This is _NOT_ safe on a P6 or higher because it does not guarantee
> - * memory ordering.  These should only be used on a 386.
> + * We assume that a = b will do atomic loads and stores.  However, on a
> + * PentiumPro or higher, reads may pass writes, so for that case we have
> + * to use a serializing instruction (i.e. with LOCK) to do the load.  For
> + * the 386 case we can use a simple read since 386s don't support SMP.
>  */
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)			\
> +#define ATOMIC_STORE_LOAD(TYPE, LOP)				\
>  static __inline u_##TYPE					\
>  atomic_load_acq_##TYPE(volatile u_##TYPE *p)			\
>  {								\
> @@ -190,14 +190,14 @@
>  static __inline void						\
>  atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)	\
>  {								\
> +	__asm __volatile("" : : : "memory");			\
>  	*p = v;							\
> -	__asm __volatile("" : : : "memory");			\
>  }								\
>  struct __hack
>  
>  #else /* !defined(I386_CPU) */
>  
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)			\
> +#define ATOMIC_STORE_LOAD(TYPE, LOP)				\
>  static __inline u_##TYPE					\
>  atomic_load_acq_##TYPE(volatile u_##TYPE *p)			\
>  {								\
> @@ -211,16 +211,11 @@
>  	return (res);						\
>  }								\
>  								\
> -/*								\
> - * The XCHG instruction asserts LOCK automagically.		\
> - */								\
>  static __inline void						\
>  atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)	\
>  {								\
> -	__asm __volatile(SOP					\
> -	: "+m" (*p),			/* 0 */			\
> -	  "+r" (v)			/* 1 */			\
> -	: : "memory");						\
> +	__asm __volatile("" : : : "memory");			\
> +	*p = v;							\
>  }								\
>  struct __hack
>  
> @@ -230,7 +225,7 @@
>  
>  extern int atomic_cmpset_int(volatile u_int *, u_int, u_int);
>  
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)			\
> +#define ATOMIC_STORE_LOAD(TYPE, LOP)				\
>  extern u_##TYPE	atomic_load_acq_##TYPE(volatile u_##TYPE *p);	\
>  extern void	atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)
>  
> @@ -258,10 +253,10 @@
>  ATOMIC_ASM(add,	     long,  "addl %1,%0",  "ir",  v);
>  ATOMIC_ASM(subtract, long,  "subl %1,%0",  "ir",  v);
>  
> -ATOMIC_STORE_LOAD(char,	"cmpxchgb %b0,%1", "xchgb %b1,%0");
> -ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1", "xchgw %w1,%0");
> -ATOMIC_STORE_LOAD(int,	"cmpxchgl %0,%1",  "xchgl %1,%0");
> -ATOMIC_STORE_LOAD(long,	"cmpxchgl %0,%1",  "xchgl %1,%0");
> +ATOMIC_STORE_LOAD(char,	"cmpxchgb %b0,%1");
> +ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1");
> +ATOMIC_STORE_LOAD(int,	"cmpxchgl %0,%1");
> +ATOMIC_STORE_LOAD(long,	"cmpxchgl %0,%1");
>  
>  #undef ATOMIC_ASM
>  #undef ATOMIC_STORE_LOAD
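
For reference, stripped of the macro machinery, the two store_rel()
variants being compared boil down to the following (a minimal sketch;
the function names are mine, not the kernel's):

/*
 * Old behavior: xchgl asserts LOCK implicitly, so the store is fully
 * serializing -- expensive on the P4, even on UP where the
 * serialization buys nothing.
 */
static __inline void
store_rel_xchg(volatile u_int *p, u_int v)
{
	__asm __volatile("xchgl %1,%0"
	: "+m" (*p),		/* 0 */
	  "+r" (v)		/* 1 */
	: : "memory");
}

/*
 * New behavior: x86 does not reorder a store with earlier loads or
 * stores, so release semantics only require that the compiler not
 * move prior accesses past the store.  A compiler-only barrier (an
 * empty asm with a "memory" clobber) is enough; no locked
 * instruction is issued.
 */
static __inline void
store_rel_plain(volatile u_int *p, u_int v)
{
	__asm __volatile("" : : : "memory");
	*p = v;
}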
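
On the benchmark question above: I don't have Robert's
test_synch_timing.c in front of me, but the basic shape of such a
measurement is just an rdtsc-bracketed loop.  Something like this
userland sketch (illustrative only; the names and the placeholder
lock/unlock pair are mine):

#include <stdio.h>

/*
 * Read the CPU timestamp counter.  On i386 the "=A" constraint
 * returns the edx:eax pair as a 64-bit value.
 */
static __inline unsigned long long
rdtsc(void)
{
	unsigned long long tsc;

	__asm __volatile("rdtsc" : "=A" (tsc));
	return (tsc);
}

#define	ITERATIONS	1000000

int
main(void)
{
	unsigned long long start, end;
	int i;

	start = rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		/* Lock/unlock of the primitive under test goes here. */
		__asm __volatile("");	/* keep the loop from being optimized away */
	}
	end = rdtsc();

	/*
	 * As Robert notes, hardclock or other interrupts landing inside
	 * the loop skew the result, so take the minimum over several runs.
	 */
	printf("%llu cycles per iteration\n", (end - start) / ITERATIONS);
	return (0);
}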
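
And going back to Robert's point about assertable UP mutexes: the sort
of thing he describes might look roughly like this (a sketch only; the
structure and macro names are made up, and a real design would need
more thought):

/*
 * A UP mutex that remembers which CPU it was acquired on, so that
 * with INVARIANTS we can assert both that the holder is pinned and
 * that it is still on the same CPU.
 */
struct up_mtx {
	volatile u_int	um_lock;	/* lock word */
#ifdef INVARIANTS
	u_int		um_cpu;		/* CPU the lock was acquired on */
#endif
};

#ifdef INVARIANTS
#define	UP_MTX_ASSERT(m) do {						\
	KASSERT(curthread->td_pinned > 0,				\
	    ("up_mtx used without pinning"));				\
	KASSERT((m)->um_cpu == PCPU_GET(cpuid),				\
	    ("up_mtx used from the wrong CPU"));			\
} while (0)
#else
#define	UP_MTX_ASSERT(m)
#endif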