Date: Sat, 06 Nov 2004 17:24:12 -0500
From: Stephan Uphoff <ups@tree.com>
To: Robert Watson <rwatson@FreeBSD.org>
Cc: cvs-all@FreeBSD.org
Subject: Re: cvs commit: src/sys/i386/i386 pmap.c
Message-ID: <1099779852.8097.68.camel@palm.tree.com>
In-Reply-To: <Pine.NEB.3.96L.1041105102349.90766E-100000@fledge.watson.org>
References: <Pine.NEB.3.96L.1041105102349.90766E-100000@fledge.watson.org>

On Fri, 2004-11-05 at 07:01, Robert Watson wrote:
> On Fri, 29 Oct 2004, Mike Silbersack wrote:
>
> > I think we really need some sort of light-weight critical_enter that
> > simply assures you that you won't get rescheduled to another CPU, but
> > gives no guarantees beyond that.
> <snip>
> > Er, wait - I guess I'm forgetting something, there exists the potential
> > for the interrupt that preempted whatever was calling arc4random to also
> > call arc4random, thereby breaking things...
>
> I've been looking at related issues for the last couple of days and must
> have missed this thread while at EuroBSDCon. Alan Cox pointed me at it,
> so here I am. :-)
>
> Right now, the cost of acquiring and dropping an uncontended sleep mutex
> on a UP kernel is very low -- about 21 cycles on my PIII and 40 on my P4,
> including some efficiency problems in my measurement which probably add a
> non-trivial overhead. Compare this with the SMP versions on the PIII (90
> cycles) and P4 (260 cycles!). Critical sections on the SMP PIII are about
> the same cost as the SMP mutex, but on the P4 a critical section is less
> than half the cost. Getting to a model where critical sections were as
> cheap as UP sleep mutexes, or where we could use a similar combination of
> primitives (such as UP mutexes with pinning) would be very useful.
> Otherwise, optimizing through use of critical sections will improve SMP
> but potentially damage performance on UP. There's been a fair amount of
> discussion of such approaches, including the implementation briefly
> present in the FreeBSD. I know John Baldwin and Justin Gibbs both have
> theories and plans in this area.
>
> If we do create a UP mutex primitive for use on SMP, I would suggest we
> actually expand the contents of the UP mutex structure slightly to include
> a cpu number that can be asserted, along with pinning, when an operation
> is attempted and INVARIANTS is present. One of the great strengths of the
> mutex/lock model is a strong assertion capability, both for the purposes
> of documentation and testing, so we should make sure that carries into any
> new synchronization primitives.
>
> Small table of synchronization primitives below; in each case, the count
> is in cycles and reflects the cost of acquiring and dropping the primitive
> (lock+unlock, enter+exit). The P4 is a 3GHz box, and the PIII is an
> 800MHz box. Note that the synchronization primitives requiring atomic
> operations are substantially pessimized on the P4 vs the PIII.
>
> A discussion with John Baldwin and Scott Long yesterday revealed that the
> UP spin mutex is currently pessimized from a critical section to a
> critical section plus mutex internals due to a need for mtx_owned() on
> spin locks. I'm not convinced that explains the entire performance
> irregularity I see for P4 spin mutexes on UP, however. Note that 39 (P4
> UP sleep mutex) + 120 (P4 UP critical section) is not 274 (P4 UP spin
> mutex) by a fair amount. Figuring out what's going on there would be a
> good idea, although it could well be a property of my measurement
> environment. I'm currently using this to do measurements:
>
> //depot/user/rwatson/percpu/sys/test/test_synch_timing.c
>
> In all of the below, the standard deviation is very small if you're
> careful about not bumping into hard clock or other interrupts during
> testing, especially when it comes to spin mutexes and critical sections.
>
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> robert@fledge.watson.org      Principal Research Scientist, McAfee Research
>
>          sleep mutex    crit section    spin mutex
>          UP     SMP     UP     SMP      UP     SMP
> PIII     21     90      83     81       112    141
> P4       39     260     120    119      274    342

Nice catch!

On UP, releasing a spin mutex involves an xchgl operation, while
releasing an uncontested sleep mutex uses cmpxchgl.
Since the xchgl does an implicit LOCK (and cmpxchgl does NOT), this could
explain why the spin mutex needs a lot more cycles.
This should be easy to fix since the xchgl is not needed on a UP system.
Right now I am sick and don't trust my own code, so I won't write a patch
for the next few days ... hopefully someone else can get to it first.

	Stephan
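
To illustrate the xchgl vs. cmpxchgl point above, here is a minimal
sketch in portable C11 atomics. It is not the FreeBSD mutex code: the
toy_lock structure and the toy_* function names are hypothetical and
exist only for this example. The assumption it relies on is that on x86
a compiler typically turns atomic_exchange into xchg (implicitly locked
when it has a memory operand) and a release store into an ordinary mov,
so the two release paths differ in exactly the way described.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical toy lock word; this is NOT FreeBSD's struct mtx. */
#define TOY_UNOWNED ((uintptr_t)0)

struct toy_lock {
	_Atomic uintptr_t owner;	/* TOY_UNOWNED or an owner token */
};

/*
 * Acquire with compare-and-swap: succeeds only if the lock is unowned.
 * Returns 1 on success, 0 if someone else already owns the lock.
 */
int
toy_try_acquire(struct toy_lock *l, uintptr_t self)
{
	uintptr_t expected = TOY_UNOWNED;

	return (atomic_compare_exchange_strong_explicit(&l->owner, &expected,
	    self, memory_order_acquire, memory_order_relaxed));
}

/*
 * Release with an exchange and return the previous owner token.
 * On x86 this typically compiles to xchg, which always carries an
 * implicit LOCK when it operates on memory.
 */
uintptr_t
toy_release_xchg(struct toy_lock *l)
{
	return (atomic_exchange_explicit(&l->owner, TOY_UNOWNED,
	    memory_order_release));
}

/*
 * Release with a plain release store. On x86 this typically compiles
 * to an ordinary mov, with no locked bus operation.
 */
void
toy_release_store(struct toy_lock *l, uintptr_t self)
{
	/* Sanity check: only the current owner may release. */
	assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == self);
	atomic_store_explicit(&l->owner, TOY_UNOWNED, memory_order_release);
}
```

Both release functions leave the lock word unowned; the only difference
is the instruction the compiler is likely to pick, which is where the
extra cycles in the UP spin mutex numbers would come from if the
explanation above holds.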