Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 06 Nov 2004 17:24:12 -0500
From:      Stephan Uphoff <ups@tree.com>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        cvs-all@FreeBSD.org
Subject:   Re: cvs commit: src/sys/i386/i386 pmap.c
Message-ID:  <1099779852.8097.68.camel@palm.tree.com>
In-Reply-To: <Pine.NEB.3.96L.1041105102349.90766E-100000@fledge.watson.org>
References:  <Pine.NEB.3.96L.1041105102349.90766E-100000@fledge.watson.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, 2004-11-05 at 07:01, Robert Watson wrote:
> On Fri, 29 Oct 2004, Mike Silbersack wrote:
> 
> > I think we really need some sort of light-weight critical_enter that
> > simply assures you that you won't get rescheduled to another CPU, but
> > gives no guarantees beyond that. 
> <snip> 
> > Er, wait - I guess I'm forgetting something, there exists the potential 
> > for the interrupt that preempted whatever was calling arc4random to also 
> > call arc4random, thereby breaking things...
> 
> I've been looking at related issues for the last couple of days and must
> have missed this thread while at EuroBSDCon.  Alan Cox pointed me at it,
> so here I am. :-)
> 
> Right now, the cost of acquiring and dropping an uncontended a sleep mutex
> on a UP kernel is very low -- about 21 cycles on my PIII and 40 on my P4,
> including some efficiency problems in my measurement which probably add a
> non-trivial overhead.  Compare this with the SMP versions on the PIII (90
> cycles) and P4 (260 cycles!).  Critical sections on the SMP PIII are about
> the same cost as the SMP mutex, but on the P4 a critical section is less
> than half the cost.  Getting to a model where critical sections were as
> cheap as UP sleep mutexes, or where we could use a similar combination of
> primitives (such as UP mutexes with pinning) would be very useful.
> Otherwise, optimizing through use of critical sections will improve SMP
> but potentially damage performance on UP.  There's been a fair amount of
> discussion of such approaches, including the implementation briefly
> present in the FreeBSD.  I know John Baldwin and Justin Gibbs both have
> theories and plans in this area.
> 
> If we do create a UP mutex primitive for use on SMP, I would suggest we
> actually expand the contents of the UP mutex structure slightly to include
> a cpu number that can be asserted, along with pinning, when an operation
> is attempted and INVARIANTS is present.  One of the great strengths of the
> mutex/lock model is a strong assertion capability, both for the purposes
> of documentation and testing, so we should make sure that carries into any
> new synchronization primitives.
> 
> Small table of synchronization primitives below; in each case, the count
> is in cycles and reflects the cost of acquiring and dropping the primitive
> (lock+unlock, enter+exit).  The P4 is a 3ghz box, and the PIII is an
> 800mhz box.  Note that the synchronization primitives requiring atomic
> operations are substantially pessimized on the P4 vs the PIII.
> 
> A discussion with John Baldwin and Scott Long yesterday revealed that the
> UP spin mutex is currently pessimized from a critical section to a
> critical section plus mutex internals due to a need for mtx_owned() on
> spin locks.  I'm not convinced that explains the entire performance
> irregularity I see for P4 spin mutexes on UP, however.  Note that 39 (P4
> UP sleep mutex) + 120 (P4 UP critical section) is not 274 (P4 UP spin
> mutex) by a fair amount.  Figuring out what's going on there would be a
> good idea, although it could well be a property of my measurement
> environment.  I'm currently using this to do measurements:
> 
>     //depot/user/rwatson/percpu/sys/test/test_synch_timing.c
> 
> In all of the below, the standard deviation is very small if you're
> careful about not bumping into hard clock or other interrupts during
> testing, especially when it comes to spin mutexes and critical sections. 
> 
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> robert@fledge.watson.org      Principal Research Scientist, McAfee Research
> 
>         sleep mutex     crit section    spin mutex
>         UP      SMP     UP      SMP     UP      SMP
> PIII    21      90      83      81      112     141
> P4      39      260     120     119     274     342

Nice catch!
On a UP releasing a spin mutex involves a xchgl operation while
releasing an uncontested sleep mutex uses cmpxchgl.
Since the xchgl does an implicit LOCK (and cmpxchgl does NOT) this could
explain why the spin mutex needs a lot more cycles.
This should be easy to fix since the xchgl is not needed on a UP system.
Right now I am sick and don't trust my own code so I won't write a patch
for the next few days ... hopefully someone else can get to it first.

	Stephan



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1099779852.8097.68.camel>