Date:      Tue, 9 Nov 2004 10:57:54 -0800
From:      Peter Wemm <peter@wemm.org>
To:        Stephan Uphoff <ups@tree.com>
Cc:        Julian Elischer <julian@elischer.org>
Subject:   Re: cvs commit: src/sys/i386/i386 pmap.c
Message-ID:  <200411091057.54867.peter@wemm.org>
In-Reply-To: <1100024464.29384.30.camel@palm.tree.com>
References:  <Pine.NEB.3.96L.1041109103037.73102S-100000@fledge.watson.org> <4191062A.6090009@elischer.org> <1100024464.29384.30.camel@palm.tree.com>

On Tuesday 09 November 2004 10:21 am, Stephan Uphoff wrote:
> On Tue, 2004-11-09 at 13:02, Julian Elischer wrote:
> > Robert Watson wrote:
> > >This change made a large difference, and eliminates the
> > > unexplained costs. Here's a revised table as compared to the
> > > above:
> > >
> > >       sleep mutex   crit section  spin mutex    new spin mutex
> > >       UP     SMP    UP     SMP    UP     SMP    UP     SMP
> > > PIII  21     81     83     81     112    141    95     141
> > > P4    39     260    120    119    274    342    132    231
> > >
> > >So it basically cut 140 cycles off the P4 UP spin lock, 15 off the
> > > PIII UP spin lock, and 110 cycles off the P4 SMP spin lock.  The
> > > PIII SMP spin lock looks the same.  Keep in mind that all of
> > > these measurements have a standard deviation of between 0 and 3
> > > cycles, most in the 1 range.  Also keep in mind that these are
> > > entirely uncontended measurements.
> > >
> > >Assuming that these changes are correct, and pass whatever tests
> > > people have in mind, this would be a very strong merge candidate
> > > for performance reasons.  The difference is visible in packet
> > > send tests from user space as a percentage or two improvement on
> > > UP on my P4, although it's a little hard to tell due to the noise.
> >
> > Can you explain why a spin mutex is more expensive than a sleep
> > mutex (I assume this is uncontended)?
>
> cli() and sti() used for the critical section are expensive.

... on INTEL cpus!  Don't make the mistake of assuming that all x86 cpus
are as slow as Intel's P4 family on this stuff.  Other cpus don't have
the same massive microcode penalty.  My recollection is that athlon
(and athlon64 cpus in 32-bit mode) take about 8-12 clocks to do a cli
or sti, compared to 300+ for a P4 cpu.  Likewise, an invlpg takes about
50-90 clocks on those cpus vs 1200-1600 clocks on a P4.

Please don't accidentally penalize those of us with cpus that were
designed for good all-round performance.  The P4 family was designed
for games and 3d graphics, not all-round performance.

(This isn't aimed at anybody in particular.  I just wanted to remind
people that the P4 core is a particularly pathological case (and the
writing is on the wall for that core).  Other cpus, including Intel's
newer non-P4 cores, don't have the same pathological problems.)

-- 
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5


