Date:      Wed, 05 May 2004 19:54:53 -0600
From:      Scott Long <scottl@freebsd.org>
To:        Gerrit Nagelhout <gnagelhout@sandvine.com>
Cc:        'Andrew Gallatin' <gallatin@cs.duke.edu>
Subject:   Re: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID:  <40999AED.9080606@freebsd.org>
In-Reply-To: <FE045D4D9F7AED4CBFF1B3B813C85337021AB38C@mail.sandvine.com>


Gerrit Nagelhout wrote:
> Andrew Gallatin wrote:
> 
>>Bruce Evans writes:
>>
>> > 
>> > Athlon XP2600 UP system:  !SMP case: 22 cycles   SMP case: 37 cycles
>> > Celeron 366 SMP system:              35                    48
>> > 
>> > The extra cycles for the SMP case are just the extra cost of one
>> > lock instruction.  Note that SMP should cost twice as much extra,
>> > but the non-SMP atomic_store_rel_int(&slock, 0) is pessimized by
>> > using xchgl, which always locks the bus.  After fixing this:
>> > 
>> > Athlon XP2600 UP system:  !SMP case:  6 cycles   SMP case: 37 cycles
>> > Celeron 366 SMP system:              10                    48
>> > 
>> > Mutexes take longer than simple locks, but not much longer unless
>> > the lock is contested.  In particular, they don't lock the bus any
>> > more, and the extra cycles for locking dominate (even in the !SMP
>> > case, due to the pessimization).
>> > 
>> > So there seems to be something wrong with your benchmark.  Locking
>> > the bus for the SMP case always costs about 20+ cycles, but this
>> > hasn't changed since RELENG_4, and mutexes can't be made much faster
>> > in the uncontested case since their overhead is dominated by the bus
>> > lock time.
>> > 
>>
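
(To make the distinction above concrete, here is a minimal sketch of a
release store with and without the bus lock.  This is not the actual
code in <machine/atomic.h>; the function names are made up and it is
only meant to illustrate why the xchgl form costs so much more.)

/*
 * Illustrative sketch only, not the real atomic.h.  On a uniprocessor
 * a release store just needs the compiler not to reorder earlier
 * accesses past it, so a plain store is enough.  The xchgl form
 * implicitly asserts LOCK#, which is where the extra cycles go.
 */
static __inline__ void
demo_store_rel_int_up(volatile unsigned int *p, unsigned int v)
{
        __asm__ __volatile__("" : : : "memory");  /* compiler barrier only */
        *p = v;                                   /* plain store, no bus lock */
}

static __inline__ void
demo_store_rel_int_smp(volatile unsigned int *p, unsigned int v)
{
        /* xchgl with a memory operand always locks the bus. */
        __asm__ __volatile__("xchgl %1,%0"
            : "+m" (*p), "+r" (v) : : "memory");
}
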
>>Actually, I think his tests are accurate and bus locked instructions
>>take an eternity on P4.  See
>>http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html 
>>
>>For example, with your test above, I see 212 cycles for the UP case on
>>a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
>>simple slock = 0; reduces that count to 18 cycles.
>>
>>If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
>>then I think you should do it.  Of course, that still leaves mutexes
>>as very expensive on SMP (253 cycles on the 2.53GHz from above).
>>
>>Drew
>>
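
(For anyone who wants to reproduce numbers like the 212 vs. 18 cycles
above, something along these lines is the usual approach.  This is my
own rough sketch, not the test Bruce or Drew actually ran, and the
rdtsc results will vary with the CPU and with how much the compiler
reorders or unrolls the loops.)

/*
 * Rough cycle-count sketch for a plain vs. bus-locked store
 * (i386/amd64, GCC-style inline asm).
 */
#include <stdio.h>

static volatile unsigned int slock;

static unsigned long long
rdtsc(void)
{
        unsigned int lo, hi;

        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return (((unsigned long long)hi << 32) | lo);
}

int
main(void)
{
        unsigned long long t0, t1;
        unsigned int v = 0;
        int i, n = 1000000;

        t0 = rdtsc();
        for (i = 0; i < n; i++)
                slock = 0;                      /* plain store */
        t1 = rdtsc();
        printf("plain store:  %llu cycles/iter\n", (t1 - t0) / n);

        t0 = rdtsc();
        for (i = 0; i < n; i++)                 /* bus-locked store */
                __asm__ __volatile__("xchgl %1,%0"
                    : "+m" (slock), "+r" (v) : : "memory");
        t1 = rdtsc();
        printf("locked xchgl: %llu cycles/iter\n", (t1 - t0) / n);

        return (0);
}
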
> 
> 
> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?
> This might also explain why we are seeing much worse system call 
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)
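
(The calls/s figures Gerrit refers to are typically gathered with a
tight loop over a cheap system call.  The following is only a generic
sketch of such a test, not the actual program behind his numbers;
getpid() stands in as a near-null syscall here.)

/*
 * Generic calls/s microbenchmark sketch.  On systems where libc
 * caches getpid(), substitute some other cheap system call.
 */
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        struct timeval start, end;
        double elapsed;
        int i, n = 1000000;

        gettimeofday(&start, NULL);
        for (i = 0; i < n; i++)
                (void)getpid();         /* one system call per iteration */
        gettimeofday(&end, NULL);

        elapsed = (end.tv_sec - start.tv_sec) +
            (end.tv_usec - start.tv_usec) / 1e6;
        printf("%.0f calls/s\n", n / elapsed);
        return (0);
}
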

Int 0x80 system calls are known to be extremely expensive on a P4.  I
think that Jeff Roberson measured them as taking 300 cycles on average.
Some work was done on implementing the alternate sysenter/sysexit
method, but I don't think it was ever finished.  I think that it was
shown to have a modest speed improvement, but there was still a lot of
overhead that made it slow on a P4.  There are other optimizations that
can be done, like having a shared page that lets you avoid calls like
getpid and gettimeofday, but that opens up some security risks that have
to be dealt with.
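
To sketch what that shared-page idea could look like (this is purely
hypothetical; the structure, address, and update protocol below are
invented for illustration and are not a FreeBSD interface): the kernel
maps a read-only page into every process and keeps a timestamp in it,
so a gettimeofday-style query becomes a couple of memory reads instead
of an int 0x80 trap.  The catch, as noted above, is that whatever lives
in that page is visible to every process and has to be safe to expose.

/*
 * Hypothetical illustration of the "shared page" optimization.
 * Everything here (struct layout, mapping address, update protocol)
 * is invented for the example.  The kernel would write the page;
 * userland would only ever read it.
 */
#include <stdint.h>

struct shared_info {
        volatile uint32_t       gen;            /* odd while being updated */
        volatile long           tv_sec;
        volatile long           tv_usec;
};

/* Assume the kernel mapped this read-only at a known address. */
static const struct shared_info *shared =
    (const struct shared_info *)0xbfc00000;

/*
 * Read a consistent snapshot without entering the kernel: retry if the
 * generation counter was odd or changed while we were reading.
 */
static void
fast_gettimeofday(long *sec, long *usec)
{
        uint32_t g;

        do {
                g = shared->gen;
                *sec = shared->tv_sec;
                *usec = shared->tv_usec;
        } while ((g & 1) || g != shared->gen);
}

Userland would then call something like fast_gettimeofday() where it
previously trapped into the kernel; a pid field could be handled the
same way, or simply cached since it never changes for the life of the
process.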

Scott


