Date: Wed, 05 May 2004 19:54:53 -0600
From: Scott Long <scottl@freebsd.org>
To: Gerrit Nagelhout <gnagelhout@sandvine.com>
Cc: 'Andrew Gallatin' <gallatin@cs.duke.edu>
Subject: Re: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <40999AED.9080606@freebsd.org>
In-Reply-To: <FE045D4D9F7AED4CBFF1B3B813C85337021AB38C@mail.sandvine.com>

Gerrit Nagelhout wrote:
> Andrew Gallatin wrote:
>
>> Bruce Evans writes:
>>
>> >
>> > Athlon XP2600 UP system:  !SMP case: 22 cycles  SMP case: 37 cycles
>> > Celeron 366 SMP system:              35                   48
>> >
>> > The extra cycles for the SMP case are just the extra cost of a one lock
>> > instruction.  Note that SMP should cost twice as much extra, but the
>> > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
>> > which always locks the bus.  After fixing this:
>> >
>> > Athlon XP2600 UP system:  !SMP case:  6 cycles  SMP case: 37 cycles
>> > Celeron 366 SMP system:              10                   48
>> >
>> > Mutexes take longer than simple locks, but not much longer unless the
>> > lock is contested.  In particular, they don't lock the bus any more
>> > and the extra cycles for locking dominate (even in the !SMP case due
>> > to the pessimization).
>> >
>> > So there seems to be something wrong with your benchmark.  Locking the
>> > bus for the SMP case always costs about 20+ cycles, but this hasn't
>> > changed since RELENG_4 and mutexes can't be made much faster in the
>> > uncontested case since their overhead is dominated by the bus lock
>> > time.
>> >
>>
>> Actually, I think his tests are accurate and bus locked instructions
>> take an eternity on P4.  See
>> http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html
>>
>> For example, with your test above, I see 212 cycles for the UP case on
>> a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
>> simple slock = 0; reduces that count to 18 cycles.
>>
>> If its really safe to remove the xchg* from non-SMP atomic_store_rel*,
>> then I think you should do it.  Of course, that still leaves mutexes
>> as very expensive on SMP (253 cycles on the 2.53GHz from above).
>>
>> Drew
>>
>
>
> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?
>
> This might also explain why we are seeing much worse system call
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)

Int 0x80 system calls are known to be extremely expensive on a P4.  I
think that Jeff Roberson measured them as taking 300 cycles on average.
Some work was done on implementing the alternate sysenter/sysexit
method, but I don't think it was ever finished.  I think that it was
shown to have a modest speed improvement, but there was still a lot of
overhead that made it slow on a P4.  There are other optimizations that
can be done like having a shared page that lets you avoid calls like
getpid and gettimeofday, but it opens some security risks that have to
be dealt with.

Scott
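
The shared-page idea mentioned above can be illustrated with a small
user-space sketch. This is not FreeBSD code: struct shared_page,
shared_page_init(), getpid_trap(), getpid_shared() and calls_per_sec()
are hypothetical names, and the "page" is ordinary process memory filled
in once, standing in for a read-only page that a kernel would export.
The sketch only shows why reading a value from memory avoids the trap
cost, and reports throughput in calls/s like the table referred to in
the quoted text. It assumes a POSIX system where syscall(SYS_getpid) is
available.

```c
#define _GNU_SOURCE             /* for syscall() and clock_gettime() on glibc */
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for a kernel-exported, read-only shared page. */
struct shared_page {
    pid_t pid;
};

static struct shared_page page;

/* Fill the page once; a real kernel would do this at process creation. */
void shared_page_init(void)
{
    page.pid = (pid_t)syscall(SYS_getpid);
    assert(page.pid > 0);
}

/* Slow path: enters the kernel on every call. */
pid_t getpid_trap(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Fast path: a plain memory read, no kernel entry. */
pid_t getpid_shared(void)
{
    return page.pid;
}

static double now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Call fn() iters times and return the observed rate in calls/s. */
double calls_per_sec(pid_t (*fn)(void), long iters)
{
    volatile pid_t sink = 0;    /* keeps the loop from being optimized away */
    double start = now_sec();

    for (long i = 0; i < iters; i++)
        sink = fn();
    double elapsed = now_sec() - start;
    (void)sink;
    return elapsed > 0.0 ? (double)iters / elapsed : 0.0;
}
```

Calling shared_page_init() and then comparing
calls_per_sec(getpid_trap, n) with calls_per_sec(getpid_shared, n) gives
a rough feel for the trap overhead on a given CPU. A value cached this
way goes stale after fork(), which is one reason a real implementation
needs the kernel to maintain the page per process.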
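
The quoted point about releasing a lock with xchgl versus a plain store
can be tried from user space with C11 atomics. This is a sketch under
assumptions, not the benchmark used in the thread and not the kernel's
atomic_store_rel_int(): lock_acquire(), release_xchg(), release_store()
and ns_per_pair() are hypothetical names. On x86, compilers typically
emit xchg (implicitly bus-locked) for atomic_exchange and an ordinary
mov for a release store, so the two release functions approximate the
two variants being compared. Times are in nanoseconds rather than
cycles, to avoid relying on rdtsc.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime() */
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int slock;        /* 0 = free, 1 = held */

static double now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Acquire with a compare-and-swap (a locked instruction on x86). */
void lock_acquire(void)
{
    int expected = 0;

    while (!atomic_compare_exchange_weak_explicit(&slock, &expected, 1,
        memory_order_acquire, memory_order_relaxed))
        expected = 0;           /* CAS overwrote it with the current value */
}

/* Release with an exchange: typically xchg on x86, always locked. */
void release_xchg(void)
{
    (void)atomic_exchange_explicit(&slock, 0, memory_order_release);
}

/* Release with a plain release store: typically a simple mov on x86. */
void release_store(void)
{
    atomic_store_explicit(&slock, 0, memory_order_release);
}

/* Average time of one uncontested acquire/release pair, in nanoseconds. */
double ns_per_pair(void (*release)(void), long iters)
{
    double start = now_ns();

    for (long i = 0; i < iters; i++) {
        lock_acquire();
        release();
    }
    double elapsed = now_ns() - start;
    assert(atomic_load(&slock) == 0);   /* lock must end up free */
    return elapsed / (double)iters;
}
```

Comparing ns_per_pair(release_xchg, n) with ns_per_pair(release_store, n)
on a single thread measures only the uncontested case discussed in the
thread; the absolute numbers depend heavily on the CPU.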
