Date: Thu, 6 May 2004 09:52:27 -0400
From: Don Bowman <don@sandvine.com>
To: 'Bruce Evans' <bde@zeta.org.au>, Andrew Gallatin <gallatin@cs.duke.edu>
Cc: Gerrit Nagelhout <gnagelhout@sandvine.com>
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <FE045D4D9F7AED4CBFF1B3B813C85337045D8CB5@mail.sandvine.com>

From: Bruce Evans [mailto:bde@zeta.org.au]
> On Wed, 5 May 2004, Andrew Gallatin wrote:
> ...
> >
> > Actually, I think his tests are accurate and bus locked instructions
> > take an eternity on P4.  See
> > http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html
> >
> > For example, with your test above, I see 212 cycles for the UP case on
> > a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
> > simple slock = 0; reduces that count to 18 cycles.
>
> This seems to be right, unfortunately.  I wonder if this has anything to
> do with freebsd.org having no P4 machines.
>
> > If its really safe to remove the xchg* from non-SMP atomic_store_rel*,
> > then I think you should do it.  Of course, that still leaves mutexes
> > as very expensive on SMP (253 cycles on the 2.53GHz from above).
>
> I forgot (again) that there are memory access ordering issues.  A lock
> may be needed to get everything synced.  See the comment before the i386
> versions in i386/include/atomic.h.  A single lock may be enough.  The
> best example I could think of easily is:

On the P4, there are mfence,lfence,sfence instructions to enforce
memory ordering. These are cheaper than "lock; andl" or "cpuid",
which are the traditional 'sync' instructions.
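
To make the three unlock variants discussed above concrete, here is a minimal userland sketch in C11. It is not the code from i386/include/atomic.h: the slock_* helper names and the FULL_FENCE macro are hypothetical, and the mapping of each call to a particular instruction (xchg, plain mov, mfence) is what common x86 compilers typically emit, not something the C standard guarantees. Which variant is sufficient for atomic_store_rel* semantics is exactly the open question in this thread; the sketch only shows what each one looks like.

```c
#include <stdatomic.h>

/* Assumption: on SSE2-capable x86 builds, use the mfence intrinsic;
 * elsewhere fall back to a portable sequentially consistent fence. */
#if defined(__SSE2__)
#include <emmintrin.h>
#define FULL_FENCE() _mm_mfence()
#else
#define FULL_FENCE() atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical stand-in for the "slock" word in the quoted test. */
static atomic_int slock;

/* Try to take the lock once; returns 1 on success, 0 if already held. */
static int slock_try_acquire(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &slock, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Variant 1: release via an atomic exchange. On x86 this is typically
 * an xchg, which is implicitly bus locked (the expensive case). */
static void slock_release_xchg(void)
{
    (void)atomic_exchange_explicit(&slock, 0, memory_order_seq_cst);
}

/* Variant 2: release via a release store. On x86 this is typically a
 * plain mov, comparable to the "slock = 0;" replacement. */
static void slock_release_store(void)
{
    atomic_store_explicit(&slock, 0, memory_order_release);
}

/* Variant 3: explicit full fence (mfence where available), then the
 * release store. */
static void slock_release_fenced(void)
{
    FULL_FENCE();
    atomic_store_explicit(&slock, 0, memory_order_release);
}

/* Report whether the lock word is currently nonzero. */
static int slock_is_held(void)
{
    return atomic_load_explicit(&slock, memory_order_relaxed) != 0;
}
```

All three release functions leave the lock word at zero; they differ only in the ordering guarantee requested and therefore in the instruction the compiler is likely to choose. Timing them (for example with a cycle counter around a tight acquire/release loop) would be the way to compare against the cycle counts quoted above.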