From owner-freebsd-current@FreeBSD.ORG  Wed May  5 06:32:24 2004
Date: Wed, 5 May 2004 23:32:18 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Gerrit Nagelhout
cc: freebsd-current@freebsd.org
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <20040505222636.H15444@gamplex.bde.org>

On Tue, 4 May 2004, Gerrit Nagelhout wrote:

> I ran the following fragment of code to determine the cost of a LOCK &
> UNLOCK on both UP and SMP:
>
> #define EM_LOCK(_sc)	mtx_lock(&(_sc)->mtx)
> #define EM_UNLOCK(_sc)	mtx_unlock(&(_sc)->mtx)
>
> unsigned int startTime, endTime, delta;
> startTime = rdtsc();
> for (i = 0; i < 100; i++)
> {
>	EM_LOCK(adapter);
>	EM_UNLOCK(adapter);
> }
> endTime = rdtsc();
> delta = endTime - startTime;
> printf("delta %u start %u end %u \n", (unsigned int)delta, startTime,
>	endTime);
>
> On a single hyperthreaded Xeon 2.8GHz, it took ~30 cycles per LOCK & UNLOCK
> (dividing by 100) under UP, and ~300 cycles for SMP.  Assuming 10
> locks for every packet (which is conservative), at 500Kpps this accounts
> for:
>
> 300 * 10 * 500000 = 1.5 billion cycles (out of 2.8 billion cycles)

300 cycles seems far too much.  I get the following times for slightly
simpler locking in userland:

%%%
#define _KERNEL
#include <machine/atomic.h>
...
	int slock;
...
	for (i = 0; i < 1000000; i++) {
		while (atomic_cmpset_acq_int(&slock, 0, 1) == 0)
			;
		atomic_store_rel_int(&slock, 0);
	}
%%%

Athlon XP2600 UP system:
	!SMP case: 22 cycles
	SMP case:  37 cycles

Celeron 366 SMP system:
	!SMP case: 35 cycles
	SMP case:  48 cycles

The extra cycles for the SMP case are just the extra cost of a single
locked instruction.  Note that SMP should cost twice as much extra, but
the non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl,
which always locks the bus.  After fixing this:

Athlon XP2600 UP system:
	!SMP case:  6 cycles
	SMP case:  37 cycles

Celeron 366 SMP system:
	!SMP case: 10 cycles
	SMP case:  48 cycles

Mutexes take longer than these simple locks, but not much longer unless
the lock is contested.  In particular, they don't lock the bus any more
than the simple locks do, and the extra cycles for locking the bus
dominate (even in the !SMP case, due to the pessimization).

So there seems to be something wrong with your benchmark.  Locking the
bus for the SMP case always costs about 20+ cycles, but this hasn't
changed since RELENG_4, and mutexes can't be made much faster in the
uncontested case since their overhead is dominated by the bus lock time.
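For anyone who wants to repeat the measurement, the loop above is easy
to wrap in a small stand-alone program with rdtsc() timing, much like
your fragment.  The following is only a sketch, not my exact test
program: the rdtsc() inline, the iteration count and the variable names
are illustrative, and it assumes the i386 <machine/atomic.h> of this
era, where compiling with and without -DSMP should select the locked
and unlocked variants of the atomic ops.

%%%
/*
 * Sketch of a user-land harness for the spin-lock loop above.
 * Compile once with -DSMP and once without to compare the two cases.
 */
#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>

#define _KERNEL
#include <machine/atomic.h>
#undef _KERNEL

#define	ITERATIONS	1000000

static volatile u_int slock;

/* Read the time-stamp counter (i386; edx:eax via the "=A" constraint). */
static __inline uint64_t
rdtsc(void)
{
	uint64_t tsc;

	__asm __volatile("rdtsc" : "=A" (tsc));
	return (tsc);
}

int
main(void)
{
	uint64_t start, end;
	int i;

	start = rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		while (atomic_cmpset_acq_int(&slock, 0, 1) == 0)
			;				/* spin until the lock is free */
		atomic_store_rel_int(&slock, 0);	/* release the lock */
	}
	end = rdtsc();
	printf("%llu cycles per lock/unlock pair\n",
	    (unsigned long long)((end - start) / ITERATIONS));
	return (0);
}
%%%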
-current is slower than RELENG_4, especially for networking, because it
does lots more locking and may contest locks more, and when it hits a
contested lock (and for some other operations) it does slow context
switches.  Your profile didn't seem to show much of the latter two, so
the problem for bridging may be that there is just too much fine-grained
locking.

The profile didn't seem quite right.  I was missing all the call counts
and times.  The times are not useful for short runs unless high
resolution profiling is used, but the call counts are.  Profiling has
been broken in -current since last November, so some garbage needs to be
ignored to interpret profiles.

Bruce