Date: Wed, 5 May 2004 17:49:34 -0400 (EDT)
From: Robert Watson <robert@fledge.watson.org>
To: Gerrit Nagelhout
Cc: freebsd-current@freebsd.org
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance

On Tue, 4 May 2004, Gerrit Nagelhout wrote:

> I ran the following fragment of code to determine the cost of a LOCK &
> UNLOCK on both UP and SMP:
>
> #define EM_LOCK(_sc)     mtx_lock(&(_sc)->mtx)
> #define EM_UNLOCK(_sc)   mtx_unlock(&(_sc)->mtx)
>
> unsigned int startTime, endTime, delta;
> startTime = rdtsc();
> for (i = 0; i < 100; i++)
> {
>         EM_LOCK(adapter);
>         EM_UNLOCK(adapter);
> }
> endTime = rdtsc();
> delta = endTime - startTime;
> printf("delta %u start %u end %u \n", (unsigned int)delta, startTime,
> endTime);
>
> On a single hyperthreaded Xeon 2.8GHz, it took ~30 cycles per
> LOCK/UNLOCK pair (dividing by 100) under UP, and ~300 cycles under SMP.
> Assuming 10 locks for every packet (which is conservative), at 500Kpps
> this accounts for: 300 * 10 * 500000 = 1.5 billion cycles (out of 2.8
> billion cycles).  Any comments?

One of the sets of changes I have in a local branch coalesces interface
unlock/lock operations.  Right now, if you look at the incoming packet
handling in interface code, it tends to read:

        struct mbuf *m;

        while (packets_ready(sc)) {
                m = read_packet(sc);
                XX_UNLOCK(sc);
                ifp->if_input(ifp, m);
                XX_LOCK(sc);
        }

I revised the structure for some testing as follows:

        struct mbuf *m, *mqueue, *mqueue_tail;

        mqueue = mqueue_tail = NULL;
        while (packets_ready(sc)) {
                m = read_packet(sc);
                if (mqueue != NULL) {
                        mqueue_tail->m_nextpkt = m;
                        mqueue_tail = m;
                } else
                        mqueue = mqueue_tail = m;
        }
        if (mqueue != NULL) {
                XX_UNLOCK(sc);
                while (mqueue != NULL) {
                        m = mqueue;
                        mqueue = mqueue->m_nextpkt;
                        m->m_nextpkt = NULL;
                        ifp->if_input(ifp, m);
                }
                XX_LOCK(sc);
        }

Obviously, if done properly, you'd want to bound the size of the temporary
queue, etc., but even in basic testing I wasn't able to measure an
improvement on the hardware I had on hand at the time.  However, I need to
re-run this in a post-netperf world and with 64-bit PCI and see if it does
now.
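For illustration, one way the bound might look (a sketch only, not the
actual patch from my branch): it reuses the placeholder XX_LOCK/XX_UNLOCK,
packets_ready(), and read_packet() names from the fragments above, assumes
the lock is held on entry as before, and XX_RXBATCH is an arbitrary cap
made up for this example:

        #define XX_RXBATCH      32      /* arbitrary cap per lock hold */

        struct mbuf *m, *mqueue, *mqueue_tail;
        int n;

        do {
                mqueue = mqueue_tail = NULL;
                n = 0;
                /* Dequeue at most XX_RXBATCH packets under the lock. */
                while (n < XX_RXBATCH && packets_ready(sc)) {
                        m = read_packet(sc);
                        if (mqueue != NULL) {
                                mqueue_tail->m_nextpkt = m;
                                mqueue_tail = m;
                        } else
                                mqueue = mqueue_tail = m;
                        n++;
                }
                if (mqueue != NULL) {
                        /* Drop the lock once per batch, not per packet. */
                        XX_UNLOCK(sc);
                        while (mqueue != NULL) {
                                m = mqueue;
                                mqueue = mqueue->m_nextpkt;
                                m->m_nextpkt = NULL;
                                ifp->if_input(ifp, m);
                        }
                        XX_LOCK(sc);
                }
                /* A short batch means the ring has drained. */
        } while (n == XX_RXBATCH);

Packets from a given interface are still handed to if_input() in the order
they were dequeued, which matters for the reordering concern below.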
One important thing in this process, though, is to avoid reordering of
packets -- they need to remain serialized by source interface.  Doing it
at this queue is easy, but if we start passing chains of packets into
other pieces, we'll need to be careful where multiple queues get involved,
etc.  Even simple and relatively infrequent packet reordering can cause
TCP to get pretty unhappy.

The fact that the above didn't help performance suggests two things:
first, that my testbed has other bottlenecks, such as PCI bus bandwidth,
and second, that the primary cost currently involved isn't from these
mutexes.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research