Date: Wed, 5 May 2004 17:49:34 -0400 (EDT)
From: Robert Watson <robert@fledge.watson.org>
To: Gerrit Nagelhout
Cc: freebsd-current@freebsd.org
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance

On Tue, 4 May 2004, Gerrit Nagelhout wrote:

> I ran the following fragment of code to determine the cost of a LOCK &
> UNLOCK on both UP and SMP:
>
> #define EM_LOCK(_sc)     mtx_lock(&(_sc)->mtx)
> #define EM_UNLOCK(_sc)   mtx_unlock(&(_sc)->mtx)
>
> unsigned int startTime, endTime, delta;
> startTime = rdtsc();
> for (i = 0; i < 100; i++)
> {
>         EM_LOCK(adapter);
>         EM_UNLOCK(adapter);
> }
> endTime = rdtsc();
> delta = endTime - startTime;
> printf("delta %u start %u end %u \n", (unsigned int)delta, startTime,
> endTime);
>
> On a single hyperthreaded Xeon 2.8GHz, it took ~30 cycles per
> LOCK/UNLOCK pair (dividing by 100) under UP, and ~300 cycles under SMP.
> Assuming 10 locks for every packet (which is conservative), at 500Kpps
> this accounts for: 300 * 10 * 500000 = 1.5 billion cycles (out of 2.8
> billion cycles).  Any comments?

One of the sets of changes I have in a local branch coalesces interface
unlock/lock operations.  Right now, if you look at the incoming packet
handling in interface code, it tends to read:

        struct mbuf *m;

        while (packets_ready(sc)) {
                m = read_packet(sc);
                XX_UNLOCK(sc);
                ifp->if_input(ifp, m);
                XX_LOCK(sc);
        }

I revised the structure for some testing as follows:

        struct mbuf *m, *mqueue, *mqueue_tail;

        mqueue = mqueue_tail = NULL;
        while (packets_ready(sc)) {
                m = read_packet(sc);
                if (mqueue != NULL) {
                        mqueue_tail->m_nextpkt = m;
                        mqueue_tail = m;
                } else
                        mqueue = mqueue_tail = m;
        }
        if (mqueue != NULL) {
                XX_UNLOCK(sc);
                while (mqueue != NULL) {
                        m = mqueue;
                        mqueue = mqueue->m_nextpkt;
                        m->m_nextpkt = NULL;
                        ifp->if_input(ifp, m);
                }
                XX_LOCK(sc);
        }

Obviously, if done properly, you'd want to bound the size of the temporary
queue, etc., but even in basic testing I wasn't able to measure an
improvement on the hardware I had on hand at the time.  However, I need to
re-run this in a post-netperf world and with 64-bit PCI and see if it does
now.
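For illustration, one way the bound might look (a sketch only, not the
actual patch from my branch): it reuses the placeholder XX_LOCK/XX_UNLOCK,
packets_ready(), and read_packet() names from the fragments above, assumes
the lock is held on entry as before, and XX_RXBATCH is an arbitrary cap
made up for this example:

        #define XX_RXBATCH      32      /* arbitrary cap per lock hold */

        struct mbuf *m, *mqueue, *mqueue_tail;
        int n;

        do {
                mqueue = mqueue_tail = NULL;
                n = 0;
                /* Dequeue at most XX_RXBATCH packets under the lock. */
                while (n < XX_RXBATCH && packets_ready(sc)) {
                        m = read_packet(sc);
                        if (mqueue != NULL) {
                                mqueue_tail->m_nextpkt = m;
                                mqueue_tail = m;
                        } else
                                mqueue = mqueue_tail = m;
                        n++;
                }
                if (mqueue != NULL) {
                        /* Drop the lock once per batch, not per packet. */
                        XX_UNLOCK(sc);
                        while (mqueue != NULL) {
                                m = mqueue;
                                mqueue = mqueue->m_nextpkt;
                                m->m_nextpkt = NULL;
                                ifp->if_input(ifp, m);
                        }
                        XX_LOCK(sc);
                }
                /* A short batch means the ring has drained. */
        } while (n == XX_RXBATCH);

Packets from a given interface are still handed to if_input() in the order
they were dequeued, which matters for the reordering concern below.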
One important thing in this process, though, is to avoid reordering of
packets -- they need to remain serialized by source interface.  Doing it
at this queue is easy, but if we start passing chains of packets into
other pieces, we'll need to be careful where multiple queues get involved,
etc.  Even simple and relatively infrequent packet reordering can cause
TCP to get pretty unhappy.

The fact that the above didn't help performance suggests two things:
first, that my testbed has other bottlenecks, such as PCI bus bandwidth,
and second, that the primary cost currently involved isn't from these
mutexes.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research