From: Barney Cordoba <barney_cordoba@yahoo.com>
To: Ivan Voras, Robert Watson
Cc: freebsd-net@freebsd.org
Date: Mon, 6 Apr 2009 03:37:19 -0700 (PDT)
Subject: Re: Advice on a multithreaded netisr patch?

--- On Sun, 4/5/09, Robert Watson wrote:

> On Sun, 5 Apr 2009, Ivan Voras wrote:
>
> > > The argument is not that they are slower (although they probably
> > > are a bit slower), rather that they introduce serialization
> > > bottlenecks by requiring synchronization between CPUs in order
> > > to distribute the work.  Certainly some of the scalability
> > > issues in the stack are not a result of that, but a good number
> > > are.
> >
> > I'd like to understand more.  If (in netisr) I have an mbuf with
> > headers, is this data already transferred from the card, or is it
> > magically "not here yet"?
>
> A lot depends on the details of the card and driver.  The driver
> will take cache misses on the descriptor ring entry, if it's not
> already in cache, and the link layer will take a cache miss on the
> front of the ethernet frame in the cluster pointed to by the mbuf
> header as part of its demux.
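To make the demux step concrete, here is a minimal sketch of a
link-layer demux that picks a protocol handler by reading only the
ethernet header at the front of the cluster -- the single cache line
described above.  This is not FreeBSD's actual ether_input(); the
structure and names (my_ether_header, demux_frame) are illustrative
only:

/*
 * Illustrative only: everything needed to choose a protocol handler
 * sits in the first 14 bytes of the frame, so this step costs at
 * most one cache miss on the cluster.
 */
#include <stdint.h>
#include <arpa/inet.h>			/* ntohs() */

#define MY_ETHERTYPE_IP	0x0800		/* IPv4, standard encapsulation */

struct my_ether_header {
	uint8_t		dhost[6];	/* destination MAC */
	uint8_t		shost[6];	/* source MAC */
	uint16_t	type;		/* ethertype: bytes 12-13 */
};

/* Returns 1 for IPv4 frames, 0 for anything else. */
static int
demux_frame(const void *cluster)
{
	const struct my_ether_header *eh = cluster;

	/* The only packet-data read taken in this step. */
	return (ntohs(eh->type) == MY_ETHERTYPE_IP);
}

int
main(void)
{
	/* Frame front with ethertype 0x0800 at bytes 12-13. */
	uint8_t frame[64] = { [12] = 0x08, [13] = 0x00 };

	return (demux_frame(frame) ? 0 : 1);
}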
> What happens next depends on your dispatch model and cache line
> size.  Let's make a few simplifying assumptions that are mostly
> true:
>
> - The driver associates a single cluster with each receive ring
>   entry for each packet to be stored in, and the cluster is
>   cacheline-aligned.  No header splitting is enabled.
>
> - Standard ethernet encapsulation of IP is used, without additional
>   VLAN headers or other encapsulation, etc.  There are no IP
>   options.
>
> - We don't need to validate any checksums because the hardware has
>   done it for us, so there's no need to take cache misses on data
>   that doesn't matter until we reach higher layers.
>
> In the device driver/ithread code, we'll now proceed to take some
> cache misses, unless we're pretty lucky:
>
> (1) The descriptor ring entry
> (2) The mbuf packet header
> (3) The first cache line in the cluster
>
> This is sufficient to figure out what protocol we're going to
> dispatch to, and depending on dispatch model, we now either enqueue
> the packet for delivery to a netisr, or we directly dispatch the
> handler for IP.
>
> If the packet is processed on the current CPU and we're direct
> dispatching, or if we've dispatched to a netisr on the same CPU and
> we're quite lucky, the mbuf packet header and front of the cluster
> will be in the cache.
>
> However, what happens next depends on the cache fetch and line
> size.  If things happen in 32-byte cache lines or smaller, we cache
> miss on the end of the IP header, because the last two bytes of the
> destination IP address start at offset 32 into the cluster.  If we
> have 64-byte fetching and line size, things go better, because both
> the full IP and TCP headers should be in that first cache line.
>
> One big advantage to direct dispatch is that it maximizes the
> chances that we don't blow out the low-level CPU caches between
> link-layer and IP-layer processing, meaning that we might actually
> get through all the IP and TCP headers without a cache miss on a
> 64-byte line size.  If we netisr dispatch to another CPU without a
> shared cache, or we netisr dispatch to the current CPU but there's
> a scheduling delay, other packets queued first, etc., we'll take a
> number of the same cache misses over again as things get pulled
> into the right cache.
>
> This presents a strong cache motivation to keep a packet "on" a
> CPU, and even in the same thread, once you've started processing
> it.  If you have to enqueue, you take locks, take a context switch,
> deal with the fact that LRU on cache lines isn't going to like your
> queue depth, and potentially pay a number of additional cache
> misses on the same data.  There are also some other good reasons to
> use direct dispatch, such as avoiding doing work on packets that
> will later be dropped if the netisr queue overflows.
>
> This is why we direct dispatch by default, and why this is quite a
> good strategy for multiple input queue network cards, where it also
> buys us parallelism.
>
> Note that if the flow RSS hash is in the same cache line as the
> rest of the receive descriptor ring entry, you may be able to avoid
> the cache miss on the cluster and simply redirect the packet to
> another CPU's netisr without ever reading packet data, which avoids
> at least one and possibly two cache misses, but it also means that
> you have to run the link layer in the remote netisr, rather than
> locally in the ithread.
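The 32- versus 64-byte line arithmetic above is easy to check.
Under the stated assumptions (cacheline-aligned cluster, no VLAN
tag, no IP options), the destination IP address occupies bytes 30-33
of the frame and so straddles a 32-byte boundary, while the
ethernet, IP, and TCP headers together end at byte 54 and fit in a
single 64-byte line.  A small userland program to verify the
offsets:

/* Worked offsets behind the cache-line discussion; field positions
 * follow the standard ethernet/IPv4/TCP header layouts under the
 * assumptions stated above. */
#include <stdio.h>

int
main(void)
{
	const int ether_len  = 14;		    /* ethernet header */
	const int ip_dst_off = ether_len + 16;	    /* dst IP: byte 30 */
	const int ip_dst_end = ip_dst_off + 4;	    /* ...through byte 33 */
	const int hdrs_end   = ether_len + 20 + 20; /* IP + TCP: byte 54 */

	printf("dst IP: bytes %d-%d (%s the 32-byte line boundary)\n",
	    ip_dst_off, ip_dst_end - 1,
	    ip_dst_off < 32 && ip_dst_end > 32 ? "straddles" : "inside");
	printf("ether+IP+TCP headers end at byte %d (%s one 64-byte line)\n",
	    hdrs_end, hdrs_end <= 64 ? "fit in" : "exceed");
	return (0);
}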
> > In the first case, the packet reception code path is not changed
> > until it's queued on a thread, where it's handled in the future
> > (or is the influence of "other" data, like timers and internal
> > TCP reassembly buffers, so large?).  In the second case, why?
>
> The good news about TCP reassembly is that we don't have to look at
> the data, only mbuf headers and reassembly buffer entries, so with
> any luck we've avoided actually taking a cache miss on the data.
> If things go well, we can avoid looking at anything but mbuf and
> packet headers until the socket copies out, but I'm not sure how
> well we do that in practice.
>
> > As the card and the OS can already process many packets per
> > second for something as fairly complex as routing
> > (http://www.tancsa.com/blast.html), and TCP chokes swi:net at
> > 100% of a core, isn't this an indication that there's certainly
> > more space for improvement, even with single-queue, old-fashioned
> > NICs?
>
> Maybe.  It depends on the relative costs of local processing vs.
> redistributing the work, which involves schedulers, IPIs,
> additional cache misses, lock contention, and so on.  This means
> there's a period where it can't possibly be a win, and then at some
> point it's a win as long as the stack scales.  This is essentially
> the usual trade-off in using threads and parallelism: does the
> benefit of multiple parallel execution units make up for the
> overheads of synchronization and data migration?
>
> There are some previous e-mail threads where people have observed
> that for some workloads, switching to netisr wins over direct
> dispatch.  For example, if you have a number of cores and are doing
> firewall processing, offloading work to the netisr from the input
> ithread may improve performance.  However, this appears not to be
> the common case for end-host workloads on the hardware we mostly
> target, and this is increasingly true as multiple input queues come
> into play, as the card itself will allow us to use multiple CPUs
> without any interactions between the CPUs.
>
> This isn't to say that work redistribution using a netisr-like
> scheme isn't a good idea: in a world where CPU threads are weak
> compared to the wire workload, and there's cache locality across
> threads on the same core, or NUMA is present, there may be a
> potential for a big win when the available work significantly
> exceeds what a single CPU thread/core can handle.  In that case, we
> want to place the work as close as possible, to take advantage of
> shared caches or of the memory being local to the CPU thread/core
> doing the deferred work.
>
> FYI, the localhost case is a bit weird -- I think we have some
> scheduling issues that are causing loopback netisr stuff to be
> pessimally scheduled.  Here are some suggestions for things to try,
> though, to see if they help:
>
> - Comment out all ifnet, IP, and TCP global statistics in your
>   local stack -- especially look for things like
>   tcpstat.whatever++;.
>
> - Use cpuset to pin ithreads, the netisr, and whatever else to
>   specific cores so that they don't migrate, and if your system
>   uses HTT, experiment with pinning the ithread and the netisr on
>   different threads on the same core, or at least on different
>   cores on the same die.
>
> - Experiment with using just the source IP, the source +
>   destination IPs, and both IPs plus TCP ports in your hash.
>
> - If your card supports RSS, pass the flowid up the stack in the
>   mbuf packet header flowid field, and use that instead of the
>   hash for work placement.
>
> - If you're doing pure PPS tests with UDP (or the like), and your
>   test can tolerate disordering, try hashing based on the mbuf
>   header address or something else that will distribute the work
>   but not take a cache miss.
>
> - If you have a flowid, or the above disordered condition applies,
>   try shifting the link-layer dispatch to the netisr, rather than
>   doing the demux in the ithread, as that will avoid cache misses
>   in the ithread and do all the demux in the netisr.
>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
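On the hashing suggestion above, here is a sketch of the three
levels -- source IP only, both IPs, and both IPs plus TCP ports --
with an illustrative mixing step.  flow_hash() and mix32() are
made-up names, not kernel interfaces:

/*
 * Sketch of the three hashing levels: 0 = source IP only,
 * 1 = source + destination IPs, 2 = both IPs plus TCP ports.
 * The mixing step is illustrative, not the kernel's.
 */
#include <stdint.h>
#include <stdio.h>

static uint32_t
mix32(uint32_t h)
{
	/* Cheap avalanche so nearby addresses spread across threads. */
	h ^= h >> 16;
	h *= 0x85ebca6bU;
	h ^= h >> 13;
	return (h);
}

static uint32_t
flow_hash(uint32_t src_ip, uint32_t dst_ip, uint16_t sport,
    uint16_t dport, int level)
{
	uint32_t h = src_ip;				/* level 0 */

	if (level >= 1)
		h ^= dst_ip;				/* level 1 */
	if (level >= 2)
		h ^= ((uint32_t)sport << 16) | dport;	/* level 2 */
	return (mix32(h));
}

int
main(void)
{
	/* Example: 10.0.0.1:12345 -> 10.0.0.2:80, four netisr threads. */
	uint32_t h = flow_hash(0x0a000001, 0x0a000002, 12345, 80, 2);

	printf("placed on netisr thread %u\n", h % 4);
	return (0);
}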
Is there a way to give a kernel thread exclusive use of a core?  I
know you can pin a kernel thread with sched_bind(), but is there a
way to keep other threads from using that core?  On an 8-core system
it almost seems that the randomness of more cores is a negative in
some situations.

Also, I've noticed that calling sched_bind() during bootup is a bad
thing, in that it locks up the system.  I'm not certain, but I
suspect the thread_lock is the culprit.  Is there a clean way to
determine that it's safe to lock curthread and do a cpu bind?
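For concreteness, the pattern in question looks something like the
sketch below -- assuming a check of smp_started is a reasonable
guard, which is the part I'm unsure about.  bind_curthread_to() is a
made-up wrapper; sched_bind(), thread_lock(), and smp_started are
the existing kernel interfaces:

/*
 * Sketch: pin the current kernel thread to a CPU, refusing to bind
 * until the other CPUs have been launched, since binding too early
 * in boot can wedge the system.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/sched.h>
#include <sys/smp.h>

static int
bind_curthread_to(int cpu)
{
	struct thread *td = curthread;

	/* Don't bind before the APs are up and the scheduler is ready. */
	if (!smp_started)
		return (EAGAIN);

	thread_lock(td);
	sched_bind(td, cpu);	/* runs only on 'cpu' until sched_unbind() */
	thread_unlock(td);
	return (0);
}

Note that this only pins the thread to the core; it does nothing to
keep other threads off of it.

Barney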