From: Barney Cordoba <barney_cordoba@yahoo.com>
Date: Tue, 7 Apr 2009 05:11:49 -0700 (PDT)
To: Ivan Voras, Robert Watson
Cc: freebsd-net@freebsd.org
Subject: Re: Advice on a multithreaded netisr patch?

--- On Mon, 4/6/09, Robert Watson wrote:

> From: Robert Watson
> Subject: Re: Advice on a multithreaded netisr patch?
> To: "Ivan Voras"
> Cc: freebsd-net@freebsd.org
> Date: Monday, April 6, 2009, 7:59 AM
>
> On Mon, 6 Apr 2009, Ivan Voras wrote:
>
> >>> I'd like to understand more. If (in netisr) I have a mbuf with headers, is this data already transferred from the card or is it magically "not here yet"?
> >>
> >> A lot depends on the details of the card and driver. The driver will take cache misses on the descriptor ring entry, if it's not already in cache, and the link layer will take a cache miss on the front of the ethernet frame in the cluster pointed to by the mbuf header as part of its demux. What happens next depends on your dispatch model and cache line size. Let's make a few simplifying assumptions that are mostly true:
> >
> > So, a mbuf can reference data not yet copied from the NIC hardware? I'm specifically trying to understand what m_pullup() does.
>
> I think we're talking slightly at cross purposes. There are two transfers of interest:
>
> (1) DMA of the packet data to main memory from the NIC
> (2) Servicing of CPU cache misses to access data in main memory
>
> By the time you receive an interrupt, the DMA is complete, so once you believe a packet referenced by the descriptor ring is done, you don't have to wait for DMA. However, the packet data is in main memory rather than your CPU cache, so you'll need to take a cache miss in order to retrieve it. You don't want to prefetch before you know the packet data is there, or you may prefetch stale data from the previous packet sent or received from the cluster.
>
> m_pullup() has to do with mbuf chain memory contiguity during packet processing. The usual usage is something along the following lines:
>
> struct whatever *w;
>
> m = m_pullup(m, sizeof(*w));
> if (m == NULL)
>         return;
> w = mtod(m, struct whatever *);
>
> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data are contiguously stored, so that w points at a complete structure we can use to interpret packet data. In the common case in the receive path, m_pullup() should be a no-op, since almost all drivers receive data in a single cluster.
>
> However, there are cases where it might not happen, such as loopback traffic where unusual encapsulation is used, leading to a call to M_PREPEND() that inserts a new mbuf on the front of the chain, which is later m_defrag()'d, leading to a higher-level header crossing a boundary or the like.
>
> This issue is almost entirely independent from things like the cache line miss issue, unless you hit the uncommon case of having to do work in m_pullup(), in which case life sucks.
>
> It would be useful to use DTrace to profile a number of the workfull m_foo() functions to make sure we're not hitting them in normal workloads, btw.
>
> >>> As the card and the OS can already process many packets per second for something as complex as routing (http://www.tancsa.com/blast.html), and TCP chokes swi:net at 100% of a core, isn't this an indication that there's certainly more space for improvement even with single-queue, old-fashioned NICs?
> >>
> >> Maybe. It depends on the relative costs of local processing vs redistributing the work, which involves schedulers, IPIs, additional cache misses, lock contention, and so on. This means there's a period where it can't possibly be a win, and then at some point it's a win as long as the stack scales. This is essentially the usual trade-off in using threads and parallelism: does the benefit of multiple parallel execution units make up for the overheads of synchronization and data migration?
> >
> > Do you have any idea at all why I'm seeing the weird difference between netstat packets per second (250,000) and my application's TCP performance (< 1,000 pps)? Summary: each packet is guaranteed to be a whole message causing a transaction in the application - without the changes I see pps almost identical to tps. Even if the source of netstat statistics somehow manages to count packets multiple times (I don't see how that can happen), no relation can describe differences this huge. It almost looks like something in the upper layers is discarding packets (also not likely: TCP timeouts would occur and the application wouldn't be able to push 250,000 pps) - but what? Where to look?
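
(To make the m_pullup() idiom quoted above a bit more concrete, here is a rough sketch of how that contiguity check usually looks in an input routine, using struct ip as the header type. The function name demo_input is made up for illustration; nothing below is code from the thread.)

/*
 * Sketch of the m_pullup() contiguity check as it commonly appears in
 * an input routine.  In the normal receive case the whole header is
 * already in the first mbuf, so m_pullup() is never actually called.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/ip.h>

static void
demo_input(struct mbuf *m)
{
        struct ip *ip;

        /* Make sure the first sizeof(struct ip) bytes are contiguous. */
        if (m->m_len < sizeof(struct ip) &&
            (m = m_pullup(m, sizeof(struct ip))) == NULL)
                return;                 /* m_pullup() freed the chain */
        ip = mtod(m, struct ip *);

        /* ip now points at a complete, contiguous IP header. */
        (void)ip;
        m_freem(m);
}

(When the m_len test succeeds, as it does for almost every received packet, m_pullup() is never reached at all, which is the "no-op in the common case" behaviour described above.)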

>
> Is this for the loopback workload? If so, remember that there may be some other things going on:
>
> - Every packet is processed at least two times: once when sent, and then again when it's received.
>
> - A TCP segment will need to be ACK'd, so if you're sending data in chunks in one direction, the ACKs will not be piggy-backed on existing data transfers, and instead be sent independently, hitting the network stack two more times.
>
> - Remember that TCP works to expand its window, and then maintains the highest performance it can by bumping up against the top of available bandwidth continuously. This involves detecting buffer limits by generating packets that can't be sent, adding to the packet count. With loopback traffic, the drop point occurs when you exceed the size of the netisr's queue for IP, so you might try bumping that from the default to something much larger.
>
> And nothing beats using tcpdump -- have you tried tcpdumping the loopback to see what is actually being sent? If not, that's always educational -- perhaps something weird is going on with delayed ACKs, etc.
>
> > You mean for the general code? I purposely don't lock my statistics variables because I'm not that interested in exact numbers (orders of magnitude are relevant). As far as I understand, unlocked "x++" should be trivially fast in this case?
>
> No. x++ is massively slow if executed in parallel across many cores on a variable in a single cache line. See my recent commit to kern_tc.c for an example: the updating of trivial statistics for the kernel time calls reduced 30m syscalls/second to 3m syscalls/second due to heavy contention on the cache line holding the statistic. One of my goals for 8.0 is to fix this problem for the IP and TCP layers, and ideally also ifnet, but we'll see. We should be maintaining those stats per-CPU and then aggregating them to report them to userspace. This is what we already do for a number of system stats -- UMA and kernel malloc, syscall and trap counters, etc.
>
> >> - Use cpuset to pin ithreads, the netisr, and whatever else, to specific cores so that they don't migrate, and if your system uses HTT, experiment with pinning the ithread and the netisr on different threads on the same core, or at least, different cores on the same die.
> >
> > I'm using em hardware; I still think there's a possibility I'm fighting the driver in some cases, but this has priority #2.
>
> Have you tried LOCK_PROFILING? It would quickly tell you if driver locks were a source of significant contention. It works quite well...

When I enabled LOCK_PROFILING, my separately built modules, such as if_igb, stopped working. It seems that the ifnet structure or something changed with that option enabled. Is there a way to keep these in sync without having to integrate everything into a specific kernel build?

Barney
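
P.S. For anyone who wants to experiment with the per-CPU statistics approach described above, here is a rough sketch of the general idea. The names demo_pcpu_stat, demo_stat_inc and demo_stat_fetch are made up for illustration and are not existing kernel interfaces; treat this as a sketch of the technique, not a drop-in patch.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/pcpu.h>

/*
 * One cache-line-padded counter per CPU, so that concurrent increments
 * on different CPUs never touch the same cache line.
 */
struct demo_pcpu_stat {
        uint64_t        packets;
} __aligned(CACHE_LINE_SIZE);

static struct demo_pcpu_stat demo_stats[MAXCPU];

static __inline void
demo_stat_inc(void)
{
        /* No lock and no shared cache line: each CPU updates its own slot. */
        demo_stats[curcpu].packets++;
}

static uint64_t
demo_stat_fetch(void)
{
        uint64_t total;
        int i;

        /* Aggregate lazily, e.g. from a sysctl handler when userland asks. */
        total = 0;
        for (i = 0; i < MAXCPU; i++)
                total += demo_stats[i].packets;
        return (total);
}

The increment never contends with other CPUs, so it stays cheap; the aggregation is racy, but for counters that only need to be accurate to an order of magnitude that is usually acceptable.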