From: Ivan Voras
To: freebsd-net@freebsd.org
Cc: freebsd-hackers@freebsd.org
Date: Sat, 04 Oct 2008 03:57:50 +0200
Subject: Network IO & scheduling problem? (was: Optimizing for high PPS, Intel NICs)

I experimented some more with my problem, and the results would be pretty incredible if it weren't for the fact that I can reliably reproduce them. Excuse me if this description is overly verbose; I can't decide which information might be important.
First, here's some more background on the problem: the application (not created for the purpose of being a benchmark) accepts TCP connections and assigns each of them to one of a set of predefined connection groups, configured at server startup. Each connection group is polled for network IO events by its own thread; there are no overlaps between the groups. The polling can be done with either kqueue() or poll(). The client is a stress-test application that creates 40+ parallel, long-lived TCP connections and tries to saturate the server with queries (so, for example, with 40 client connections and 4 connection groups on the server, each kqueue or poll list has 10 entries). For testing purposes the server doesn't do any actual work, so the emphasis is on network IO.

The server hardware is 2x quad-core Xeon 5405, 8 cores total, running FreeBSD 8-CURRENT amd64 (debugging options turned off). The client system doesn't really matter; I've tested with many systems, including desktops and laptops with different NICs.

The problem is:

a) When IO polling on the server is done with kqueues, one kqueue per thread / connection group, I can create up to 3 threads / connection groups without any problems. When I create 4 threads, suddenly the em1 taskq thread starts eating 100% CPU. With 3 or fewer threads, the em taskq spends less than 1% CPU time. At this point I can push 150,000 packets in each direction.

b) When polling with poll(), I can create up to 4 server threads without saturating the em taskq, but at 4 threads it starts to spend high, random amounts of CPU time, from 30% to 80%. At 5 or more threads it's pinned at 100%. With 4 threads I can push 170,000 packets per direction. With 3 or fewer threads the em taskq spikes in CPU usage right at the start, when the clients connect, and then drops to < 1% CPU time.

c) The effect seems much less pronounced on a 4-core machine.
I don't have the machine at hand now, but previous tests showed the em taskq at 10% with 5 threads and kqueue polling.

Some things I tried: disabling TSO doesn't help, disabling PREEMPTION doesn't help, it's not an interrupt storm, and the taskq thread doesn't seem to jump across CPU cores. BUT the number of context switches rises sharply, from ~12,000 with 3 threads to ~65,000 with 4 threads to ~220,000 with 5 threads. The interrupt rate varies between 1000 and 3000 (interrupt moderation by the NIC?).

I'm looking for ideas that can explain all this, and also for guidance on how to instrument the kernel to find out what is happening here. Fixes would also be welcome :)