From: Ivan Voras
To: freebsd-net@freebsd.org
Cc: freebsd-hackers@freebsd.org
Date: Sat, 04 Oct 2008 03:57:50 +0200
Subject: Network IO & scheduling problem? (was: Optimizing for high PPS, Intel NICs)

I experimented some more with my problem, and the results would be pretty incredible if it weren't for the fact that I can reliably reproduce them. Excuse me if this description is overly verbose; I can't decide which information might be important.
First, here's some more background on the problem: the application (not created for the purpose of being a benchmark) accepts TCP connections and assigns each of them to one of a set of predefined connection groups, configured at server startup. Each connection group is polled for network IO events by its own thread; there are no overlaps between the groups. The polling can be done with either kqueue() or poll(). The client is a stress-test application that creates 40+ parallel, long-lived TCP connections and tries to saturate the server with queries (so, for example, with 40 client connections and 4 connection groups on the server, each kqueue or poll list has 10 entries). For testing purposes the server doesn't do any actual work, so the emphasis is on network IO.

The server hardware is 2x quad-core Xeon 5405, 8 cores total, running FreeBSD 8-CURRENT amd64 (debugging options turned off). The client system doesn't really matter; I've tested with many systems, including desktops and laptops with different NICs.

The problem is:

a) When IO polling on the server is done with kqueues, one kqueue per thread / connection group, I can create up to 3 threads / connection groups without any problems. When I create 4 threads, suddenly the em1 taskq thread starts eating 100% CPU. With 3 or fewer threads, the em taskq spends less than 1% CPU time. At this point I can push 150,000 packets in each direction.

b) When polling with poll(), I can create up to 4 server threads without saturating the em taskq, but at 4 threads it starts to spend high, random amounts of CPU time, from 30% to 80%. At 5 or more threads it's pinned at 100%. With 4 threads I can push 170,000 packets per direction. With 3 or fewer threads the em taskq spikes in CPU usage right at the start, when the clients connect, and then drops to < 1% CPU time.

c) The effect seems much less pronounced on a 4-core machine.
I don't have the machine at hand now, but previous tests showed the em taskq at 10% with 5 threads and kqueue polling.

Some things I tried: disabling TSO doesn't help, disabling PREEMPTION doesn't help, it's not an interrupt storm, and the taskq thread doesn't seem to jump across CPU cores. BUT the number of context switches rises sharply, from ~12,000 with 3 threads to ~65,000 with 4 threads to ~220,000 with 5 threads. The interrupt rate varies between 1000 and 3000 (interrupt moderation by the NIC?).

I'm looking for ideas that can explain all this, and also for guidance on how to instrument the kernel to find out what is happening here. Fixes would also be welcome :)