Date: Sun, 5 Apr 2009 14:21:22 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: Ivan Voras
Cc: freebsd-net@freebsd.org
Subject: Re: Advice on a multithreaded netisr patch?

On Sun, 5 Apr 2009, Ivan Voras wrote:

> I'm developing an application that needs a high rate of small TCP
> transactions on multi-core systems, and I'm hitting a limit where a kernel
> task, usually swi:net (but it depends on the driver), hits 100% of a CPU at
> some transactions/s rate and blocks further performance increases even
> though other cores are 100% idle.

You can find a similar, if possibly more mature, implementation here:

  //depot/projects/rwatson/netisr2/...

I haven't updated it in about six months, since I've been waiting for the
RSS-based flowid support in HEAD to mature.

One of the fundamental problems with hashing packets to distribute work is
that it requires taking cache misses on the packet headers, not just once but
twice, which is often one of the largest costs in processing packets.  Most
modern high-performance network cards can already compute the hash in
hardware, and you want to use that hash to place work wherever possible.

In 8.x, you shouldn't be experiencing high lock contention on the TCP receive
path when doing bulk transfers, as we use read locking for the tcbinfo lock
in most cases.  In fact, you can get fairly decent scalability even in 7.x,
because the regular packet processing path for TCP uses mutual exclusion only
briefly.  However, the current approach does dirty a lot of cache lines,
especially for locks and statistics, and does not scale well (in 8.x, or at
all in 7.x) if you have lots of short connections.

Also, be aware that if you're outputting to a single interface or queue,
there's a *lot* of lock contention in the device driver.  Kip Macy has
patches to support multiple output queues on cxgb, which should facilitate
support for other drivers as well, and the plan is to get that into 8.0.

The patch above doesn't know about the mbuf packet header flowid yet, but
it's trivial to teach it about that.  I have plans to get back to the netisr2
code before we finalize 8.0, but have some other stuff in the queue first.
We're, briefly, in a period where the input queue count roughly matches the
CPU core count; it's not entirely clear, but we may soon be back in a
situation where CPU core count exceeds queue count, in which case doing
software work placement will continue to be important.
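For illustration only, here is a minimal user-space sketch of what
flowid-based work placement looks like.  This is not the netisr2 code; the
struct and function names are invented.  It just shows the idea of using a
hardware-supplied hash when one is available and falling back to a software
hash over the headers (with the cache-miss cost described above) when it
isn't:

/*
 * Sketch of flowid-based work placement.  "struct pkt" stands in for an
 * mbuf packet header: when the NIC has already computed an RSS hash
 * (hw_flowid_valid), use it directly and avoid touching the cold packet
 * headers; otherwise compute a software hash over the 4-tuple.
 */
#include <stdint.h>
#include <stdio.h>

#define NWORKERS 4                 /* hypothetical number of worker threads */

struct pkt {
        int      hw_flowid_valid;  /* did the NIC supply an RSS hash? */
        uint32_t hw_flowid;        /* hash taken in hardware */
        uint32_t src_ip, dst_ip;   /* header fields (cache-cold on receive) */
        uint16_t src_port, dst_port;
};

/* Simple software 4-tuple hash; forces a read of the packet header. */
static uint32_t
sw_flow_hash(const struct pkt *p)
{
        uint32_t h = p->src_ip ^ p->dst_ip;

        h ^= ((uint32_t)p->src_port << 16) | p->dst_port;
        h ^= h >> 16;
        return (h);
}

/* Pick a worker so that all packets of a flow land on the same thread. */
static unsigned
pick_worker(const struct pkt *p)
{
        uint32_t h;

        h = p->hw_flowid_valid ? p->hw_flowid : sw_flow_hash(p);
        return (h % NWORKERS);
}

int
main(void)
{
        struct pkt p = { 1, 0xdeadbeef, 0, 0, 0, 0 };

        printf("worker %u\n", pick_worker(&p));
        return (0);
}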
Right now, as long as your high-performance card supports multiple input
queues, we already do pretty effective work placement by virtue of RSS and
multiple ithreads.

Robert N M Watson
Computer Laboratory
University of Cambridge

> So I've got an idea and tested it out, but it fails in an unexpected way.
> I'm not very familiar with the network code, so I'm probably missing
> something obvious.  The idea was to locate where the packet processing
> takes place and offload packets to several new kernel threads.  I see this
> can happen in several places - netisr, ip_input and tcp_input - and I chose
> netisr because I thought maybe it would also help other uses (routing?).
> Here's a patch against CURRENT:
>
> http://people.freebsd.org/~ivoras/diffs/mpip.patch
>
> It's fairly simple - it starts a configurable number of threads in
> start_netisr(), assigns a circular queue to each, and modifies what I think
> are the entry points for packets in the non-netisr.direct case.  I also try
> to have TCP and UDP traffic from the same host+port processed by the same
> thread.  It has some rough edges, but I think this is enough to test the
> idea.  I know that there are several people officially working in this
> area, and I'm not an expert in it, so think of it as a weekend hack for
> learning purposes :)
>
> These parameters are needed in loader.conf to test it:
>
> net.isr.direct=0
> net.isr.mtdispatch_n_threads=2
>
> I expected that contention in the upper layers (TCP) might keep performance
> from improving at all, but I can't explain what I'm getting here.  While
> testing the application on a plain kernel, I get approx. 100,000 - 120,000
> packets/s per direction (by looking at "netstat 1") and a similar number of
> transactions/s in the application.  With the patch I get up to 250,000
> packets/s in netstat (3 mtdispatch threads), but for some weird reason the
> actual number of transactions processed by the application drops to less
> than 1,000 at the beginning (~30 seconds), then jumps to close to 100,000
> transactions/s, with netstat also showing a corresponding drop in the
> packet rate.  In the first phase, the new threads (netd0..3) are using
> almost 100% CPU time; in the second phase I can't see where the CPU time is
> going (using top).
>
> I thought this has something to do with interrupt moderation on the NIC
> (em), but I can't really explain it.  The bad-performance part (not the
> jump) is also visible over the loopback interface.
>
> Any ideas?
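For illustration, a minimal user-space sketch of the dispatch scheme Ivan
describes above: one circular queue per worker thread, with the same
host+port always hashed to the same thread.  The names here (struct workq,
dispatch(), flow_to_thread() and so on) are invented and do not correspond
to the actual patch; this is just a stand-in for the kernel code path.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 2
#define QLEN     128

struct workq {
        void           *ring[QLEN];   /* circular buffer of packet pointers */
        unsigned        head, tail;
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
};

static struct workq queues[NTHREADS];

/* Hash source address and port so a flow sticks to one thread. */
static unsigned
flow_to_thread(uint32_t addr, uint16_t port)
{
        return ((addr ^ port) % NTHREADS);
}

/* Producer side: what the input path would call instead of processing inline. */
static void
dispatch(void *pkt, uint32_t addr, uint16_t port)
{
        struct workq *q = &queues[flow_to_thread(addr, port)];

        pthread_mutex_lock(&q->lock);
        if ((q->tail + 1) % QLEN != q->head) {  /* drop if the ring is full */
                q->ring[q->tail] = pkt;
                q->tail = (q->tail + 1) % QLEN;
                pthread_cond_signal(&q->nonempty);
        }
        pthread_mutex_unlock(&q->lock);
}

/* Consumer side: one of these loops per worker thread. */
static void *
worker(void *arg)
{
        struct workq *q = arg;
        void *pkt;

        for (;;) {
                pthread_mutex_lock(&q->lock);
                while (q->head == q->tail)
                        pthread_cond_wait(&q->nonempty, &q->lock);
                pkt = q->ring[q->head];
                q->head = (q->head + 1) % QLEN;
                pthread_mutex_unlock(&q->lock);
                printf("processing packet %p\n", pkt);  /* stand-in for ip_input() */
        }
        return (NULL);
}

int
main(void)
{
        pthread_t tids[NTHREADS];
        static int dummy_packet;
        int i;

        for (i = 0; i < NTHREADS; i++) {
                pthread_mutex_init(&queues[i].lock, NULL);
                pthread_cond_init(&queues[i].nonempty, NULL);
                pthread_create(&tids[i], NULL, worker, &queues[i]);
        }
        dispatch(&dummy_packet, 0x0a000001, 80);  /* example packet */
        pthread_join(tids[0], NULL);              /* workers run forever */
        return (0);
}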