From owner-freebsd-net@FreeBSD.ORG  Sun Apr 5 13:54:20 2009
Date: Sun, 5 Apr 2009 14:54:19 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: Ivan Voras
Cc: freebsd-net@freebsd.org
Subject: Re: Advice on a multithreaded netisr patch?

On Sun, 5 Apr 2009, Ivan Voras wrote:

>>> I thought this has something to do with NIC moderation (em) but can't
>>> really explain it.  The bad performance part (not the jump) is also
>>> visible over the loopback interface.
>>
>> FYI, if you want high performance, you really want a card supporting
>> multiple input queues -- igb, cxgb, mxge, etc.  if_em-only cards are
>> fundamentally less scalable in an SMP environment because they require
>> input or output to occur only from one CPU at a time.
>
> Makes sense, but on the other hand - I see people are routing at least
> 250,000 packets per second per direction with these cards, so they
> probably aren't the bottleneck (pro/1000 pt on pci-e).

The argument is not that they are slower (although they probably are a bit
slower), rather that they introduce serialization bottlenecks by requiring
synchronization between CPUs in order to distribute the work.  Certainly
some of the scalability issues in the stack are not a result of that, but a
good number are.  Historically, we've had a number of bottlenecks in, say,
the bulk data receive and send paths, such as:

- Initial receipt and processing of packets on a single CPU as a result of
  a single input queue from the hardware.  Addressed by using multiple
  input queue hardware with appropriately configured drivers (generally the
  default is to use multiple input queues in 7.x and 8.x for supporting
  hardware).

- Cache line contention on stats data structures in drivers and various
  levels of the network stack, due to bouncing around exclusive ownership
  of the cache line.  ifnet introduces at least a few, but I think most of
  the interesting ones are at the IP and TCP layers for receipt.

- Global locks protecting connection lists, all rwlocks as of 7.1, but not
  necessarily always used read-only for packet processing.  For UDP we do a
  very good job of avoiding write locks, but for TCP in 7.x we still use a
  global write lock, if briefly, for every packet.  There's a change in 8.x
  to use a global read lock for most packets, especially steady-state
  packets, but I didn't merge it for 7.2 because it's not well-benchmarked.
  Assuming I get positive feedback from more people, I will merge it before
  7.3.  (A rough sketch of the read-vs-write lock pattern follows this
  list.)

- If the user application is multi-threaded and receiving from many threads
  at once, we see contention on the file descriptor table lock.  This was
  markedly improved by the file descriptor table locking rewrite in 7.0,
  but we're continuing to look for ways to mitigate this.  A lockless
  approach would be really nice...
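Just to make the read-vs-write distinction concrete, here's a minimal
sketch of the rwlock(9) pattern involved.  This is not the actual inpcb
code; conninfo_lock, conn_lookup(), conn_insert() and the hash helpers are
hypothetical names used only for illustration:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>

/* Hypothetical connection hash, standing in for the real inpcb machinery. */
struct conn;
struct packet_hdr;
struct conn	*conn_hash_find(struct packet_hdr *);	/* hypothetical */
void		 conn_hash_insert(struct conn *);	/* hypothetical */

static struct rwlock conninfo_lock;

static void
conninfo_init(void)
{

	rw_init(&conninfo_lock, "conninfo");
}

/* Common case: every inbound packet looks up an existing connection. */
static struct conn *
conn_lookup(struct packet_hdr *ph)
{
	struct conn *c;

	rw_rlock(&conninfo_lock);	/* shared: many CPUs in parallel */
	c = conn_hash_find(ph);
	rw_runlock(&conninfo_lock);
	return (c);
}

/* Rare case: connection setup or teardown modifies the list. */
static void
conn_insert(struct conn *c)
{

	rw_wlock(&conninfo_lock);	/* exclusive: serializes all CPUs */
	conn_hash_insert(c);
	rw_wunlock(&conninfo_lock);
}

With the 8.x change, the common per-packet path becomes the rw_rlock()
case, so multiple CPUs can hold the lock at once instead of serializing on
a write lock for every packet.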
On the transmit path, the bottlenecks are similar but different:

- Neither 7.x nor 8.x supports multiple transmit queues as shipped; Kip has
  patches for both that add it for cxgb.  Maintaining ordering here, and
  ideally affinity to the appropriate associated input queue, is important.
  Since the patches aren't in the tree yet, and for single-queue drivers
  regardless, contention on the device driver send path and queues can be
  significant, especially for drivers where the send and receive paths are
  protected by the same lock (bge!).

- Stats at various levels in the stack still dirty cache lines.

- We don't acquire, in the common case, any global connection list locks
  during transmit.

- Routing table locks may be an issue.  Kip has patches against 8.x to
  reintroduce inpcb route caching as well as link-layer flow caching; these
  are in my review queue currently...  In 8.x the global radix tree lock is
  a read-write lock and we use read locking where possible, but in 7.x it's
  still a mutex.  This probably isn't an MFCable change.

Another change coming in 8.x is increased use of read-mostly locks
(rmlocks), which avoid writes to shared cache lines on read-acquire but
have a more expensive write-acquire.  We're already using them in a few
spots, including for firewall registration, but need to use them in more.
(A minimal rmlock sketch is appended below my signature.)

With a fast CPU, introducing more cores may not necessarily speed up
processing, and may often slow it down, even if all bottlenecks are
eliminated: fundamentally, if you have the CPU capacity to do the work on
one CPU, then moving the work to other CPUs is an overhead best avoided,
especially if the device itself forces serialization by having a single
input queue and a single output queue.  However, if we assume, reasonably,
that core speeds will cap out over time while CPU density keeps increasing,
then software work placement becomes more important.  And with multi-queue
devices, avoiding writes to common cache lines from multiple CPUs becomes
increasingly possible.

We have a 32-thread MIPS embedded eval board in the Netperf cluster now,
which we'll begin using for 10gbps testing fairly soon, I hope.  One of its
properties is that individual threads are decidedly non-zippy compared to,
say, a 10gbps interface running at line rate, so it will allow us to
explore these issues more effectively than we could before.

Robert N M Watson
Computer Laboratory
University of Cambridge
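P.S. Since rmlocks are still fairly new, here's a minimal sketch of the
rmlock(9) pattern, under the same caveat as before: fw_hooks_lock and the
fw_hook_* helpers are hypothetical stand-ins, not the real pfil/firewall
registration code.  The point is that a reader tracks its acquisition in
an rm_priotracker on its own stack, so the common read path doesn't write
to a shared cache line; the rare writer pays the extra cost instead:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mbuf.h>
#include <sys/rmlock.h>

/* Hypothetical hook list, standing in for the real firewall hook state. */
struct fw_hook;
int	fw_hook_list_run(struct mbuf *);	/* hypothetical */
void	fw_hook_list_insert(struct fw_hook *);	/* hypothetical */

static struct rmlock fw_hooks_lock;

static void
fw_hooks_init(void)
{

	rm_init(&fw_hooks_lock, "fw_hooks");
}

/* Per-packet path: read-acquire doesn't dirty a shared cache line. */
static int
fw_run_hooks(struct mbuf *m)
{
	struct rm_priotracker tracker;	/* lives on this thread's stack */
	int error;

	rm_rlock(&fw_hooks_lock, &tracker);
	error = fw_hook_list_run(m);
	rm_runlock(&fw_hooks_lock, &tracker);
	return (error);
}

/* Rare path: registering a hook takes the expensive write lock. */
static void
fw_register_hook(struct fw_hook *h)
{

	rm_wlock(&fw_hooks_lock);
	fw_hook_list_insert(h);
	rm_wunlock(&fw_hooks_lock);
}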