Date: Sun, 5 Apr 2009 23:17:58 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: Ivan Voras
Cc: freebsd-net@freebsd.org
Subject: Re: Advice on a multithreaded netisr patch?

On Sun, 5 Apr 2009, Ivan Voras wrote:

>> The argument is not that they are slower (although they probably are a
>> bit slower), rather that they introduce serialization bottlenecks by
>> requiring synchronization between CPUs in order to distribute the work.
>> Certainly some of the scalability issues in the stack are not a result
>> of that, but a good number are.
>
> I'd like to understand more. If (in netisr) I have an mbuf with headers,
> is this data already transferred from the card or is it magically "not
> here yet"?

A lot depends on the details of the card and driver.  The driver will take
cache misses on the descriptor ring entry, if it's not already in cache,
and the link layer will take a cache miss on the front of the ethernet
frame in the cluster pointed to by the mbuf header as part of its demux.
What happens next depends on your dispatch model and cache line size.
Let's make a few simplifying assumptions that are mostly true:

- The driver associates a single cluster with each receive ring entry for
  each packet to be stored in, and the cluster is cacheline-aligned.  No
  header splitting is enabled.

- Standard ethernet encapsulation of IP is used, without additional VLAN
  headers or other encapsulation, etc.  There are no IP options.

- We don't need to validate any checksums because the hardware has done
  that for us, so there's no need to take cache misses on data that
  doesn't matter until we reach higher layers.

In the device driver/ithread code, we'll now proceed to take some cache
misses, assuming we're not pretty lucky:

(1) The descriptor ring entry
(2) The mbuf packet header
(3) The first cache line in the cluster

This is sufficient to figure out what protocol we're going to dispatch to,
and depending on the dispatch model, we now either enqueue the packet for
delivery to a netisr, or we directly dispatch the handler for IP.  If the
packet is processed on the current CPU and we're direct dispatching, or if
we've dispatched to a netisr on the same CPU and we're quite lucky, the
mbuf packet header and the front of the cluster will be in the cache.
However, what happens next depends on the cache fetch and line size.

If things happen in 32-byte cache lines or smaller, we cache miss on the
end of the IP header, because the last two bytes of the destination IP
address start at offset 32 into the cluster.  If we have 64-byte fetching
and line size, things go better, because both the full IP and TCP headers
should be in that first cache line.  One big advantage of direct dispatch
is that it maximizes the chances that we don't blow out the low-level CPU
caches between link-layer and IP-layer processing, meaning that we might
actually get through all the IP and TCP headers without a further cache
miss on a 64-byte line size.
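As a back-of-the-envelope check of those offsets (assuming plain Ethernet
plus IPv4 with no VLAN tag and no IP options, per the assumptions above),
here is a small self-contained C program -- an illustration, not kernel
code:

    #include <stdio.h>

    int
    main(void)
    {
        int ether_hdr = 14;              /* dst MAC (6) + src MAC (6) + type (2) */
        int ip_dst_off = ether_hdr + 16; /* dst IP is 16 bytes into the IP header */
        int ip_dst_end = ip_dst_off + 4 - 1;

        printf("dst IP occupies cluster offsets %d-%d\n",
            ip_dst_off, ip_dst_end);
        printf("32-byte lines: offsets 32-33 land in the second line "
            "(extra miss)\n");
        printf("64-byte lines: Ethernet+IP+TCP headers (offsets 0-%d) fit "
            "in the first line\n", ether_hdr + 20 + 20 - 1);
        return (0);
    }

It prints offsets 30-33 for the destination address, so the last two bytes
do indeed straddle a 32-byte boundary, while the full 54 bytes of headers
fit within one 64-byte line.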
If we netisr dispatch to another CPU without a shared cache, or we netisr
dispatch to the current CPU but there's a scheduling delay, other packets
queued first, etc., we'll take a number of the same cache misses over
again as things get pulled into the right cache.  This presents a strong
cache motivation to keep a packet "on" a CPU, and even in the same thread,
once you've started processing it.  If you have to enqueue, you take
locks, pay a context switch, deal with the fact that LRU on cache lines
isn't going to like your queue depth, and potentially pay a number of
additional cache misses on the same data.

There are also other good reasons to use direct dispatch, such as avoiding
doing work on packets that will later be dropped if the netisr queue
overflows.  This is why we direct dispatch by default, and why it is quite
a good strategy for network cards with multiple input queues, where it
also buys us parallelism.

Note that if the flow RSS hash is in the same cache line as the rest of
the receive descriptor ring entry, you may be able to avoid the cache miss
on the cluster entirely, and simply redirect the packet to another CPU's
netisr without ever reading packet data.  That avoids at least one and
possibly two cache misses, but it also means that you have to run the link
layer in the remote netisr, rather than locally in the ithread.
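To illustrate that placement decision, here is a minimal user-space sketch
that prefers a NIC-supplied flow hash and falls back to a software hash
over the addresses and ports; the names (pkt_info, NETISR_CPUS,
pick_netisr_cpu) are hypothetical stand-ins, and real kernel code would
consult m->m_pkthdr.flowid rather than a structure like this:

    #include <stdint.h>
    #include <stdio.h>

    #define NETISR_CPUS 4                /* hypothetical netisr thread count */

    struct pkt_info {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint32_t flowid;                 /* RSS hash from the descriptor */
        int      have_flowid;
    };

    /* Software fallback: mix both IPs and the TCP ports. */
    static uint32_t
    sw_flow_hash(const struct pkt_info *p)
    {
        uint32_t h = p->src_ip ^ p->dst_ip;

        h ^= (uint32_t)p->src_port << 16 | p->dst_port;
        h ^= h >> 16;                    /* cheap avalanche step */
        return (h);
    }

    static int
    pick_netisr_cpu(const struct pkt_info *p)
    {
        /* Prefer the hardware flowid: no cache miss on packet data. */
        uint32_t h = p->have_flowid ? p->flowid : sw_flow_hash(p);

        return ((int)(h % NETISR_CPUS));
    }

    int
    main(void)
    {
        struct pkt_info p = { 0x0a000001, 0x0a000002, 49152, 80, 0, 0 };

        printf("queue packet on netisr %d\n", pick_netisr_cpu(&p));
        return (0);
    }

The point of the have_flowid branch is exactly the trade-off above: with a
hardware hash, the redirect decision touches only the descriptor ring
entry, at the cost of doing the link-layer demux remotely.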
> In the first case, the packet reception code path is not changed until
> it's queued on a thread, on which it's handled in the future (or is the
> influence of "other" data like timers and internal TCP reassembly
> buffers so large?). In the second case, why?

The good news about TCP reassembly is that we don't have to look at the
data, only mbuf headers and reassembly buffer entries, so with any luck
we've avoided actually taking a cache miss on the data.  If things go
well, we can avoid looking at anything but mbuf and packet headers until
the socket copies out, but I'm not sure how well we do that in practice.

> As the card and the OS can already process many packets per second for
> something fairly complex as routing (http://www.tancsa.com/blast.html),
> and TCP chokes swi:net at 100% of a core, isn't this an indication that
> there's certainly more space for improvement even with single-queue
> old-fashioned NICs?

Maybe.  It depends on the relative costs of local processing vs.
redistributing the work, which involves schedulers, IPIs, additional cache
misses, lock contention, and so on.  This means there's a period where it
can't possibly be a win, and then at some point it's a win as long as the
stack scales.  This is essentially the usual trade-off in using threads
and parallelism: does the benefit of multiple parallel execution units
make up for the overheads of synchronization and data migration?

There are some previous e-mail threads where people have observed that for
some workloads, switching to netisr wins over direct dispatch.  For
example, if you have a number of cores and are doing firewall processing,
offloading work to the netisr from the input ithread may improve
performance.  However, this appears not to be the common case for end-host
workloads on the hardware we mostly target, and it is increasingly less so
as multiple input queues come into play, since the card itself will allow
us to use multiple CPUs without any interaction between the CPUs.

This isn't to say that work redistribution using a netisr-like scheme
isn't a good idea: in a world where CPU threads are weak compared to the
wire workload, and where there's cache locality across threads on the same
core, or NUMA is present, there may be potential for a big win when the
available work significantly exceeds what a single CPU thread/core can
handle.  In that case, we want to place the work as close as possible, to
take advantage of shared caches or of the memory being local to the CPU
thread/core doing the deferred work.

FYI, the localhost case is a bit weird -- I think we have some scheduling
issues that are causing loopback netisr traffic to be pessimally
scheduled.  Here are some suggestions for things to try and see if they
help, though:

- Comment out all ifnet, IP, and TCP global statistics in your local
  stack -- especially look for things like tcpstat.whatever++;.

- Use cpuset to pin the ithreads, the netisr, and whatever else to
  specific cores so that they don't migrate, and if your system uses HTT,
  experiment with pinning the ithread and the netisr on different threads
  on the same core, or at least on different cores on the same die (see
  the pinning sketch at the end of this message).

- Experiment with using just the source IP, the source + destination IPs,
  and both IPs plus the TCP ports in your hash.

- If your card supports RSS, pass the flowid up the stack in the mbuf
  packet header flowid field, and use that instead of a software hash for
  work placement.

- If you're doing pure PPS tests with UDP (or the like), and your test can
  tolerate reordering, try hashing based on the mbuf header address or
  something else that will distribute the work but not take a cache miss.

- If you have a flowid, or the above reordering condition applies, try
  shifting the link-layer dispatch to the netisr rather than doing the
  demux in the ithread, as that will avoid cache misses in the ithread and
  do all the demux in the netisr.

Robert N M Watson
Computer Laboratory
University of Cambridge
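Pinning sketch referenced above: a minimal user-space example using
FreeBSD's cpuset_setaffinity(2) to bind the calling thread to one CPU.
The CPU number here is arbitrary, and for kernel ithreads and the netisr
you would use cpuset(1) on their thread IDs rather than code like this:

    #include <sys/param.h>
    #include <sys/cpuset.h>

    #include <err.h>
    #include <stdio.h>

    int
    main(void)
    {
        cpuset_t mask;

        CPU_ZERO(&mask);
        CPU_SET(2, &mask);               /* run only on CPU 2 */

        /* id of -1 means "the current thread" for CPU_WHICH_TID. */
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) != 0)
            err(1, "cpuset_setaffinity");

        printf("pinned to CPU 2\n");
        return (0);
    }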