Date: Sat, 8 Oct 2005 14:39:26 +0100 (BST)
From: Robert Watson
To: arch@FreeBSD.org
Subject: Call for performance evaluation: net.isr.direct (fwd)

FYI, as this is a general architectural issue.  Please follow up to
performance@/net@.

Thanks,

Robert N M Watson

---------- Forwarded message ----------
Date: Wed, 5 Oct 2005 17:12:14 +0100 (BST)
From: Robert Watson
To: performance@FreeBSD.org
Cc: net@FreeBSD.org
Subject: Call for performance evaluation: net.isr.direct

In 2003, Jonathan Lemon added initial support for direct dispatch of
netisr handlers from the calling thread, as part of his DARPA/NAI Labs
contract in the DARPA CHATS research program.  Over the two years since
then, Sam Leffler and I have worked to refine this implementation,
removing a number of ordering-related issues, opportunities for excessive
parallelism, and recursion issues, and testing it with a broad range of
network components.  There has also been a significant effort to complete
the MPSAFE locking work throughout the network stack.  Combined with the
earlier move to ithreads and a functional direct dispatch ("process to
completion") implementation, there are a number of exciting possible
benefits:

- Possible parallelism by packet source -- ithreads can dispatch
  simultaneously into the higher level network stack layers.  Since
  ithreads can execute in parallel on different CPUs, so can the code
  they invoke directly.

- Elimination of context switches in the network receive path -- rather
  than context switching to the netisr thread from the ithread, we can
  now directly execute netisr code from the ithread.

- A CPU-bound netisr thread on a multi-processor system will no longer
  rate limit traffic to the available resources on one CPU.

- Eliminating the additional queueing in the handoff reduces the
  opportunity for queues to overfill as a result of scheduling delays.

There are, however, some possible downsides and/or trade-offs:

- Higher level network processing will now compete with the interrupt
  handler for the CPU resources available to the ithread.  This means
  less time for the interrupt code to execute in the thread if the thread
  is CPU-bound.

- Lower levels of parallelism between portions of the inbound packet
  processing path.
  Without direct dispatch, there is possible parallelism between receive
  network driver execution and the higher level stack layers, whereas
  with direct dispatch they can no longer execute in parallel.

- Re-queued packets from tunnel and encapsulation processing will now
  require a context switch to process, since they will be processed in
  the netisr proper rather than in the ithread, whereas before the netisr
  thread would pick them up immediately after completing the current
  processing, without a context switch.

- Code that previously ran in the SWI at a SWI priority now runs in the
  ithread at an ithread priority, elevating the general priority at which
  network processing takes place.

And there are a few mixed changes that offer both good and bad elements:

- Less queueing takes place in the network stack in in-bound processing:
  packets are taken directly from the driver and processed to completion
  one by one, rather than queued for batch processing.  Packets will be
  dropped before the link layer, rather than on the boundary between the
  link and protocol layers.  This is good in that we invest less work in
  packets we were going to drop anyway, but bad in that less queueing
  means less room to absorb scheduling delays.

In previous FreeBSD releases, such as several 5.x series releases,
net.isr.enable could not be turned on by default because there was
insufficient synchronization in the network stack.  As of 5.5 and 6.0, I
believe there is sufficient synchronization, especially given that we
force non-MPSAFE protocol handlers to run in the netisr without direct
dispatch.  As such, there has been a gradual conversation going on about
making direct dispatch the default behavior in the 7.x development
series, and about more publicly documenting and supporting the use of
direct dispatch in the 6.x release engineering series.

Obviously, this is about two things: performance and stability.  Many of
us have been running with direct dispatch on by default for quite some
time, so it passes some of the basic "does it run" tests.  However, since
it significantly increases the opportunity for parallelism in the receive
path of the network stack, it will likely cause otherwise latent or
infrequent races and bugs to show up more frequently.  The second aspect
is performance: many results suggest that direct dispatch has a
significant performance benefit.  However, evaluating the impact on a
broad range of results is required in order for us to go ahead with what
is effectively a significant architectural change in how we perform
network stack processing.
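To make the change concrete, here is a minimal sketch of the two receive
hand-off models.  The names below (example_dispatch(), example_direct,
and so on) are hypothetical stand-ins rather than the actual netisr code;
the point is only to show the difference between queueing for the
software interrupt thread and processing to completion in the ithread.

/*
 * Minimal sketch only -- hypothetical names, not the actual FreeBSD
 * netisr implementation.
 */
struct mbuf;                                       /* opaque packet buffer */

extern int  example_direct;                        /* think: net.isr.direct */
extern void example_protocol_input(struct mbuf *); /* e.g. IP input */
extern int  example_queue_packet(struct mbuf *);   /* 0 on success */
extern void example_schedule_netisr(void);         /* wake the SWI thread */

static void
example_dispatch(struct mbuf *m)
{
    if (example_direct) {
        /*
         * "Process to completion" in the calling ithread: no extra
         * queue and no context switch, but the protocol handler now
         * competes with the interrupt handler for the ithread's CPU
         * time.
         */
        example_protocol_input(m);
    } else {
        /*
         * Queue the packet and wake the netisr software interrupt
         * thread; the handler runs later, in that thread, after a
         * context switch.
         */
        if (example_queue_packet(m) == 0)
            example_schedule_netisr();
        /* example_queue_packet() drops the packet if the queue is full. */
    }
}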
To give you a sense of some of the performance effects I've measured
recently, using the netperf measurement tool (with -DHISTOGRAM removed
from the FreeBSD port build), here are some results.  In each case, I've
put parentheses around "host" or "router" to indicate the host where the
configuration change is being tested.  These tests were performed using
dual Xeon systems, back-to-back gigabit ethernet cards, and the if_em
driver:

TCP round trip benchmark (TCP_RR), host-(host):
  7.x UP:  0.9% performance improvement
  7.x SMP: 0.7% performance improvement

TCP round trip benchmark (TCP_RR), host-(router)-host:
  7.x UP:  2.4% performance improvement
  7.x SMP: 2.9% performance improvement

UDP round trip benchmark (UDP_RR), host-(host):
  7.x UP:  0.7% performance improvement
  7.x SMP: 0.6% performance improvement

UDP round trip benchmark (UDP_RR), host-(router)-host:
  7.x UP:  2.2% performance improvement
  7.x SMP: 3.0% performance improvement

TCP stream benchmark (TCP_STREAM), host-(host):
  7.x UP:  0.8% performance improvement
  7.x SMP: 1.8% performance improvement

TCP stream benchmark (TCP_STREAM), host-(router)-host:
  7.x UP:  13.6% performance improvement
  7.x SMP: 15.7% performance improvement

UDP stream benchmark (UDP_STREAM), host-(host):
  7.x UP:  none
  7.x SMP: none

UDP stream benchmark (UDP_STREAM), host-(router)-host:
  7.x UP:  none
  7.x SMP: none

TCP connect benchmark (src/tools/tools/netrate/tcpconnect):
  7.x UP:  7.90383% +/- 0.553773%
  7.x SMP: 12.2391% +/- 0.500561%

So in some cases the impact is negligible; in other places, it is quite
significant.  So far, I've not measured a case where performance has
gotten worse, but that is probably because I've only been measuring a
limited number of cases, and with a fairly limited scope of
configurations, especially given that the hardware I have is pushing the
limits of what the wire supports, so minor changes in latency are
possible, but not large changes in throughput.

So other than a summary of the status quo, this is also a call to action.
I would like to get more widespread benchmarking of the impact of direct
dispatch on network-related workloads.  This means a variety of things:

(1) Performance of low level network services, such as routing, bridging,
    and filtering.

(2) Performance of high level application services, such as web and
    database servers.

(3) Performance of integrated kernel network services, such as the NFS
    client and server.

(4) Performance of user space distributed file systems, such as Samba and
    AFS.

All you need to do to switch to direct dispatch mode is set the sysctl or
tunable "net.isr.direct" to 1.  To disable it again, remove the setting,
or set it to 0.  It can be modified at run-time, although during the
transition from one mode to the other there may be a small amount of
packet misordering, so benchmarking across the transition is discouraged.

FYI: as of 6.0-RC1 and recent 7.0, net.isr.direct is the name of the
variable.  In earlier releases, the name of this variable was
net.isr.enable.
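If it is more convenient to flip the setting from a test harness than
from the command line, the same thing can be done with sysctlbyname(3).
The short program below is only a sketch of that; it assumes the
net.isr.direct name described above (substitute net.isr.enable on older
releases) and must be run as root to change the value.

/*
 * Sketch: query and enable direct dispatch from user space via
 * sysctlbyname(3), equivalent to "sysctl net.isr.direct=1".
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
    int direct, enable = 1;
    size_t len = sizeof(direct);

    /* Read the current value. */
    if (sysctlbyname("net.isr.direct", &direct, &len, NULL, 0) == -1)
        err(1, "sysctlbyname(net.isr.direct) read");
    printf("net.isr.direct is currently %d\n", direct);

    /* Enable direct dispatch by supplying a new value. */
    if (sysctlbyname("net.isr.direct", NULL, NULL, &enable,
        sizeof(enable)) == -1)
        err(1, "sysctlbyname(net.isr.direct) write");
    printf("net.isr.direct set to %d\n", enable);

    return (0);
}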
Some important details:

- Only non-local protocol traffic is affected: loopback traffic still
  goes via the netisr to avoid issues of recursion and lock order.

- In the general case, only in-bound traffic is directly affected by this
  change.  As such, send-only benchmarks may reveal little change.  They
  are still interesting, however.

- The send path is, however, indirectly affected due to changes in
  scheduling, workload, interrupt handling, and so on.

- Because network benchmarks, especially micro-benchmarks, are very
  sensitive to minor perturbations, I highly recommend running in a
  minimal multi-user or, ideally, single-user environment, and suggest
  isolating undesired sources of network traffic from segments where
  testing is occurring.  For macro-benchmarks this can be less important,
  but should still be paid attention to.

- Please make sure debugging features are turned off when running tests
  -- especially WITNESS, INVARIANTS, INVARIANT_SUPPORT, and user space
  malloc debugging.  These can have a significant impact on performance,
  potentially overshadowing changes and, in some cases, actually
  reversing results (due to higher overhead under locks, for example).

- Do not use net.isr.enable in the 5.x line unless you know what you are
  doing.  While it is reasonably safe from 5.4 forwards, it is not a
  supported configuration, and may cause stability issues with specific
  workloads.

- What we're particularly interested in is a statistically meaningful
  comparison of the "before" and "after" cases.  When doing measurements,
  I like to run 10-12 samples and usually discard the first one or two,
  depending on the details of the benchmark.  I'll then use
  src/tools/tools/ministat to compare the data sets.  Running a number of
  samples is quite important, because the variance in many tests can be
  significant, and if the two sample sets overlap, you can quite easily
  draw entirely the wrong conclusion about the results from a small
  number of measurements.  Assuming you have a fixed-width font, typical
  output from ministat looks something like the following, and may even
  be human readable:

x 7SMP/tcpconnect_queue
+ 7SMP/tcpconnect_direct
+--------------------------------------------------------------------------+
|x   xx                                                      +           + |
|xxxxx xx                                                   ++ +++++     + |
||__A__|                                                    |___A__|       |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10          5425          5503          5460        5456.3     26.284977
+  10          6074          6169          6126        6124.1     31.606785
Difference at 95.0% confidence
        667.8 +/- 27.3121
        12.2391% +/- 0.500561%
        (Student's t, pooled s = 29.0679)

Of particular interest is whether changing to direct dispatch hurts
performance in your environment, and understanding why that is.

Thanks,

Robert N M Watson
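For reference, the "Difference at 95.0% confidence" that ministat prints
is a standard pooled two-sample Student's t confidence interval.  The
fragment below is only a sketch of that computation (hypothetical code,
not ministat's actual source), with the t critical value for 10 + 10 - 2
= 18 degrees of freedom hard-coded; fed the two sample sets summarized
above, it works out to the 667.8 +/- 27.3 shown.

/*
 * Sketch of the statistic ministat reports: difference of means and its
 * 95% confidence half-width for two samples of n points each, using a
 * pooled standard deviation and Student's t.  Not ministat's source.
 */
#include <math.h>
#include <stdio.h>

#define T_CRIT_95_DF18  2.100922    /* t(0.975) for 18 degrees of freedom */

static double
mean(const double *x, int n)
{
    double sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += x[i];
    return (sum / n);
}

static double
variance(const double *x, int n, double m)
{
    double sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += (x[i] - m) * (x[i] - m);
    return (sum / (n - 1));             /* sample variance */
}

static void
compare(const double *a, const double *b, int n)
{
    double ma = mean(a, n), mb = mean(b, n);
    double va = variance(a, n, ma), vb = variance(b, n, mb);
    /* Pooled standard deviation over 2n - 2 degrees of freedom. */
    double pooled_s = sqrt(((n - 1) * va + (n - 1) * vb) / (2 * n - 2));
    double half = T_CRIT_95_DF18 * pooled_s * sqrt(1.0 / n + 1.0 / n);

    printf("Difference at 95.0%% confidence\n");
    printf("\t%g +/- %g\n", mb - ma, half);
    printf("\t(Student's t, pooled s = %g)\n", pooled_s);
}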