From owner-freebsd-performance@FreeBSD.ORG Tue Oct 11 14:01:13 2005
Date: Tue, 11 Oct 2005 15:01:11 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: performance@FreeBSD.org
Cc: net@FreeBSD.org
Message-ID: <20051011145923.B92528@fledge.watson.org>
In-Reply-To: <20051005133730.R87201@fledge.watson.org>
Subject: Re: Call for performance evaluation: net.isr.direct

On Wed, 5 Oct 2005, Robert Watson wrote:

> In 2003, Jonathan Lemon added initial support for direct dispatch of
> netisr handlers from the calling thread, as part of his DARPA/NAI Labs
> contract in the DARPA CHATS research program. Over the two years since
> then, Sam Leffler and I have worked to refine this implementation,
> removing a number of ordering-related issues, opportunities for
> excessive parallelism, and recursion issues, and testing with a broad
> range of network components. There has also been a significant effort
> to complete MPSAFE locking work throughout the network stack. Combined
> with the earlier move to ithreads and a functional direct dispatch
> ("process to completion") implementation, there are a number of
> exciting possible benefits.

If I don't hear anything back in the near future, I will commit a change
to 7.x to make direct dispatch the default, in order to let a broader
community do the testing. :-)

If you are set up to easily test stability and performance relating to
direct dispatch, I would appreciate any help (a quick example of flipping
the setting follows the quoted list below). As of 6.0-RC1 and recent 7.x,
the name of the sysctl is "net.isr.direct"; it was previously named
"net.isr.enable", but its use is not recommended in versions that do not
use the new name.

Thanks,

Robert N M Watson

> - Possible parallelism by packet source -- ithreads can dispatch
>   simultaneously into the higher-level network stack layers. Since
>   ithreads can execute in parallel on different CPUs, so can the code
>   they invoke directly.
>
> - Elimination of context switches in the network receive path -- rather
>   than context switching to the netisr thread from the ithread, we can
>   now execute netisr code directly from the ithread.
>
> - A CPU-bound netisr thread on a multi-processor system will no longer
>   rate-limit traffic to the resources available on one CPU.
>
> - Eliminating the additional queueing in the handoff reduces the
>   opportunity for queues to overfill as a result of scheduling delays.
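
For anyone who wants to try this quickly, the commands below are a minimal
sketch of toggling the setting; they assume 6.0-RC1 or later, where the
sysctl is named net.isr.direct, and must be run as root:

    # Check the current mode (0 = queue packets to the netisr thread).
    sysctl net.isr.direct

    # Enable direct dispatch at runtime; switching modes may briefly
    # reorder in-flight packets, so don't benchmark across the change.
    sysctl net.isr.direct=1

    # Revert to queued netisr processing.
    sysctl net.isr.direct=0

    # To apply the setting at every boot, add the following line to
    # /etc/sysctl.conf:
    #   net.isr.direct=1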
>
> There are, however, some possible downsides and/or trade-offs:
>
> - Higher-level network processing will now compete with the interrupt
>   handler for CPU resources available to the ithread. This means less
>   time for the interrupt code to execute in the thread if the thread is
>   CPU-bound.
>
> - Lower levels of parallelism between portions of the inbound packet
>   processing path. Without direct dispatch, there is possible
>   parallelism between receive network driver execution and higher-level
>   stack layers, whereas with direct dispatch they can no longer execute
>   in parallel.
>
> - Re-queued packets from tunnel and encapsulation processing will now
>   require a context switch to process, since they will be processed in
>   the netisr proper rather than in the ithread, whereas before the
>   netisr thread would pick them up immediately after completing the
>   current processing, without a context switch.
>
> - Code that previously ran in the SWI at a SWI priority now runs in the
>   ithread at an ithread priority, elevating the general priority at
>   which network processing takes place.
>
> And there are a few mixed changes, which have both good and bad
> elements:
>
> - Less queueing takes place in the network stack in inbound processing:
>   packets are taken directly from the driver and processed to completion
>   one by one, rather than queued for batch processing. Packets will be
>   dropped before the link layer, rather than on the boundary between the
>   link and protocol layers. This is good in that we invest less work in
>   packets we were going to drop anyway, but bad in that less queueing
>   means less room to absorb scheduling delays.
>
> In previous FreeBSD releases, such as several 5.x series releases,
> net.isr.enable could not be turned on by default because there was
> insufficient synchronization in the network stack. As of 5.5 and 6.0, I
> believe there is sufficient synchronization, especially given that we
> force non-MPSAFE protocol handlers to run in the netisr without direct
> dispatch. As such, there has been an ongoing conversation about making
> direct dispatch the default behavior in the 7.x development series, and
> about more publicly documenting and supporting the use of direct
> dispatch in the 6.x release engineering series.
>
> Obviously, this is about two things: performance and stability. Many of
> us have been running with direct dispatch on by default for quite some
> time, so it passes some of the basic "does it run" tests. However, since
> it significantly increases the opportunity for parallelism in the
> receive path of the network stack, it will likely cause otherwise latent
> or infrequent races and bugs to show up more often. The second aspect is
> performance: many results suggest that direct dispatch has a significant
> performance benefit. However, evaluating the impact across a broad range
> of workloads is required before we go ahead with what is effectively a
> significant architectural change in how we perform network stack
> processing.
>
> To give you a sense of some of the performance effects I've measured
> recently using the netperf measurement tool (with -DHISTOGRAM removed
> from the FreeBSD port build), here are some results. In each case, I've
> put parentheses around "host" or "router" to indicate the host where the
> configuration change is being tested.
> These tests were performed using dual Xeon systems, with back-to-back
> gigabit Ethernet cards and the if_em driver:
>
> TCP round-trip benchmark (TCP_RR), host-(host):
>
>     7.x UP:  0.9% performance improvement
>     7.x SMP: 0.7% performance improvement
>
> TCP round-trip benchmark (TCP_RR), host-(router)-host:
>
>     7.x UP:  2.4% performance improvement
>     7.x SMP: 2.9% performance improvement
>
> UDP round-trip benchmark (UDP_RR), host-(host):
>
>     7.x UP:  0.7% performance improvement
>     7.x SMP: 0.6% performance improvement
>
> UDP round-trip benchmark (UDP_RR), host-(router)-host:
>
>     7.x UP:  2.2% performance improvement
>     7.x SMP: 3.0% performance improvement
>
> TCP stream benchmark (TCP_STREAM), host-(host):
>
>     7.x UP:  0.8% performance improvement
>     7.x SMP: 1.8% performance improvement
>
> TCP stream benchmark (TCP_STREAM), host-(router)-host:
>
>     7.x UP:  13.6% performance improvement
>     7.x SMP: 15.7% performance improvement
>
> UDP stream benchmark (UDP_STREAM), host-(host):
>
>     7.x UP:  none
>     7.x SMP: none
>
> UDP stream benchmark (UDP_STREAM), host-(router)-host:
>
>     7.x UP:  none
>     7.x SMP: none
>
> TCP connect benchmark (src/tools/tools/netrate/tcpconnect):
>
>     7.x UP:  7.90383% +/- 0.553773%
>     7.x SMP: 12.2391% +/- 0.500561%
>
> So in some cases the impact is negligible -- in other places, it is
> quite significant. So far I've not measured a case where performance has
> gotten worse, but that's probably because I've only been measuring a
> limited number of cases, with a fairly limited set of configurations. In
> particular, the hardware I have is pushing the limits of what the wire
> supports, so minor changes in latency are possible, but not large
> changes in throughput.
>
> Beyond being a summary of the status quo, this is also a call to action.
> I would like to get more widespread benchmarking of the impact of direct
> dispatch on network-related workloads. This means a variety of things:
>
> (1) Performance of low-level network services, such as routing,
>     bridging, and filtering.
>
> (2) Performance of high-level application services, such as web and
>     database.
>
> (3) Performance of integrated kernel network services, such as the NFS
>     client and server.
>
> (4) Performance of user space distributed file systems, such as Samba
>     and AFS.
>
> All you need to do to switch to direct dispatch mode is set the sysctl
> or tunable "net.isr.direct" to 1. To disable it again, remove the
> setting, or set it to 0. It can be modified at run time, although during
> the transition from one mode to the other there may be a small amount of
> packet reordering, so benchmarking across the transition is discouraged.
> FYI: as of 6.0-RC1 and recent 7.x, net.isr.direct is the name of the
> variable. In earlier releases, the name of this variable was
> net.isr.enable.
>
> Some important details:
>
> - Only non-local protocol traffic is affected: loopback traffic still
>   goes via the netisr to avoid issues of recursion and lock order.
>
> - In the general case, only inbound traffic is directly affected by this
>   change. As such, send-only benchmarks may reveal little change. They
>   are still interesting, however.
>
> - However, the send path is indirectly affected due to changes in
>   scheduling, workload, interrupt handling, and so on.
>
> - Because network benchmarks, especially micro-benchmarks, are very
>   sensitive to minor perturbations, I highly recommend running in a
>   minimal multi-user or, ideally, single-user environment, and suggest
>   isolating undesired sources of network traffic from the segments where
>   testing is occurring. For macro-benchmarks this is less important, but
>   it should still be kept in mind.
>
> - Please make sure debugging features are turned off when running tests
>   -- especially WITNESS, INVARIANTS, INVARIANT_SUPPORT, and user space
>   malloc debugging. These can have a significant impact on performance,
>   both potentially overshadowing changes and, in some cases, actually
>   reversing results (due to higher overhead under locks, for example).
>
> - Do not use net.isr.enable in the 5.x line unless you know what you are
>   doing. While it is reasonably safe from 5.4 onwards, it is not a
>   supported configuration, and may cause stability issues with specific
>   workloads.
>
> - What we're particularly interested in is a statistically meaningful
>   comparison of the "before" and "after" cases. When doing measurements,
>   I like to run 10-12 samples, and usually discard the first one or two,
>   depending on the details of the benchmark. I'll then use
>   src/tools/tools/ministat to compare the data sets. Running a number of
>   samples is quite important: the variance in many tests can be
>   significant, and if the two sample sets overlap, a small number of
>   measurements can easily lead to entirely the wrong conclusion.
>
> Assuming you have a fixed-width font, typical output from ministat looks
> something like the following, and may even be human-readable:
>
> x 7SMP/tcpconnect_queue
> + 7SMP/tcpconnect_direct
> +--------------------------------------------------------------------------+
> |x xx                                                              +  +    |
> |xxxxx xx                                                       ++ +++++  +|
> ||__A__|                                                          |___A__| |
> +--------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  10          5425          5503          5460        5456.3     26.284977
> +  10          6074          6169          6126        6124.1     31.606785
> Difference at 95.0% confidence
>         667.8 +/- 27.3121
>         12.2391% +/- 0.500561%
>         (Student's t, pooled s = 29.0679)
>
> Of particular interest is whether changing to direct dispatch hurts
> performance in your environment, and understanding why that is.
>
> Thanks,
>
> Robert N M Watson
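
To make the before/after comparison concrete, here is roughly the kind of
script that produces ministat-friendly data files. It is only a sketch:
the remote host name is a placeholder, and the awk field index assumes the
classic single-line TCP_STREAM output that netperf prints with -P 0, so it
may need adjusting for other netperf versions or test types.

    #!/bin/sh
    # Collect 12 TCP_STREAM samples with queued netisr processing and 12
    # with direct dispatch, then compare the two sample sets with ministat.

    REMOTE=testhost    # placeholder: a host on the test segment running netserver
    SAMPLES=12

    run_samples()
    {
            outfile=$1
            : > "$outfile"
            i=0
            while [ $i -lt $SAMPLES ]; do
                    # -P 0 suppresses the banner and headers; field 5 of the
                    # remaining line is throughput in netperf's classic
                    # TCP_STREAM layout -- adjust if your netperf differs.
                    netperf -H "$REMOTE" -t TCP_STREAM -l 30 -P 0 | \
                        awk 'NF >= 5 { print $5 }' >> "$outfile"
                    i=$((i + 1))
            done
    }

    sysctl net.isr.direct=0
    run_samples tcpstream_queued.txt

    sysctl net.isr.direct=1
    run_samples tcpstream_direct.txt

    # ministat (src/tools/tools/ministat) reports whether the difference
    # between the two sets is significant at 95% confidence.
    ministat tcpstream_queued.txt tcpstream_direct.txt

If the two sample sets overlap in the ministat output, gather more samples
before drawing any conclusions; dropping the first run or two from each
file, as described above, also helps avoid warm-up effects.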