Date: Wed, 21 Oct 2015 01:51:00 +1100 (EST)
From: Bruce Evans
To: "Eggert, Lars"
cc: Ian Smith, Kevin Oberman, "net@freebsd.org", "jfv@FreeBSD.org",
    "ricera10@gmail.com", Luigi Rizzo, Giuseppe Lettieri
Subject: Re: ixl 40G bad performance?
In-Reply-To: <6CD6754D-FC0E-4B24-AAEC-7C9D68284141@netapp.com>
Message-ID: <20151020232218.G1833@besplex.bde.org>
References: <79830D9D-94E6-47A9-92B9-D63DF5432272@netapp.com>
    <20151020190541.B15983@sola.nimnet.asn.au>
    <6CD6754D-FC0E-4B24-AAEC-7C9D68284141@netapp.com>

On Tue, 20 Oct 2015, Eggert, Lars wrote:

> Hi,
>
> On 2015-10-20, at 10:24, Ian Smith wrote:
>> Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead.
>
> Done.
>
> On 2015-10-19, at 17:55, Luigi Rizzo wrote:
>> On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars wrote:
>>> The only other sysctls in ixl(4) that look relevant are:
>>>
>>> hw.ixl.rx_itr
>>>         The RX interrupt rate value, set to 8K by default.
>>>
>>> hw.ixl.tx_itr
>>>         The TX interrupt rate value, set to 4K by default.
>>>
>>
>> yes those. raise to 20-50k and see what you get in
>> terms of ping latency.
>
> While ixl(4) talks about 8K and 4K, the defaults actually seem to be:
>
> hw.ixl.tx_itr: 122
> hw.ixl.rx_itr: 62

ixl seems to have a different set of itr sysctl bugs than em.  In em,
122 for the itr means 125 initially, but it is documented (only by
sysctl -d, not by the man page) as having units usecs/4.  The units are
actually usecs*4 except initially, and these units take effect if you
write the initial value back -- writing back 122 changes the active
period from 125 to 488.  122 instead of 125 is the result of confusion
between powers of 2 and powers of 10.

The first obvious bug in ixl is that the above sysctls are read-only
global tunables (not documented as sysctls of course), but you can write
them using per-device sysctls (dev.ixl.[0-N].*itr?).  Writing them for 1
device clobbers the globals and probably the settings for all ixl
devices.

sysctl -d doesn't say anything useful about ixl's itrs.  It misdocuments
the units for all of them as being rates.  Actually, the units for 2 of
them are boolean and the units for the other 2 are periods.  ixl(4) uses
better wording for the booleans but even worse wording for the periods
("rate value").  em uses better wording for its itr sysctl but em(4) has
no documentation for any sysctl or its itr tunable.  igb is more like em
than ixl here.

122 seems to be the result of mis-scaling 125, and 62 from correctly
scaling 62.5, but these numbers are also off by a factor of 2.  Either
there is a scaling bug or the undocumented units are usecs/2 where em's
documented units are usecs/4.  In em, the default itr rate is 8 kHz
(power of 10), but in ixl it is unclear if 4K and 8K are actually 4000
and 8000, since they are scaled more in hardware.  IXL_ITR_4K is
hard-coded as 122; the scale is linear but there aren't enough bits to
preserve linearity; it is unclear if the hard-coded values are defined by
the hardware or are the result of precomputing the values (using
hard-coded 0x7A (122) where em uses 1000000 / SCALE, 1000000 being
user-friendly microseconds and SCALE a hardware clock frequency).

I think 122 really does mean a period that approximates the period for a
frequency of 4 kHz.  The period for this frequency is 250 usecs, and 122
is 250 with units of usec*2, with an approximate error of 3 units.  Or
122 is the period for the documented frequency of 4K (binary power of 2
with undocumented units which I assume are Hz), with the weird usec*2
units and a tiny error.  Similarly for 62 and 8K, except there is a
rounding error of almost 1.

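The unit arithmetic above can be checked with a small sketch.  This is
only an illustration of the two candidate readings (4K = 4000 or 4096,
with each register unit being 2 usecs); itr_to_hz() and hz_to_itr() are
hypothetical helpers, not functions from the ixl or em drivers, and the
usecs-per-unit argument is an assumption, not a documented hardware
value.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the ixl or em sources): convert an itr
 * register value to an interrupt rate in Hz, given an ASSUMED number of
 * microseconds per register unit.
 */
static double
itr_to_hz(unsigned int itr, double usecs_per_unit)
{
	assert(itr > 0 && usecs_per_unit > 0.0);
	return (1000000.0 / ((double)itr * usecs_per_unit));
}

/*
 * Inverse direction: the register value (truncated, as integer
 * precomputation would do) for a target rate in Hz under the same
 * assumed unit.
 */
static unsigned int
hz_to_itr(double hz, double usecs_per_unit)
{
	assert(hz > 0.0 && usecs_per_unit > 0.0);
	return ((unsigned int)(1000000.0 / hz / usecs_per_unit));
}
```

With 2 usecs per unit, hz_to_itr(4000.0, 2.0) is 125 and
hz_to_itr(4096.0, 2.0) is 122, while hz_to_itr(8000.0, 2.0) is 62 (62.5
truncated).  If the unit is instead 1 usec, itr_to_hz(122, 1.0) is
roughly 8.2 kHz, i.e. twice the documented 4K.
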
> Doubling those values *increases* flood ping latency to ~200 usec
> (from ~116 usec).

Since they are periods and not frequencies, doubling them should double
the latency.  Since their units are weird and undocumented, it is hard to
predict what the latency actually is.  But I predict that if the units
are usecs*2, then the unscaled values give average latencies from
interrupt moderation.  This gives 122 + 62 = 184 plus maybe another 20
for other delays.  Since the observed average latency is less than half
that, the units seem to be usecs*1 and it is the documented frequencies
that are off by a power of 2.

> Halving them to 62/31 decreases flood ping latency to ~50 usec, but
> still doesn't increase iperf throughput (still 2.8 Gb/s). Going to
> 31/16 further drops latency to 24 usec, with no change in throughput.

For em and lem, I use itr = 0 or 1 when optimizing for latency.  This
reduces the latency to 50 for lem but only to 73 for em (where the
connection goes through a slow switch to not so slow bge).  24 seems
quite good, and the lowest I have seen for 1 Gbps is 26, but this
requires kludges like a direct connection and polling, and I would hope
for 40 times lower at 40 Gbps.

> (Looking at the "interrupt Moderation parameters" #defines in
> sys/dev/ixl/ixl.h it seems that ixl likes to have its irq rates
> specified with some weird divider scheme.)
>
> With 5/5 (which corresponds to IXL_ITR_100K), I get down to 16 usec.
> Unfortunately, throughput is then also down to about 2 Gb/s.

Lowering (improving) latency always lowers (unimproves) throughput by
increasing load.  itr = 8 kHz is reasonable for 1 Gbps (it gives higher
latency than I like), but scaling that to 40 Gbps gives itr = 320 kHz and
it is impossible to scale up the speed of a single CPU to reasonably keep
up with that.

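The two estimates above (moderation latency and the scaled interrupt
rate) can be written out as a rough sketch.  It assumes, as one reading
of the prediction above, that the average wait is half of each
moderation period and that "other delays" is a fixed allowance; the
function names and the usecs-per-unit parameter are hypothetical and do
not come from any driver.

```c
#include <assert.h>

/*
 * Hypothetical model: average latency in usecs added by tx and rx
 * interrupt moderation, ASSUMING the average wait is half of each
 * period, plus a fixed allowance for all other delays.
 */
static double
avg_latency_usecs(unsigned int tx_itr, unsigned int rx_itr,
    double usecs_per_unit, double other_usecs)
{
	assert(usecs_per_unit > 0.0 && other_usecs >= 0.0);
	return ((double)(tx_itr + rx_itr) * usecs_per_unit / 2.0 +
	    other_usecs);
}

/* Scale an interrupt rate linearly with link speed. */
static double
scaled_rate_hz(double base_hz, double base_gbps, double target_gbps)
{
	assert(base_hz > 0.0 && base_gbps > 0.0 && target_gbps > 0.0);
	return (base_hz * target_gbps / base_gbps);
}

/* Time available per interrupt, in usecs, at a given rate. */
static double
budget_usecs(double rate_hz)
{
	assert(rate_hz > 0.0);
	return (1000000.0 / rate_hz);
}
```

With units of 2 usecs, avg_latency_usecs(122, 62, 2.0, 20.0) is 204;
with units of 1 usec it is 112.  scaled_rate_hz(8000.0, 1.0, 40.0) is
320000, which leaves budget_usecs(320000.0) = 3.125 usecs per interrupt
on a single CPU.
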
Fix for em:

X diff -u2 if_em.c~ if_em.c
X --- if_em.c~	2015-09-28 06:29:35.000000000 +0000
X +++ if_em.c	2015-10-18 18:49:36.876699000 +0000
X @@ -609,8 +609,8 @@
X  	    em_tx_abs_int_delay_dflt);
X  	em_add_int_delay_sysctl(adapter, "itr",
X -	    "interrupt delay limit in usecs/4",
X +	    "interrupt delay limit in usecs",
X  	    &adapter->tx_itr,
X  	    E1000_REGISTER(hw, E1000_ITR),
X -	    DEFAULT_ITR);
X +	    1000000 / MAX_INTS_PER_SEC);
X  
X  	/* Sysctl for limiting the amount of work done in the taskqueue */

"delay limit" is fairly good wording.  Other parameters tend to give
long delays, but itr limits the longest delay due to interrupt
moderation to whatever the itr represents.

Bruce