Date: Wed, 17 Apr 2013 21:49:39 +1000 (EST)
From: Bruce Evans
To: Sepherosa Ziehau
Cc: "pyunyh@gmail.com", David Christensen, "freebsd-net@freebsd.org", bde, Bruce Evans
Subject: Re: bge(4) sysctl tuneables -- a blast from the past.
Message-ID: <20130417203212.K1099@besplex.bde.org>
References: <1365781568.1418.1.camel@localhost> <20130413200512.G1165@besplex.bde.org> <1366065356.1350.7.camel@localhost> <20130416152121.G904@besplex.bde.org>

On Tue, 16 Apr 2013, Sepherosa Ziehau wrote:

> On Tue, Apr 16, 2013 at 1:56 PM, Bruce Evans wrote:
>>
>> Technical bugs include:
>> - wrong defaults are claimed for
>>   *coal_ticks. The defaults are 150, but
>>   are claimed to be 150 milliseconds. These values are dimensionless,
>>   but since ticks take 1 microsecond each, 150 gives 150 microseconds,
>>   not 150 milliseconds.
>
> The real effect of TX coalesce ticks is confusing to me; TX interrupt
> does not come at the rate you have specified, at least for several
> PCI-e bge(4) I have tested. However, RX coalesce ticks work as
> expected.

It works for me on a 5701 (PCI-X) on a PCI-33 bus. Perhaps you are just
seeing rx interrupts mixed with tx interrupts. At least the FreeBSD
driver doesn't determine the interrupt type, so it always processes tx
activity when it gets an rx interrupt. I had to do the following to
avoid getting rx interrupts (without this the interrupt rate increased
by a factor of 3-4 with tx_coal_ticks = 150, from ~6.7 kHz to 19-24 kHz):
- I use ttcp for testing, so on the receiving system use ttcp -u -r so
  that it doesn't echo anything (otherwise it would "echo" with icmp
  port-unreachable unless firewalled).
- Use an old receiving system that doesn't support flow control. The
  system can't keep up, and drops about half of the packets, so if it did
  flow control then there would be a lot of rx interrupts.

> Here is how the tests were conducted:
> - Send only test, no RX
> - Each packet consume only one BD; UDP datagram, using hardware
>   checksum offloading
> - TX coalesce BDs is set to 0, so only TX coalesce ticks have effect
>
> The interrupt rate I had got seemed to be related to packet size?! I
> had tested two TX coalesce ticks settings:
> (the result I had recorded was using BCM5720)

This might be due to larger packet size causing less rx activity.

> The first setting was 1023us; the first col is UDP data size, the
> second col is rough interrupt rate
> 18B 667/s
> 64B 611/s

Oops, this doesn't look like rx activity. We expect a rate of 977 Hz,
possibly increased significantly by tx activity. I get 996-1004 here
(1023us is actually 1000?).
> 128B 538/s
> 256B 432/s
> 512B 311/s
> 1024B 194/s
> 1472B 146/s

I get 996-1004 for all of these.

Now I remember another problem that I work around using huge ifqueues
(10k or 20k entries) and/or busy-waiting in the send() in ttcp. It is
too easy for the tx to stop because there is nothing on the ifqueue to
refill it. Then it won't restart until the application starts sending
again. It is normal for all the queues to fill up. Then send returns
ENOBUFS and there is no good way for the application to handle this,
since select() on the queues not being full is broken (never supported).
Bad ways include:
- sleep for a while in the application. It is hard to know when to wake
  up, and impossible to wake up soon enough if timeout granularity is
  large.
- use huge ifqueues, so that long delays in the application work
- spin trying send().

> Second setting was 128us; the first col is UDP data size, the second
> col is rough interrupt rate
> 18B 1647/s
> 64B 1338/s

Now you should be getting much higher interrupt rates, unless something
can't keep up. I get 7904-7967 and 7906-7971.

> 128B 1030/s
> 256B 700/s
> 512B 430/s
> 1024B 235/s
> 1472B 169/s

I get little dependency on the packet size. At 1472B, the packet rate is
~58900. Everything on the tx side can keep up with that though not much
more, so no drop is expected.

> Well, to be frank, it does not make too much sense to me.

I found timestamps and counters for bge_*xeof() good for understanding
the flow of control. It is easy to generate too much data, so I keep
the tx and rx statistics separate and try to understand tx and rx
activity separately.
Some for tx with tx_coal_ticks = 1023 and packet size 18:

@ 976 1366197879.094951 454 25 349 1366197879.094976 105
@ 971 1366197879.095947 455 26 351 1366197879.095973 104
@ 972 1366197879.096945 455 25 355 1366197879.096970 100
@ 975 1366197879.097945 451 24 351 1366197879.097969 100
@ 974 1366197879.098943 443 24 337 1366197879.098967 106

The large numbers are absolute timestamps for bge_txeof() entry and exit.
The entries are separated by almost exactly 1000 us (not 1023 us as
expected).

The first numeric column gives the time in us between the previous exit
and this entry. Not very relevant here.

The fourth numeric column gives the time in us between this entry and
exit. Not very relevant here.

The third and final numeric columns give the ring indexes on entry and
exit, and the 5th numeric column gives the difference of these. These
are relevant here. Ideally the ring would be almost but not quite full
whenever we start, and the difference would be almost 512, but ttcp
apparently can't generate data fast enough to keep it full, so it has an
average of 350+ entries and the packet rate is 350+kpps. We don't want
the ring to be completely full when we start, since that means that we
are not interrupting enough to keep up with the generator and probably
also with the hardware.

This system can do 640+kpps when ideally configured, using
tx_coal_ticks = 1000000 and tx_coal_bds = 384. With tx_coal_ticks = 1023
(1000) and tx_coal_bds = 0, it couldn't do more than 512kpps. Its
current non-ideal configuration includes firewalling, sharing the bge
interrupt with rl, and not overclocking. In this configuration, the
above 2 tx_coal_* settings are almost equally good (tx_coal_ticks = 1023
reduces latency for reaping descriptors, but latency doesn't matter;
tx_coal_ticks = 100000 reduces interrupts when not under load, but when
not under load interrupt overhead isn't a problem).

Bruce
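
For anyone who wants to check the arithmetic on the sample trace above,
here is a minimal sketch in Python. It is not part of the driver or of
the instrumentation used for the trace. The field names, the helper
functions to_us() and parse(), and the assumption that each packet uses
exactly one ring descriptor (as in the quoted test setup) are all
illustrative assumptions; the column meanings follow the description
above.

```python
# Sample lines copied from the bge_txeof() trace in this message.
TRACE = """\
@ 976 1366197879.094951 454 25 349 1366197879.094976 105
@ 971 1366197879.095947 455 26 351 1366197879.095973 104
@ 972 1366197879.096945 455 25 355 1366197879.096970 100
@ 975 1366197879.097945 451 24 351 1366197879.097969 100
@ 974 1366197879.098943 443 24 337 1366197879.098967 106
"""


def to_us(stamp):
    """Convert 'seconds.microseconds' text to integer microseconds.

    Integer arithmetic is used because a double cannot hold these
    timestamps to 1 us precision.
    """
    sec, usec = stamp.split(".")
    return int(sec) * 1_000_000 + int(usec)


def parse(line):
    """Split one trace line into named fields (names are hypothetical)."""
    _, gap, t_in, idx_in, dur, diff, t_out, idx_out = line.split()
    return {
        "gap": int(gap),          # us from previous exit to this entry
        "t_in": to_us(t_in),      # absolute entry time, us
        "idx_in": int(idx_in),    # ring index on entry
        "dur": int(dur),          # us spent between entry and exit
        "diff": int(diff),        # idx_in - idx_out
        "t_out": to_us(t_out),    # absolute exit time, us
        "idx_out": int(idx_out),  # ring index on exit
    }


rows = [parse(line) for line in TRACE.splitlines()]

# Time between successive entries, i.e. the observed tx interrupt period.
periods = [b["t_in"] - a["t_in"] for a, b in zip(rows, rows[1:])]
mean_period_us = sum(periods) / len(periods)

# Average index difference per call (the 5th numeric column).
mean_diff = sum(r["diff"] for r in rows) / len(rows)

irq_rate_hz = 1e6 / mean_period_us
# Assumes one descriptor per packet, so descriptors per ms equals kpps.
pkt_rate_kpps = mean_diff / mean_period_us * 1000

print("entry-to-entry periods (us):", periods)
print("interrupt rate (Hz): %.0f" % irq_rate_hz)
print("mean index difference per call: %.1f" % mean_diff)
print("approx. packet rate (kpps): %.0f" % pkt_rate_kpps)
```

With only five samples the averages are rough, but they can be compared
against the "almost exactly 1000 us" spacing and the "350+" ring
occupancy mentioned above.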