From owner-freebsd-net@FreeBSD.ORG Fri Oct 22 23:59:15 2010
Message-ID: <4CC2254C.7070104@freebsd.org>
Date: Sat, 23 Oct 2010 10:59:08 +1100
From: Lawrence Stewart <lstewart@freebsd.org>
To: Sriram Gorti
References: <4CA5D1F0.3000307@freebsd.org> <4CA9B6AC.20403@freebsd.org> <4CBB6CE9.1030009@freebsd.org>
Cc: freebsd-net@freebsd.org, Andre Oppermann
Subject: Re: Question on TCP reassembly counter

On 10/22/10 18:10, Sriram Gorti wrote:
> Hi,
>
> On Mon, Oct 18, 2010 at 3:08 AM, Lawrence Stewart wrote:
>> On 10/04/10 22:12, Lawrence Stewart wrote:
>>> On 10/01/10 22:20, Andre Oppermann wrote:
>>>> On 01.10.2010 12:01, Sriram Gorti wrote:
>>>>> Hi,
>>>>>
>>>>> The following is an observation from testing our XLR/XLS network
>>>>> driver with 16 concurrent instances of netperf on FreeBSD-CURRENT.
>>>>> Based on this observation, I have a question I hope to get some
>>>>> insight into here.
>>>>>
>>>>> When running 16 concurrent netperf instances (each for about 20
>>>>> seconds), we found that after some number of runs performance
>>>>> degraded badly (almost by a factor of 5), and all subsequent runs
>>>>> remained that way. We started debugging this from the TCP side, as
>>>>> other driver tests ran fine for comparably long durations on the
>>>>> same board and software.
>>>>>
>>>>> netstat indicated the following:
>>>>>
>>>>> $ netstat -s -f inet -p tcp | grep discarded
>>>>>         0 discarded for bad checksums
>>>>>         0 discarded for bad header offset fields
>>>>>         0 discarded because packet too short
>>>>>         7318 discarded due to memory problems
>>>>>
>>>>> Then we traced the "discarded due to memory problems" to the
>>>>> following counter:
>>>>>
>>>>> $ sysctl -a net.inet.tcp.reass
>>>>> net.inet.tcp.reass.overflows: 7318
>>>>> net.inet.tcp.reass.maxqlen: 48
>>>>> net.inet.tcp.reass.cursegments: 1594   <-- corresponds to the
>>>>>                                            V_tcp_reass_qsize variable
>>>>> net.inet.tcp.reass.maxsegments: 1600
>>>>>
>>>>> Our guess for the need for reassembly (in this low-packet-loss test
>>>>> setup) was the lack of per-flow classification in the driver,
>>>>> causing it to spew incoming packets across the 16 h/w cpus instead
>>>>> of sending all packets of a flow to the same cpu.
>>>>> While we are working on addressing this driver limitation, we
>>>>> debugged further to see how/why V_tcp_reass_qsize grew (assuming
>>>>> that out-of-order segments should have dropped to zero at the end
>>>>> of the run). The counter was in fact growing from the initial runs
>>>>> onwards, but the performance degradation was only seen once it got
>>>>> close to maxsegments. We then also looked at vmstat to see how many
>>>>> reassembly segments were being lost, but there were no lost
>>>>> segments. We could not reconcile "no lost segments" with "growth of
>>>>> this counter across test runs".
>>>>
>>>> A patch is in the works to properly autoscale the reassembly queue
>>>> and should be committed shortly.
>>>>
>>>>> $ sysctl net.inet.tcp.reass ; vmstat -z | egrep "FREE|mbuf|tcpre"
>>>>> net.inet.tcp.reass.overflows: 0
>>>>> net.inet.tcp.reass.maxqlen: 48
>>>>> net.inet.tcp.reass.cursegments: 147
>>>>> net.inet.tcp.reass.maxsegments: 1600
>>>>> ITEM  SIZE  LIMIT  USED  FREE  REQ  FAIL  SLEEP
>>>>> mbuf_packet: 256, 0, 4096, 3200, 5653833, 0, 0
>>>>> mbuf: 256, 0, 1, 2048, 4766910, 0, 0
>>>>> mbuf_cluster: 2048, 25600, 7296, 6, 7297, 0, 0
>>>>> mbuf_jumbo_page: 4096, 12800, 0, 0, 0, 0, 0
>>>>> mbuf_jumbo_9k: 9216, 6400, 0, 0, 0, 0, 0
>>>>> mbuf_jumbo_16k: 16384, 3200, 0, 0, 0, 0, 0
>>>>> mbuf_ext_refcnt: 4, 0, 0, 0, 0, 0, 0
>>>>> tcpreass: 20, 1690, 0, 845, 1757074, 0, 0
>>>>>
>>>>> In view of these observations, my question is: is it possible for
>>>>> the V_tcp_reass_qsize variable to be updated unsafely on SMP? (The
>>>>> particular flavor of XLS used in the test had 4 cores with 4 h/w
>>>>> threads per core.) I see that the tcp_reass function assumes some
>>>>> lock is held, but I am not sure whether it is the per-socket lock
>>>>> or the global tcp lock.
>>>>
>>>> The updating of the global counter is indeed unsafe and becomes
>>>> obsolete with the autotuning patch.
>>>>
>>>> I have reviewed the patch and it is ready for commit. However,
>>>> lstewart@ is currently writing his thesis and has only very little
>>>> spare time. I'll send you the patch in private email so you can
>>>> continue your testing.
>>>
>>> Quick update on this: the patch is blocked while waiting for Jeff to
>>> review some related UMA changes. As soon as I get the all-clear I'll
>>> push everything into head.
>>
>> Revision 213913 of the svn head branch finally has all the patches.
>> If you encounter any additional odd behaviour related to reassembly,
>> or notice net.inet.tcp.reass.overflows increasing, please let me
>> know.
>
> Thanks for the fix. I tried it on XLR/XLS and the earlier tests pass
> now. net.inet.tcp.reass.overflows was always zero after the tests (and
> in the samples I took while the tests were running).

Great, thanks for testing.

> One observation though: net.inet.tcp.reass.cursegments was non-zero
> (it was just 1) after 30 rounds, where each round is (as earlier) 15
> concurrent instances of netperf for 20s. This was on the netserver
> side, and it was zero before the netperf runs. On the other hand,
> Andre told me (in a separate mail) that this counter is not relevant
> any more - so should I just ignore it?

It's relevant, just not guaranteed to be 100% accurate at any given
point in time. The value is calculated from synchronised access to the
UMA zone stats and unsynchronised access to the UMA per-cpu zone stats.
The latter is safe, but means the overall result can be inaccurate
because stale data may be used. The accuracy vs. overhead tradeoff was
deemed worthwhile for informational counters like this one.
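To make that concrete, here's a rough userspace sketch of the two
approaches discussed in this thread. It's hypothetical illustration
code, not the actual UMA or tcp_reass implementation, and the names
(racy_qsize, percpu_sum, NTHREADS) are made up: racy_qsize mimics a
single global counter that every CPU updates without synchronisation,
while the percpu[] slots mimic per-CPU stats that only their owner
writes and that a reader sums without locking.

/*
 * Hypothetical sketch, not kernel code: contrast an unsynchronised
 * shared counter with per-thread slots summed without locks.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4
#define OPS      1000000

static volatile long racy_qsize;            /* shared, updated without locks */
static volatile long percpu[NTHREADS][16];  /* slot [i][0] per thread, padded
                                               out to separate cache lines   */

static void *
worker(void *arg)
{
        long id = (intptr_t)arg;

        for (int i = 0; i < OPS; i++) {
                racy_qsize++;        /* unsynchronised read-modify-write */
                percpu[id][0]++;     /* only this thread writes this slot */
                racy_qsize--;
                percpu[id][0]--;
        }
        return (NULL);
}

/* Unsynchronised sum of the per-thread slots, like a cheap stats snapshot. */
static long
percpu_sum(void)
{
        long total = 0;

        for (int i = 0; i < NTHREADS; i++)
                total += percpu[i][0];
        return (total);
}

int
main(void)
{
        pthread_t t[NTHREADS];

        for (intptr_t i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);

        /* Every increment was paired with a decrement, so both "should" be 0. */
        printf("racy global counter: %ld\n", racy_qsize);
        printf("per-thread sum:      %ld\n", percpu_sum());
        return (0);
}

Built with something like "cc -pthread", the racy counter typically
finishes non-zero because concurrent read-modify-write updates get lost
(the same class of problem that let the old V_tcp_reass_qsize drift
across your runs), while the per-thread sum settles back to zero once
the workers stop. The kernel additionally reads the per-cpu values
while other CPUs are still updating them, which is why a cursegments
snapshot can be momentarily stale.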
That being said, I would not expect the value to remain persistently at
1 after all TCP activity has finished on the machine. It won't affect
performance, but I'm curious to know if the calculation method has a
flaw. I'll try to reproduce locally, but can you please confirm if the
value stays at 1 even after many minutes of no TCP activity?

Cheers,
Lawrence