From owner-freebsd-net@FreeBSD.ORG  Mon Oct  4 11:28:30 2010
Date: Mon, 04 Oct 2010 22:12:44 +1100
From: Lawrence Stewart <lstewart@freebsd.org>
To: Andre Oppermann
Cc: freebsd-net@freebsd.org, Sriram Gorti
Subject: Re: Question on TCP reassembly counter
Message-ID: <4CA9B6AC.20403@freebsd.org>
In-Reply-To: <4CA5D1F0.3000307@freebsd.org>

On 10/01/10 22:20, Andre Oppermann wrote:
> On 01.10.2010 12:01, Sriram Gorti wrote:
>> Hi,
>>
>> The following is an observation made while testing our XLR/XLS
>> network driver with 16 concurrent instances of netperf on
>> FreeBSD-CURRENT. Based on this observation, I have a question on
>> which I hope to get some understanding from here.
>>
>> When running 16 concurrent netperf instances (each for about 20
>> seconds), we found that after some number of runs performance
>> degraded badly (almost by a factor of 5), and all subsequent runs
>> remained that slow. We started debugging this from the TCP side, as
>> other driver tests were doing fine for comparably long durations on
>> the same board and software.
>>
>> netstat indicated the following:
>>
>> $ netstat -s -f inet -p tcp | grep discarded
>>         0 discarded for bad checksums
>>         0 discarded for bad header offset fields
>>         0 discarded because packet too short
>>         7318 discarded due to memory problems
>>
>> We then traced the "discarded due to memory problems" to the
>> following counter:
>>
>> $ sysctl -a net.inet.tcp.reass
>> net.inet.tcp.reass.overflows: 7318
>> net.inet.tcp.reass.maxqlen: 48
>> net.inet.tcp.reass.cursegments: 1594  <--- corresponds to the
>>                                            V_tcp_reass_qsize variable
>> net.inet.tcp.reass.maxsegments: 1600
>>
>> Our guess for the need for reassembly (in this low-packet-loss test
>> setup) was the lack of per-flow classification in the driver, which
>> causes it to spread incoming packets across the 16 h/w CPUs instead
>> of delivering all packets of a flow to the same CPU. While we work on
>> addressing this driver limitation, we debugged further to see how and
>> why V_tcp_reass_qsize grew (assuming that the out-of-order segment
>> count should have dropped to zero at the end of each run). We saw
>> that the counter was in fact growing from the initial runs onwards,
>> but the performance degradation only appeared once it got close to
>> maxsegments. We then looked at vmstat as well, to see how many of the
>> reassembly segments were being lost. But no segments were lost, and
>> we could not reconcile "no lost segments" with the growth of this
>> counter across test runs.
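As an aside on the per-flow classification point: a multi-queue driver
typically hashes the TCP/IP 4-tuple of each received packet and uses the
result to pick the receive CPU, so every segment of a given flow is
processed in order on one CPU and the reassembly queue is barely touched
in a loss-free test. Below is a rough userland sketch of the idea; it is
purely illustrative and not the XLR/XLS driver code, and the FNV-1a hash
just stands in for whatever hash the hardware or driver actually provides.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative flow key: the fields a driver would pull out of the
 * IP and TCP headers of each received packet.
 */
struct flow_key {
	uint32_t src_ip;
	uint32_t dst_ip;
	uint16_t src_port;
	uint16_t dst_port;
};

/*
 * Toy 32-bit FNV-1a hash over the flow key.  RSS-capable hardware would
 * typically use a Toeplitz hash here; any hash works as long as it is a
 * pure function of the 4-tuple.
 */
static uint32_t
flow_hash(const struct flow_key *k)
{
	const uint8_t *p = (const uint8_t *)k;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < sizeof(*k); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return (h);
}

int
main(void)
{
	const uint32_t ncpus = 16;	/* 4 cores x 4 h/w threads on the XLS */
	struct flow_key k = {
		.src_ip = 0x0a000001, .dst_ip = 0x0a000002,
		.src_port = 12345, .dst_port = 80,
	};

	/*
	 * The same flow always maps to the same CPU, so its segments are
	 * handled in arrival order and rarely need reassembly.
	 */
	printf("flow -> cpu %u\n", flow_hash(&k) % ncpus);
	return (0);
}

With something like that in place, out-of-order arrival within a flow
should be rare and the reassembly machinery should mostly stay idle.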
>
> A patch is in the works to properly autoscale the reassembly queue
> and should be committed shortly.
>
>> $ sysctl net.inet.tcp.reass ; vmstat -z | egrep "FREE|mbuf|tcpre"
>> net.inet.tcp.reass.overflows: 0
>> net.inet.tcp.reass.maxqlen: 48
>> net.inet.tcp.reass.cursegments: 147
>> net.inet.tcp.reass.maxsegments: 1600
>> ITEM                SIZE   LIMIT   USED   FREE      REQ  FAIL  SLEEP
>> mbuf_packet:         256,      0,  4096,  3200, 5653833,    0,     0
>> mbuf:                256,      0,     1,  2048, 4766910,    0,     0
>> mbuf_cluster:       2048,  25600,  7296,     6,    7297,    0,     0
>> mbuf_jumbo_page:    4096,  12800,     0,     0,       0,    0,     0
>> mbuf_jumbo_9k:      9216,   6400,     0,     0,       0,    0,     0
>> mbuf_jumbo_16k:    16384,   3200,     0,     0,       0,    0,     0
>> mbuf_ext_refcnt:       4,      0,     0,     0,       0,    0,     0
>> tcpreass:             20,   1690,     0,   845, 1757074,    0,     0
>>
>> In view of these observations, my question is: is it possible for the
>> V_tcp_reass_qsize variable to be unsafely updated on SMP? (The
>> particular flavor of XLS used in this test had 4 cores with 4 h/w
>> threads per core.) I see that the tcp_reass function assumes some
>> lock is held, but I am not sure whether it is the per-socket or the
>> global TCP lock.
>
> The updating of the global counter is indeed unsafe and becomes
> obsolete with the autotuning patch.
>
> The patch has been reviewed by me and is ready to commit. However,
> lstewart@ is currently writing his thesis and has only very little
> spare time. I'll send you the patch in private email so you can
> continue your testing.

Quick update on this: the patch is blocked while waiting for Jeff to
review some related UMA changes. As soon as I get the all clear I'll
push everything into head.

Cheers,
Lawrence
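P.S. To make the counter race Andre mentions concrete: the global counter
is updated without any global synchronisation (each connection only holds
its own lock), so two CPUs adjusting it at the same time can lose updates
and the counter drifts even though no segments are actually leaked, which
matches the "growth with no lost segments" above. Here is a small,
self-contained userland sketch of the effect; it is illustrative pthreads
code, not the kernel code, and the atomic variant merely plays the role
that an atomic_add_int(9)-style update would play in the kernel.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   1000000

static int unsafe_count;	/* plain ++/--, like the unprotected counter */
static int atomic_count;	/* same traffic, but atomic read-modify-write */

static void *
worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < NITERS; i++) {
		unsafe_count++;		/* non-atomic RMW: updates can be lost */
		unsafe_count--;
		__sync_fetch_and_add(&atomic_count, 1);	/* GCC builtin */
		__sync_fetch_and_sub(&atomic_count, 1);
	}
	return (NULL);
}

int
main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);

	/* Both counters "should" end at 0; the unprotected one usually doesn't. */
	printf("plain: %d  atomic: %d\n", unsafe_count, atomic_count);
	return (0);
}

Build with "cc -pthread" and the plain counter typically ends up nonzero
after a few runs. The autoscaling patch removes the global counter
altogether, so this particular race disappears along with it.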