From: Lawrence Stewart <lstewart@freebsd.org>
Date: Mon, 18 Oct 2010 08:38:49 +1100
To: Andre Oppermann
Cc: freebsd-net@freebsd.org, Sriram Gorti
Subject: Re: Question on TCP reassembly counter

On 10/04/10 22:12, Lawrence Stewart wrote:
> On 10/01/10 22:20, Andre Oppermann wrote:
>> On 01.10.2010 12:01, Sriram Gorti wrote:
>>> Hi,
>>>
>>> The following is an observation from testing our XLR/XLS network
>>> driver with 16 concurrent instances of netperf on FreeBSD-CURRENT.
>>> Based on this observation, I have a question that I hope to get some
>>> help understanding here.
>>>
>>> When running 16 concurrent netperf instances (each for about 20
>>> seconds), we found that after some number of runs performance
>>> degraded badly (almost by a factor of 5), and it stayed degraded for
>>> all subsequent runs. We started debugging this from the TCP side, as
>>> other driver tests were doing fine for comparably long durations on
>>> the same board and software.
>>>
>>> netstat indicated the following:
>>>
>>> $ netstat -s -f inet -p tcp | grep discarded
>>> 0 discarded for bad checksums
>>> 0 discarded for bad header offset fields
>>> 0 discarded because packet too short
>>> 7318 discarded due to memory problems
>>>
>>> We then traced "discarded due to memory problems" to the following
>>> counter:
>>>
>>> $ sysctl -a net.inet.tcp.reass
>>> net.inet.tcp.reass.overflows: 7318
>>> net.inet.tcp.reass.maxqlen: 48
>>> net.inet.tcp.reass.cursegments: 1594  <-- corresponds to the
>>>                                           V_tcp_reass_qsize variable
>>> net.inet.tcp.reass.maxsegments: 1600
>>>
>>> Our guess for why reassembly was needed at all (in this
>>> low-packet-loss test setup) was the lack of per-flow classification
>>> in the driver, which sprays incoming packets across the 16 h/w CPUs
>>> instead of sending all packets of a flow to the same CPU. While we
>>> work on addressing this driver limitation, we debugged further to
>>> see how and why V_tcp_reass_qsize grew (assuming that the count of
>>> out-of-order segments should have dropped to zero at the end of the
>>> run).
>>> It was seen that this counter was actually growing from the initial
>>> runs onwards, but the performance degradation only appeared once it
>>> got close to maxsegments. We then also looked at vmstat to see how
>>> many of the reassembly segments were being lost, but no segments
>>> were lost. We could not reconcile "no lost segments" with the growth
>>> of this counter across test runs.
>>
>> A patch is in the works to properly autoscale the reassembly queue
>> and should be committed shortly.
>>
>>> $ sysctl net.inet.tcp.reass ; vmstat -z | egrep "FREE|mbuf|tcpre"
>>> net.inet.tcp.reass.overflows: 0
>>> net.inet.tcp.reass.maxqlen: 48
>>> net.inet.tcp.reass.cursegments: 147
>>> net.inet.tcp.reass.maxsegments: 1600
>>> ITEM              SIZE   LIMIT   USED   FREE      REQ  FAIL  SLEEP
>>> mbuf_packet:       256,      0,  4096,  3200, 5653833,    0,     0
>>> mbuf:              256,      0,     1,  2048, 4766910,    0,     0
>>> mbuf_cluster:     2048,  25600,  7296,     6,    7297,    0,     0
>>> mbuf_jumbo_page:  4096,  12800,     0,     0,       0,    0,     0
>>> mbuf_jumbo_9k:    9216,   6400,     0,     0,       0,    0,     0
>>> mbuf_jumbo_16k:  16384,   3200,     0,     0,       0,    0,     0
>>> mbuf_ext_refcnt:     4,      0,     0,     0,       0,    0,     0
>>> tcpreass:           20,   1690,     0,   845, 1757074,    0,     0
>>>
>>> In view of these observations, my question is: is it possible for
>>> the V_tcp_reass_qsize variable to be updated unsafely on SMP? (The
>>> particular flavor of XLS used in the test had 4 cores with 4 h/w
>>> threads per core.) I see that the tcp_reass function assumes some
>>> lock is held, but I am not sure whether it is the per-socket lock or
>>> the global TCP lock.
>>
>> The updating of the global counter is indeed unsafe and becomes
>> obsolete with the autotuning patch.
>>
>> I have reviewed the patch and it is ready for commit. However,
>> lstewart@ is currently writing his thesis and has only very little
>> spare time. I'll send you the patch in private email so you can
>> continue your testing.
>
> Quick update on this: the patch is blocked while waiting for Jeff to
> review some related UMA changes. As soon as I get the all clear I'll
> push everything into head.

Revision 213913 of the svn head branch finally has all of the patches.
If you encounter any additional odd behaviour related to reassembly, or
notice net.inet.tcp.reass.overflows increasing, please let me know.

Cheers,
Lawrence
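
As background to the driver limitation Sriram describes above (segments
of a single flow being sprayed across the 16 hardware threads and so
arriving out of order at TCP), the sketch below shows the general shape
of per-flow RX steering. It is purely illustrative and not the XLR/XLS
driver code; real drivers normally use the NIC's Toeplitz/RSS hash
rather than a hand-rolled one, and the function names here are made up
for the example.

/*
 * Illustrative only -- not the XLR/XLS driver code.  Pick a consistent
 * CPU/RX queue per TCP flow by hashing the 4-tuple, so the segments of
 * one connection are not spread across hardware threads and reordered.
 */
#include <stdint.h>

static inline uint32_t
flow_hash(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport)
{
	uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);

	/* Cheap avalanche so similar tuples still spread across queues. */
	h ^= h >> 16;
	h *= 0x85ebca6bU;
	h ^= h >> 13;
	return (h);
}

static inline unsigned int
flow_to_cpu(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport,
    unsigned int ncpus)
{
	/* All packets of a flow map to one CPU; different flows spread. */
	return (flow_hash(saddr, daddr, sport, dport) % ncpus);
}

With steering along these lines, the segments of a given connection are
delivered in order on one CPU, and the reassembly queue is only
exercised by genuine loss or reordering in the network.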
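The counter drift itself is a classic lost-update race: tcp_reass()
adjusted the global counter while holding only a per-connection lock,
so increments and decrements running on different CPUs can overwrite
each other and the value drifts away from zero even though every
segment is accounted for. The userland program below is a minimal
reproduction of that effect under assumed thread and iteration counts,
not the kernel code; in the kernel the equivalent fix would be atomic
operations or a per-CPU counter, and the autotuning patch makes the
global counter obsolete altogether, as Andre notes above.

/*
 * Minimal userland reproduction (assumed example, not the kernel code)
 * of the lost-update race on a shared counter.  Each thread performs
 * perfectly balanced increments and decrements, yet the plain counter
 * usually finishes nonzero -- the same "growth with no lost segments"
 * effect seen with V_tcp_reass_qsize.  The atomic counter always
 * returns to zero.  Build with: cc -O2 -pthread race.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

static int plain_qsize;            /* updated without synchronisation */
static atomic_int atomic_qsize;    /* updated with atomic RMW ops */

static void *
worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		plain_qsize++;                       /* racy read-modify-write */
		atomic_fetch_add(&atomic_qsize, 1);
		plain_qsize--;                       /* racy read-modify-write */
		atomic_fetch_sub(&atomic_qsize, 1);
	}
	return (NULL);
}

int
main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);

	printf("plain counter:  %d (often nonzero)\n", plain_qsize);
	printf("atomic counter: %d (always zero)\n",
	    atomic_load(&atomic_qsize));
	return (0);
}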