From owner-freebsd-net@FreeBSD.ORG Fri Oct 22 07:10:53 2010
Date: Fri, 22 Oct 2010 12:40:52 +0530
From: Sriram Gorti <gsriram@gmail.com>
To: Lawrence Stewart
Cc: freebsd-net@freebsd.org, Andre Oppermann
Subject: Re: Question on TCP reassembly counter
List-Id: Networking and TCP/IP with FreeBSD

Hi,

On Mon, Oct 18, 2010 at 3:08 AM, Lawrence Stewart wrote:
> On 10/04/10 22:12, Lawrence Stewart wrote:
>> On 10/01/10 22:20, Andre Oppermann wrote:
>>> On 01.10.2010 12:01, Sriram Gorti wrote:
>>>> Hi,
>>>>
>>>> What follows is an observation made while testing our XLR/XLS
>>>> network driver with 16 concurrent instances of netperf on
>>>> FreeBSD-CURRENT. Based on this observation, I have a question that
>>>> I hope the list can help me understand.
>>>>
>>>> When running 16 concurrent netperf instances (each for about 20
>>>> seconds), we found that after some number of runs performance
>>>> degraded badly (almost by a factor of 5) and stayed that way for
>>>> all subsequent runs. We started debugging this from the TCP side,
>>>> since other driver tests were doing fine for comparably long
>>>> durations on the same board and software.
>>>>
>>>> netstat indicated the following:
>>>>
>>>> $ netstat -s -f inet -p tcp | grep discarded
>>>>         0 discarded for bad checksums
>>>>         0 discarded for bad header offset fields
>>>>         0 discarded because packet too short
>>>>         7318 discarded due to memory problems
>>>>
>>>> We then traced the "discarded due to memory problems" to the
>>>> following counter:
>>>>
>>>> $ sysctl -a net.inet.tcp.reass
>>>> net.inet.tcp.reass.overflows: 7318
>>>> net.inet.tcp.reass.maxqlen: 48
>>>> net.inet.tcp.reass.cursegments: 1594  <-- corresponds to the
>>>>                                           V_tcp_reass_qsize variable
>>>> net.inet.tcp.reass.maxsegments: 1600
>>>>
>>>> Our guess as to why reassembly was needed at all (in this
>>>> low-packet-loss test setup) was the lack of per-flow classification
>>>> in the driver, which makes it spray incoming packets across the 16
>>>> h/w cpus instead of sending all packets of a flow to the same cpu.
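To make that driver limitation concrete: the usual fix is to hash the
connection 4-tuple in the RX path so that every segment of a flow lands
on the same CPU. A minimal userland sketch of such a classifier follows
(hypothetical names and hash choice -- this is not the actual XLR/XLS
driver code):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NCPUS 16                /* 4 cores x 4 h/w threads on this XLS */

struct flow {                   /* TCP/IPv4 4-tuple, no padding */
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
};

/* FNV-1a over the 4-tuple; any reasonable hash would do. */
static uint32_t
flow_hash(const struct flow *f)
{
        const uint8_t *p = (const uint8_t *)f;
        uint32_t h = 2166136261u;
        size_t i;

        for (i = 0; i < sizeof(*f); i++) {
                h ^= p[i];
                h *= 16777619u;
        }
        return (h);
}

/* Pick the CPU that will handle every packet of this flow. */
static unsigned
flow_to_cpu(const struct flow *f)
{
        return (flow_hash(f) % NCPUS);
}

int
main(void)
{
        struct flow f = { 0x0a000001, 0x0a000002, 12345, 80 };

        printf("flow -> cpu %u\n", flow_to_cpu(&f));
        return (0);
}

With steering like this, segments of a connection arrive in order on
one CPU, so on a clean network the reassembly path is hardly exercised.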
>>>> While we are working on addressing this driver limitation, we
>>>> debugged further to see how/why V_tcp_reass_qsize grew (assuming
>>>> that out-of-order segments should have dropped to zero at the end
>>>> of a run). It turned out that the counter had actually been growing
>>>> from the initial runs onwards, but the performance degradation only
>>>> appeared once it got close to maxsegments. We then looked at vmstat
>>>> as well, to see how many of the reassembly segments were lost --
>>>> but no segments were lost. We could not reconcile "no lost
>>>> segments" with the growth of this counter across test runs.
>>>
>>> A patch is in the works to properly autoscale the reassembly queue
>>> and should be committed shortly.
>>>
>>>> $ sysctl net.inet.tcp.reass ; vmstat -z | egrep "FREE|mbuf|tcpre"
>>>> net.inet.tcp.reass.overflows: 0
>>>> net.inet.tcp.reass.maxqlen: 48
>>>> net.inet.tcp.reass.cursegments: 147
>>>> net.inet.tcp.reass.maxsegments: 1600
>>>> ITEM              SIZE  LIMIT  USED  FREE      REQ FAIL SLEEP
>>>> mbuf_packet:       256,     0, 4096, 3200, 5653833,   0,    0
>>>> mbuf:              256,     0,    1, 2048, 4766910,   0,    0
>>>> mbuf_cluster:     2048, 25600, 7296,    6,    7297,   0,    0
>>>> mbuf_jumbo_page:  4096, 12800,    0,    0,       0,   0,    0
>>>> mbuf_jumbo_9k:    9216,  6400,    0,    0,       0,   0,    0
>>>> mbuf_jumbo_16k:  16384,  3200,    0,    0,       0,   0,    0
>>>> mbuf_ext_refcnt:     4,     0,    0,    0,       0,   0,    0
>>>> tcpreass:           20,  1690,    0,  845, 1757074,   0,    0
>>>>
>>>> In view of these observations, my question is: is it possible for
>>>> the V_tcp_reass_qsize variable to be updated unsafely on SMP? (The
>>>> particular flavor of XLS used in the test had 4 cores with 4 h/w
>>>> threads per core.) I see that tcp_reass() assumes some lock is
>>>> held, but I am not sure whether it is the per-socket lock or the
>>>> global tcp lock.
>>>
>>> The updating of the global counter is indeed unsafe and becomes
>>> obsolete with the autotuning patch.
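For the archives, this is the classic lost-update race: the unlocked
increment is a load/add/store sequence, so two CPUs updating the
counter at the same time can overwrite each other's result, and with
decrements racing as well the counter can drift away from the true
queue length (lost decrements would drift it upward, matching what we
saw). A small userland demonstration of the effect (illustrative names
only, not the kernel code; build with "cc -O0 race.c -lpthread"):

#include <pthread.h>
#include <stdio.h>

#define LOOPS 10000000

static volatile int qsize;      /* stand-in for V_tcp_reass_qsize */

static void *
worker(void *arg)
{
        int i;

        (void)arg;
        for (i = 0; i < LOOPS; i++)
                qsize++;        /* read-modify-write, not atomic */
        return (NULL);
}

int
main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Expect 2*LOOPS; on SMP this typically prints less. */
        printf("qsize = %d (expected %d)\n", qsize, 2 * LOOPS);
        return (0);
}

In the kernel the update could be made safe with atomic(9), e.g.
atomic_add_int(), but as noted above the autotuning patch makes the
global counter obsolete anyway.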
>>> The patch is reviewed by me and ready for commit. However,
>>> lstewart@ is currently writing his thesis and has only very little
>>> spare time. I'll send you the patch in private email so you can
>>> continue your testing.
>>
>> A quick update on this: the patch is blocked while waiting for Jeff
>> to review some related UMA changes. As soon as I get the all clear
>> I'll push everything into head.
>
> Revision 213913 of the svn head branch finally has all the patches.
> If you encounter any additional odd behaviour related to reassembly,
> or notice net.inet.tcp.reass.overflows increasing, please let me
> know.
>
> Cheers,
> Lawrence

Thanks for the fix. I tried it on XLR/XLS and the earlier tests pass
now. net.inet.tcp.reass.overflows was always zero after the tests (and
in the samples I took while the tests were running).

One observation, though: net.inet.tcp.reass.cursegments was non-zero
(it was just 1) after 30 rounds, where each round is (as earlier) 16
concurrent instances of netperf for 20s. This was on the netserver
side, and the counter was zero before the netperf runs. On the other
hand, Andre told me (in a separate mail) that this counter is not
relevant anymore -- so should I just ignore it?

---
Sriram Gorti
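P.S. In case it helps anyone reproducing this: the sampling mentioned
above can be done along the following lines (a minimal sketch using
sysctlbyname(3), assuming the counter is still exported as an int; the
one-second interval is arbitrary):

#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        int cur;
        size_t len;

        for (;;) {
                len = sizeof(cur);
                if (sysctlbyname("net.inet.tcp.reass.cursegments",
                    &cur, &len, NULL, 0) == -1) {
                        perror("sysctlbyname");
                        exit(1);
                }
                printf("reass.cursegments = %d\n", cur);
                sleep(1);
        }
}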