From owner-freebsd-net@FreeBSD.ORG  Mon Oct  4 11:28:30 2010
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 16DE71065670
	for <freebsd-net@freebsd.org>; Mon,  4 Oct 2010 11:28:30 +0000 (UTC)
	(envelope-from lstewart@freebsd.org)
Received: from lauren.room52.net (lauren.room52.net [210.50.193.198])
	by mx1.freebsd.org (Postfix) with ESMTP id A17198FC12
	for <freebsd-net@freebsd.org>; Mon,  4 Oct 2010 11:28:29 +0000 (UTC)
Received: from lawrence1.loshell.room52.net
	(ppp59-167-184-191.static.internode.on.net [59.167.184.191])
	by lauren.room52.net (Postfix) with ESMTPSA id 89E387E87B;
	Mon,  4 Oct 2010 22:12:54 +1100 (EST)
Message-ID: <4CA9B6AC.20403@freebsd.org>
Date: Mon, 04 Oct 2010 22:12:44 +1100
From: Lawrence Stewart <lstewart@freebsd.org>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-AU;
	rv:1.9.2.9) Gecko/20100913 Lightning/1.0b2 Thunderbird/3.1.3
MIME-Version: 1.0
To: Andre Oppermann <andre@freebsd.org>
References: <AANLkTikWWmrnBy_DGgSsDbh6NAzWGKCWiFPnCRkwoDRi@mail.gmail.com>
	<4CA5D1F0.3000307@freebsd.org>
In-Reply-To: <4CA5D1F0.3000307@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=0.0 required=5.0 tests=UNPARSEABLE_RELAY
	autolearn=unavailable version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on lauren.room52.net
Cc: freebsd-net@freebsd.org, Sriram Gorti <gsriram@gmail.com>
Subject: Re: Question on TCP reassembly counter
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
X-List-Received-Date: Mon, 04 Oct 2010 11:28:30 -0000

On 10/01/10 22:20, Andre Oppermann wrote:
> On 01.10.2010 12:01, Sriram Gorti wrote:
>> Hi,
>>
>> The following is an observation from testing our XLR/XLS network
>> driver with 16 concurrent instances of netperf on FreeBSD-CURRENT.
>> Based on this observation, I have a question that I hope to get
>> some insight on here.
>>
>> When running 16 concurrent netperf instances (each for about 20
>> seconds), it was found that after some number of runs performance
>> degraded badly (almost by a factor of 5). All subsequent runs remained
>> so. We started debugging from the TCP side, since other driver tests
>> ran fine for comparably long durations on the same board and software.
>>
>> netstat indicated the following:
>>
>> $ netstat -s -f inet -p tcp | grep discarded
>>                  0 discarded for bad checksums
>>                  0 discarded for bad header offset fields
>>                  0 discarded because packet too short
>>                  7318 discarded due to memory problems
>>
>> We then traced the "discarded due to memory problems" count to the
>> following counter:
>>
>> $ sysctl -a net.inet.tcp.reass
>> net.inet.tcp.reass.overflows: 7318
>> net.inet.tcp.reass.maxqlen: 48
>> net.inet.tcp.reass.cursegments: 1594    <--- corresponds to V_tcp_reass_qsize
>> net.inet.tcp.reass.maxsegments: 1600
>>
>> Our guess for why reassembly was needed (in this low-packet-loss test
>> setup) was the lack of per-flow classification in the driver, which
>> causes it to spread incoming packets across the 16 h/w cpus instead of
>> sending all packets of a flow to the same cpu. While we work on
>> addressing this driver limitation, we debugged further to see how/why
>> V_tcp_reass_qsize grew (assuming that the number of out-of-order
>> segments should have dropped to zero at the end of the run). The
>> counter was in fact growing from the initial runs onward, but the
>> performance degradation appeared only once it came close to
>> maxsegments. We then looked at vmstat to see how many of the
>> reassembly segments were lost, but none were. We could not reconcile
>> "no lost segments" with "growth of this counter across test runs".
> 
> A patch is in the works to properly autoscale the reassembly queue
> and should be committed shortly.
> 
>> $ sysctl net.inet.tcp.reass ; vmstat -z | egrep "FREE|mbuf|tcpre"
>> net.inet.tcp.reass.overflows: 0
>> net.inet.tcp.reass.maxqlen: 48
>> net.inet.tcp.reass.cursegments: 147
>> net.inet.tcp.reass.maxsegments: 1600
>> ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
>> mbuf_packet:            256,      0,    4096,    3200, 5653833,   0,   0
>> mbuf:                   256,      0,       1,    2048, 4766910,   0,   0
>> mbuf_cluster:          2048,  25600,    7296,       6,    7297,   0,   0
>> mbuf_jumbo_page:       4096,  12800,       0,       0,       0,   0,   0
>> mbuf_jumbo_9k:         9216,   6400,       0,       0,       0,   0,   0
>> mbuf_jumbo_16k:       16384,   3200,       0,       0,       0,   0,   0
>> mbuf_ext_refcnt:          4,      0,       0,       0,       0,   0,   0
>> tcpreass:                20,   1690,       0,     845, 1757074,   0,   0
>>
>> In view of these observations, my question is: is it possible for the
>> V_tcp_reass_qsize variable to be unsafely updated on SMP? (The
>> particular flavor of XLS used in the test had 4 cores with 4
>> h/w threads/core.) I see that the tcp_reass function assumes some lock
>> is held, but I am not sure whether it is the per-socket or the global
>> tcp lock.
> 
> The updating of the global counter is indeed unsafe and becomes obsolete
> with the autotuning patch.
> 
> I have reviewed the patch and it is ready for commit.  However, lstewart@
> is currently writing his thesis and has very little spare time.  I'll
> send you the patch in private email so you can continue your testing.

Quick update on this: patch is blocked while waiting for Jeff to review
some related UMA changes. As soon as I get the all clear I'll push
everything into head.

Cheers,
Lawrence