From owner-freebsd-net@FreeBSD.ORG Fri Oct 22 23:59:15 2010
Message-ID: <4CC2254C.7070104@freebsd.org>
Date: Sat, 23 Oct 2010 10:59:08 +1100
From: Lawrence Stewart <lstewart@freebsd.org>
To: Sriram Gorti
References: <4CA5D1F0.3000307@freebsd.org> <4CA9B6AC.20403@freebsd.org> <4CBB6CE9.1030009@freebsd.org>
Cc: freebsd-net@freebsd.org, Andre Oppermann
Subject: Re: Question on TCP reassembly counter

On 10/22/10 18:10, Sriram Gorti wrote:
> Hi,
>
> On Mon, Oct 18, 2010 at 3:08 AM, Lawrence Stewart wrote:
>> On 10/04/10 22:12, Lawrence Stewart wrote:
>>> On 10/01/10 22:20, Andre Oppermann wrote:
>>>> On 01.10.2010 12:01, Sriram Gorti wrote:
>>>>> Hi,
>>>>>
>>>>> The following is an observation from testing our XLR/XLS network
>>>>> driver with 16 concurrent instances of netperf on FreeBSD-CURRENT.
>>>>> Based on this observation, I have a question I hope to get some
>>>>> insight into here.
>>>>>
>>>>> When running 16 concurrent netperf instances (each for about 20
>>>>> seconds), we found that after some number of runs performance
>>>>> degraded badly (almost by a factor of 5), and all subsequent runs
>>>>> remained that way. We started debugging this from the TCP side, as
>>>>> other driver tests ran fine for comparably long durations on the
>>>>> same board and software.
>>>>>
>>>>> netstat indicated the following:
>>>>>
>>>>> $ netstat -s -f inet -p tcp | grep discarded
>>>>>         0 discarded for bad checksums
>>>>>         0 discarded for bad header offset fields
>>>>>         0 discarded because packet too short
>>>>>         7318 discarded due to memory problems
>>>>>
>>>>> Then we traced the "discarded due to memory problems" to the
>>>>> following counter:
>>>>>
>>>>> $ sysctl -a net.inet.tcp.reass
>>>>> net.inet.tcp.reass.overflows: 7318
>>>>> net.inet.tcp.reass.maxqlen: 48
>>>>> net.inet.tcp.reass.cursegments: 1594   <-- corresponds to the
>>>>>                                            V_tcp_reass_qsize variable
>>>>> net.inet.tcp.reass.maxsegments: 1600
>>>>>
>>>>> Our guess for the need for reassembly (in this low-packet-loss test
>>>>> setup) was the lack of per-flow classification in the driver,
>>>>> causing it to spew incoming packets across the 16 h/w cpus instead
>>>>> of sending all packets of a flow to the same cpu.
>>>>> While we are working on addressing this driver limitation, we
>>>>> debugged further to see how/why V_tcp_reass_qsize grew (assuming
>>>>> that out-of-order segments should have dropped to zero at the end
>>>>> of the run). The counter was in fact growing from the initial runs
>>>>> onwards, but the performance degradation was only seen once it got
>>>>> close to maxsegments. We then also looked at vmstat to see how many
>>>>> reassembly segments were being lost, but there were no lost
>>>>> segments. We could not reconcile "no lost segments" with "growth of
>>>>> this counter across test runs".
>>>>
>>>> A patch is in the works to properly autoscale the reassembly queue
>>>> and should be committed shortly.
>>>>
>>>>> $ sysctl net.inet.tcp.reass ; vmstat -z | egrep "FREE|mbuf|tcpre"
>>>>> net.inet.tcp.reass.overflows: 0
>>>>> net.inet.tcp.reass.maxqlen: 48
>>>>> net.inet.tcp.reass.cursegments: 147
>>>>> net.inet.tcp.reass.maxsegments: 1600
>>>>> ITEM  SIZE  LIMIT  USED  FREE  REQ  FAIL  SLEEP
>>>>> mbuf_packet: 256, 0, 4096, 3200, 5653833, 0, 0
>>>>> mbuf: 256, 0, 1, 2048, 4766910, 0, 0
>>>>> mbuf_cluster: 2048, 25600, 7296, 6, 7297, 0, 0
>>>>> mbuf_jumbo_page: 4096, 12800, 0, 0, 0, 0, 0
>>>>> mbuf_jumbo_9k: 9216, 6400, 0, 0, 0, 0, 0
>>>>> mbuf_jumbo_16k: 16384, 3200, 0, 0, 0, 0, 0
>>>>> mbuf_ext_refcnt: 4, 0, 0, 0, 0, 0, 0
>>>>> tcpreass: 20, 1690, 0, 845, 1757074, 0, 0
>>>>>
>>>>> In view of these observations, my question is: is it possible for
>>>>> the V_tcp_reass_qsize variable to be updated unsafely on SMP? (The
>>>>> particular flavor of XLS used in the test had 4 cores with 4 h/w
>>>>> threads per core.) I see that the tcp_reass function assumes some
>>>>> lock is held, but I am not sure whether it is the per-socket lock
>>>>> or the global tcp lock.
>>>>
>>>> The updating of the global counter is indeed unsafe and becomes
>>>> obsolete with the autotuning patch.
>>>>
>>>> I have reviewed the patch and it is ready for commit. However,
>>>> lstewart@ is currently writing his thesis and has only very little
>>>> spare time. I'll send you the patch in private email so you can
>>>> continue your testing.
>>>
>>> Quick update on this: the patch is blocked while waiting for Jeff to
>>> review some related UMA changes. As soon as I get the all-clear I'll
>>> push everything into head.
>>
>> Revision 213913 of the svn head branch finally has all the patches.
>> If you encounter any additional odd behaviour related to reassembly,
>> or notice net.inet.tcp.reass.overflows increasing, please let me
>> know.
>
> Thanks for the fix. I tried it on XLR/XLS and the earlier tests pass
> now. net.inet.tcp.reass.overflows was always zero after the tests (and
> in the samples I took while the tests were running).

Great, thanks for testing.

> One observation though: net.inet.tcp.reass.cursegments was non-zero
> (it was just 1) after 30 rounds, where each round is (as earlier) 15
> concurrent instances of netperf for 20s. This was on the netserver
> side, and it was zero before the netperf runs. On the other hand,
> Andre told me (in a separate mail) that this counter is not relevant
> any more - so should I just ignore it?

It's relevant, just not guaranteed to be 100% accurate at any given
point in time. The value is calculated from synchronised access to the
UMA zone stats and unsynchronised access to the UMA per-cpu zone stats.
The latter is safe, but means the overall result can be inaccurate
because stale data may be used. The accuracy vs. overhead tradeoff was
deemed worthwhile for informational counters like this one.
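To make that concrete, here's a rough userspace sketch of the two
approaches discussed in this thread. It's hypothetical illustration
code, not the actual UMA or tcp_reass implementation, and the names
(racy_qsize, percpu_sum, NTHREADS) are made up: racy_qsize mimics a
single global counter that every CPU updates without synchronisation,
while the percpu[] slots mimic per-CPU stats that only their owner
writes and that a reader sums without locking.

/*
 * Hypothetical sketch, not kernel code: contrast an unsynchronised
 * shared counter with per-thread slots summed without locks.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4
#define OPS      1000000

static volatile long racy_qsize;            /* shared, updated without locks */
static volatile long percpu[NTHREADS][16];  /* slot [i][0] per thread, padded
                                               out to separate cache lines   */

static void *
worker(void *arg)
{
        long id = (intptr_t)arg;

        for (int i = 0; i < OPS; i++) {
                racy_qsize++;        /* unsynchronised read-modify-write */
                percpu[id][0]++;     /* only this thread writes this slot */
                racy_qsize--;
                percpu[id][0]--;
        }
        return (NULL);
}

/* Unsynchronised sum of the per-thread slots, like a cheap stats snapshot. */
static long
percpu_sum(void)
{
        long total = 0;

        for (int i = 0; i < NTHREADS; i++)
                total += percpu[i][0];
        return (total);
}

int
main(void)
{
        pthread_t t[NTHREADS];

        for (intptr_t i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);

        /* Every increment was paired with a decrement, so both "should" be 0. */
        printf("racy global counter: %ld\n", racy_qsize);
        printf("per-thread sum:      %ld\n", percpu_sum());
        return (0);
}

Built with something like "cc -pthread", the racy counter typically
finishes non-zero because concurrent read-modify-write updates get lost
(the same class of problem that let the old V_tcp_reass_qsize drift
across your runs), while the per-thread sum settles back to zero once
the workers stop. The kernel additionally reads the per-cpu values
while other CPUs are still updating them, which is why a cursegments
snapshot can be momentarily stale.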
That being said, I would not expect the value to remain persistently at
1 after all TCP activity has finished on the machine. It won't affect
performance, but I'm curious to know if the calculation method has a
flaw. I'll try to reproduce locally, but can you please confirm if the
value stays at 1 even after many minutes of no TCP activity?

Cheers,
Lawrence