Date: Wed, 17 Apr 2013 21:49:39 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Sepherosa Ziehau <sepherosa@gmail.com>
Subject: Re: bge(4) sysctl tuneables -- a blast from the past.
In-Reply-To: <CAMOc5cyrgh1szyrEKTzAEQ4-egtD4XmYxp+13SyZizVXiEhfEA@mail.gmail.com>
Message-ID: <20130417203212.K1099@besplex.bde.org>
References: <1365781568.1418.1.camel@localhost>
 <20130413200512.G1165@besplex.bde.org>
 <1366065356.1350.7.camel@localhost> <20130416152121.G904@besplex.bde.org>
 <CAMOc5cyrgh1szyrEKTzAEQ4-egtD4XmYxp+13SyZizVXiEhfEA@mail.gmail.com>
Cc: "pyunyh@gmail.com" <pyunyh@gmail.com>,
 David Christensen <davidch@broadcom.com>,
 "freebsd-net@freebsd.org" <freebsd-net@FreeBSD.org>, bde <bde@FreeBSD.org>,
 Bruce Evans <brde@optusnet.com.au>

On Tue, 16 Apr 2013, Sepherosa Ziehau wrote:

> On Tue, Apr 16, 2013 at 1:56 PM, Bruce Evans <brde@optusnet.com.au> wrote:
>>
>> Technical bugs include:
>> - wrong defaults are claimed for *coal_ticks.  The defaults are 150, but
>>   are claimed to be 150 milliseconds.  These values are dimensionless,
>>   but since ticks take 1 microsecond each, 150 gives 150 microseconds,
>>   not 150 milliseconds.
>
> The real effect of TX coalesce ticks is confusing to me; TX interrupt
> does not come at the rate you have specified, at least for several
> PCI-e bge(4) I have tested. However, RX coalesce ticks work as
> expected.

It works for me on a 5701 (PCI-X) on a PCI-33 bus.

Perhaps you are just seeing rx interrupts mixed with tx interrupts.  At
least the FreeBSD driver doesn't determine the interrupt type, so it
always processes tx activity when it gets an rx interrupt.

I had to do the following to avoid getting rx interrupts (without this
the interrupt rate increased by a factor of 3-4 with tx_coal_ticks = 150,
from ~6.7 kHz to 19-24 kHz):
- I use ttcp for testing, so on the receiving system use ttcp -u -r so
   that it doesn't echo anything (otherwise it would "echo" with icmp
   port-unreachable unless firewalled).
- Use an old receiving system that doesn't support flow control.  The
   system can't keep up, and drops about half of the packets, so if it
   did flow control then there would be a lot of rx interrupts.

> Here is how the tests were conducted:
> - Send only test, no RX
> - Each packet consume only one BD; UDP datagram, using hardware
> checksum offloading
> - TX coalesce BDs is set to 0, so only TX coalesce ticks have effect
>
> The interrupt rate I had got seemed to be related to packet size?!  I
> had tested two TX coalesce ticks settings:
> (the result I had recorded was using BCM5720)

This might be due to larger packet size causing less rx activity.

> The first setting was 1023us; the first col is UDP data size, the
> second col is rough interrupt rate
> 18B    667/s
> 64B    611/s

Oops, this doesn't look like rx activity.  We expect a rate of 977 Hz,
possibly increased significantly by tx activity.

I get 996-1004 here (1023us is actually 1000?).

> 128B    538/s
> 256B    432/s
> 512B    311/s
> 1024B    194/s
> 1472B    146/s

I get 996-1004 for all of these.
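
For reference, the expected rate follows from the tick period alone.  A minimal Python sketch, assuming one coalescing tick is 1 microsecond (as stated earlier in the thread); the function name is made up for illustration:

```python
def expected_irq_rate_hz(coal_ticks, tick_us=1.0):
    """Interrupts per second if one interrupt fires every coal_ticks ticks."""
    return 1e6 / (coal_ticks * tick_us)

# Nominal setting from the test versus the period the hardware seems to use.
rate_1023 = expected_irq_rate_hz(1023)   # about 977.5 Hz
rate_1000 = expected_irq_rate_hz(1000)   # 1000 Hz
print(round(rate_1023, 1), round(rate_1000, 1))
```

The observed 996-1004 matches a 1000 us period much better than a 1023 us one, which is what the "(1023us is actually 1000?)" remark suggests.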

Now I remember another problem that I work around using huge ifqueues (10k
or 20k entries) and/or busy-waiting in the send() in ttcp.  It is too easy
for the tx to stop because there is nothing on the ifqueue to refill it.
Then it won't restart until the application starts sending again.  It
is normal for all the queues to fill up.  Then send returns ENOBUFS and
there is no good way for the application to handle this, since select()
on the queues not being full is broken (never supported).  Bad ways include:
- sleep for a while in the application.  It is hard to know when to wake up,
   and impossible to wake up soon enough if timeout granularity is large.
- use huge ifqueues, so that long delays in the application work
- spin trying send().
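
The last of these workarounds can be sketched as follows.  This is only an illustration in Python, not what ttcp does; the function name and the max_spins limit are hypothetical:

```python
import errno

def send_spinning(sock, payload, addr, max_spins=1_000_000):
    """Retry sendto() for as long as the kernel reports ENOBUFS.

    Returns the number of retries that were needed.  Any other error
    propagates, and ENOBUFS is re-raised once max_spins is exhausted.
    """
    spins = 0
    while True:
        try:
            sock.sendto(payload, addr)
            return spins
        except OSError as e:
            if e.errno != errno.ENOBUFS or spins >= max_spins:
                raise
            spins += 1   # busy-wait: burn CPU, then try again
```

It would be called as send_spinning(sock, buf, (host, port)) on a UDP socket.  It keeps the queues refilled at the cost of a CPU spinning in the sender, which is why it is listed among the bad ways.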

> Second setting was 128us; the first col is UDP data size, the second
> col is rough interrupt rate
> 18B    1647/s
> 64B    1338/s

Now you should be getting much higher interrupt rates, unless something
can't keep up.  I get 7904-7967 and 7906-7971.

> 128B    1030/s
> 256B    700/s
> 512B    430/s
> 1024B    235/s
> 1472B    169/s

I get little dependency on the packet size.  At 1472B, the packet rate is
~58900 pps.  Everything on the tx side can keep up with that, though not much
more, so no drop is expected.
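
As a rough cross-check of these numbers, the sketch below works out how many packets each tx interrupt covers at 1472B.  It assumes the 7904-7971 interrupt rates quoted above also hold at 1472B (the text only says the dependency on packet size is small) and that one tick is 1 us:

```python
packet_rate = 58900                      # packets/s at 1472B, from the text
irq_rate_lo, irq_rate_hi = 7904, 7971    # observed interrupts/s (assumed to apply here)
nominal_irq_rate = 1e6 / 128             # 7812.5 Hz if tx_coal_ticks = 128 means 128 us

# Fewer interrupts per second means more packets reaped per interrupt.
pkts_per_irq_hi = packet_rate / irq_rate_lo
pkts_per_irq_lo = packet_rate / irq_rate_hi

print(f"{nominal_irq_rate:.1f} Hz nominal")
print(f"{pkts_per_irq_lo:.2f}-{pkts_per_irq_hi:.2f} packets per interrupt")
```

That is roughly 7.4 packets per interrupt, so at this packet size the interrupt rate is set by the tick timer rather than by the amount of traffic.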

> Well, to be frank, it does not make too much sense to me.

I found timestamps and counters for bge_*xeof() good for understanding
the flow of control.  It is easy to generate too much data, so I keep
the tx and rx statistics separate and try to understand tx and rx activity
separately.  Some output for tx with tx_coal_ticks = 1023 and packet size 18:

@  976 1366197879.094951 454  25 349 1366197879.094976 105
@  971 1366197879.095947 455  26 351 1366197879.095973 104
@  972 1366197879.096945 455  25 355 1366197879.096970 100
@  975 1366197879.097945 451  24 351 1366197879.097969 100
@  974 1366197879.098943 443  24 337 1366197879.098967 106

The large numbers are absolute timestamps for bge_txeof() entry and exit.

The entries are separated by almost exactly 1000 us (not 1023 us as expected).

The first numeric column gives the time in us between the previous exit and
this entry.  Not very relevant here.

The fourth numeric column gives the time in us between this entry and exit.
Not very relevant here.
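
The timing columns can be checked mechanically against the timestamps.  A small Python sketch; the field names are made up for illustration, and Decimal is used because a double cannot hold these timestamps to the microsecond exactly:

```python
from decimal import Decimal

TRACE = """\
@  976 1366197879.094951 454  25 349 1366197879.094976 105
@  971 1366197879.095947 455  26 351 1366197879.095973 104
@  972 1366197879.096945 455  25 355 1366197879.096970 100
@  975 1366197879.097945 451  24 351 1366197879.097969 100
@  974 1366197879.098943 443  24 337 1366197879.098967 106
"""

def parse(line):
    f = line.split()[1:]                 # drop the leading "@"
    return {
        "gap_us": int(f[0]),             # previous exit -> this entry
        "t_entry": Decimal(f[1]),        # absolute time of bge_txeof() entry
        "dur_us": int(f[3]),             # this entry -> this exit
        "t_exit": Decimal(f[5]),         # absolute time of bge_txeof() exit
    }

def us(seconds):
    return int(seconds * 1_000_000)

rows = [parse(line) for line in TRACE.splitlines()]
for prev, cur in zip([None] + rows, rows):
    assert us(cur["t_exit"] - cur["t_entry"]) == cur["dur_us"]
    if prev is not None:
        assert us(cur["t_entry"] - prev["t_exit"]) == cur["gap_us"]

# Entry-to-entry period, i.e. the interrupt period.
periods = [us(b["t_entry"] - a["t_entry"]) for a, b in zip(rows, rows[1:])]
print(periods)   # [996, 998, 1000, 998]
```

The periods confirm the "almost exactly 1000 us" spacing between entries.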

The third and final numeric columns give the ring indexes on entry and
exit, and the 5th numeric column gives the difference of these.  These
are relevant here.  Ideally the ring would be almost but not quite
full whenever we start, and the difference would be almost 512, but
ttcp apparently can't generate data fast enough to keep it full, so
it has an average of 350+ entries and the packet rate is 350+kpps.  We
don't want the ring to be completely full when we start, since that
means that we are not interrupting enough to keep up with the generator
and probably also with the hardware.  This system can do 640+kpps when
ideally configured, using tx_coal_ticks = 1000000 and tx_coal_bds =
384.  With tx_coal_ticks = 1023 (1000) and tx_coal_bds = 0, it couldn't
do more than 512kpps.  Its current non-ideal configuration includes
firewalling, sharing the bge interrupt with rl, and not overclocking.
In this configuration, the above 2 tx_coal_* settings are almost equally
good (tx_coal_ticks = 1023 reduces latency for reaping descriptors,
but latency doesn't matter; tx_coal_ticks = 100000 reduces interrupts
when not under load, but when not under load interrupt overhead isn't
a problem).
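
The packet rates in the last paragraph follow from simple arithmetic on the ring indexes.  A sketch in Python, assuming one packet per descriptor, one reap of the ring per interrupt, and a 512-entry tx ring (taken from the "almost 512" remark); the constant names are made up:

```python
RING_SIZE = 512      # tx ring entries (assumption, from the "almost 512" remark)
IRQ_HZ = 1000        # observed interrupt rate with tx_coal_ticks = 1023

# (ring index on entry, ring index on exit, reported difference) from the trace
SAMPLES = [(454, 105, 349), (455, 104, 351), (455, 100, 355),
           (451, 100, 351), (443, 106, 337)]

for entry, exit_, diff in SAMPLES:
    assert entry - exit_ == diff        # 5th column is entry index minus exit index

avg_reaped = sum(d for _, _, d in SAMPLES) / len(SAMPLES)

print(avg_reaped * IRQ_HZ)    # roughly 350 kpps, as reported
print(RING_SIZE * IRQ_HZ)     # 512000: ceiling when the ring is reaped once per ms
print(640_000 / 384)          # interrupts/s at 640 kpps if tx_coal_bds = 384 sets the pace
```

Under these assumptions the 512kpps limit with tx_coal_bds = 0 is just the ring size times the interrupt rate, and coalescing on 384 descriptors would need only about 1667 interrupts per second to sustain 640kpps.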

Bruce