Date: Thu, 17 Apr 2008 12:53:24 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Sack
Cc: freebsd-net@FreeBSD.org, Dieter, Jung-uk Kim
Subject: Re: bge dropping packets issue
In-Reply-To: <3c0b01820804161402u3aac4425n41172294ad33a667@mail.gmail.com>
Message-ID: <20080417112329.G47027@delplex.bde.org>

On Wed, 16 Apr 2008, Alexander Sack wrote:

> On Wed, Apr 16, 2008 at 4:54 PM, Jung-uk Kim wrote:
>> On Wednesday 16 April 2008 04:28 pm, Alexander Sack wrote:
>> > On Wed, Apr 16, 2008 at 2:56 PM, Jung-uk Kim wrote:
>> >> [CC trimmed]
>> >>
>> >> On Wednesday 16 April 2008 02:20 pm, Alexander Sack wrote:
>> >> > Dieter: Thanks, at 20 Mbps! That's pretty awful.
>> >> >
>> >> > JK: Thanks again. Wow, I searched the list and didn't see much
>> >> > discussion with respect to bge and packet loss! I will try the
>> >> > rest of that patch, including pushing the TCP receive buffer up
>> >> > (though I don't think that's going to help in this case). The
>> >> > above is based on just looking at code....

First stop using the DEVICE_POLLING packet lossage service...

>> >> > I guess some follow-up questions would be:
>> >> >
>> >> > 1) Why isn't BGE_SSLOTS tunable (to a point)? Why can't that be
>> >> > added to the driver? I noticed that CURRENT has added a lot more
>> >> > SYSCTL information. Moreover, it seems the Linux driver can set
>> >> > it up to 1024.
>> >>
>> >> IIRC, Linux tg3 uses one ring for both standard and jumbo.
>> >
>> > I'm talking about the number of slots within the ring, not the
>> > number of RX queues.
>> >
>> > I believe the bnx4 driver (I thought the tg stuff was deprecated??)
>> > uses 4 rings (one for each port perhaps) and reads a hardware
>> > register at ISR time to flip between them.
>>
>> I guess you are reading the wrong source, i.e., bnx4(?) is the
>> NetXtreme II driver, which is a totally different family. We support
>> them with bce(4).
>> tg3 is still the official Linux driver.
>
> You are correct, I got the names confused (this problem really
> stinks)!
>
> However, my point still stands:
>
> #define TG3_RX_RCB_RING_SIZE(tp) \
>         ((tp->tg3_flags2 & TG3_FLG2_5705_PLUS) ? 512 : 1024)
>
> Even the Linux driver uses a higher number of RX descriptors than
> FreeBSD's static 256. I think minimally making this tunable is a fair
> approach.
>
> If not, no biggie, but I think it's worth it.

I use a fixed value of 512 (jkim gave a pointer to old mail containing
a fairly up-to-date version of my patches for this and more important
things).  This should only make a difference with DEVICE_POLLING.
Without DEVICE_POLLING, device interrupts normally occur every 150 usec
or even more frequently (too frequently if the average is much lower
than 150 usec), so 512 descriptors is more than enough for 1 Gbps
ethernet (the minimum possible inter-descriptor time for tiny packets
is about 0.6 usec, so the minimum number of descriptors is 150 / 0.6 =
250.  The minimum possible inter-descriptor time is rarely achieved, so
250 is enough even with some additional latency.  See below for more
numbers).

With DEVICE_POLLING at HZ = 1000, the poll interval is 1000 usec, so
the corresponding minimum number of descriptors is 1000 / 0.6 = 1667.
No tuning can give this number.  You can increase HZ to 4000 and then
the fixed value of 512 is large enough.  However, HZ = 4000 is
wasteful, and might not actually deliver 4000 polls per second --
missed polls are more likely at higher frequencies.  For timeouts (as
opposed to device polls), at least on old systems it was quite common
for timeouts at a frequency of HZ to not actually be delivered, even
when HZ was only 100, because some timeouts run for too long (several
msec each, possibly combining to > 10 msec occasionally).  Device polls
are at a lower level, so they have a better chance of actually keeping
up with HZ.  Now the main source of timeouts that run for too long is
probably mii tick routines.  These won't combine, at least for MPSAFE
drivers, but they will block both interrupts and device polls for their
own device.  So the rx ring size needs to be large enough to cover
max(150 usec or whatever the interrupt moderation time is, mii tick
time) of latency, plus any other latencies due to interrupt handling or
polling for other devices.  Latencies due to interrupts on other
devices are only certain to be significant if the other devices have
the same or higher priority.

I don't understand how 1024 can be useful for 5705_PLUS.  PLUS really
means MINUS -- at least my plain 5705 device has about 1/4 of the
performance of my older 5701 device.  The data sheet gives a limit of
512 normal rx descriptors for both.  Non-MINUS devices also support
jumbo buffers.  These are in a separate ring and not affected by the
limit of 256 in FreeBSD.
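To make the tunable idea above concrete, here is a minimal sketch of
what a boot-time knob for the standard rx ring count could look like.
The hw.bge.rx_ring_cnt name and the clamp helper are hypothetical
(nothing like this exists in the stock driver, and it is not part of
the patches mentioned above); BGE_STD_RX_RING_CNT is the 512-entry
limit from if_bgereg.h, matching the data sheet limit:

#include <sys/param.h>
#include <sys/kernel.h>

/*
 * Hypothetical loader tunable: let loader.conf choose how many
 * standard rx descriptors to use instead of hard-coding 256.
 */
static int bge_rx_ring_cnt = 256;       /* current FreeBSD default */
TUNABLE_INT("hw.bge.rx_ring_cnt", &bge_rx_ring_cnt);

/*
 * Clamp the request to what the hardware supports.
 * BGE_STD_RX_RING_CNT (512, from if_bgereg.h) is the standard rx ring
 * limit for these chips.
 */
static int
bge_rx_ring_cnt_clamp(int n)
{

        if (n < 32)
                n = 32;
        if (n > BGE_STD_RX_RING_CNT)
                n = BGE_STD_RX_RING_CNT;
        return (n);
}

The invasive part is that the ring allocation and mbuf refill loops
would then have to iterate over bge_rx_ring_cnt instead of the
compile-time constant, which is presumably part of why it has never
been a simple sysctl.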
Some numbers for 1 Gbps ethernet:

    minimum frame size = 64 bytes = 512 bits
    minimum inter-frame gap = 96 bits
    minimum total frame time = 608 nsec (may be off by 64)
    bge descriptors per tiny frame = 1 (1 for mbuf)
    buffering provided by 256 descriptors = 256 * 608 nsec
        = 155.648 usec (marginal)

    normal frame size = 1518 bytes = 12144 bits
    normal total frame time = 12240 nsec
    bge descriptors per normal frame = 2 (1 for mbuf and 1 for mbuf cluster)
    buffering provided by 256 descriptors = 256/2 * 12240 nsec
        = 1566.720 usec (plenty)

The only tuning that I've found useful since writing my old mail (with
the patches) is:

    sysctl kern.random.sys.harvest.ethernet=0
    sysctl kern.random.sys.harvest.interrupt=0
    sysctl net.inet.ip.intr_queue_maxlen=1024

Killing the entropy harvester isn't very important, and IIRC the
default intr_queue_maxlen of 50 is only a problem with net.isr.direct=0
(not the default in -current).  With an rx ring size of just 256 and
small packets, so that there is only 1 descriptor per packet, a single
bge interrupt or poll can produce 5 times as many packets as will fit
in the default intr_queue.  IIRC, net.isr.direct=1 has the side effect
of completely bypassing the intr_queue_maxlen limit, so floods cause
extra damage before they are detected.  However, the old limit of 50 is
nonsense for any not-so-old NIC.

Another problem with the intr_queue_maxlen limit is that packets
dropped because of it don't show up in any networking statistics
utility.  They only show up in sysctl output and cannot be associated
with individual interfaces.
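For reference, those drops are counted in the global
net.inet.ip.intr_queue_drops sysctl (a single counter, so, as noted,
they cannot be attributed to a particular interface).  A small userland
sketch for watching it, using only sysctlbyname(3); the helper function
name is made up:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>

/* Fetch an integer-valued sysctl by name or die trying. */
static int
fetch_int(const char *name)
{
        int val;
        size_t len;

        len = sizeof(val);
        if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
                err(1, "%s", name);
        return (val);
}

int
main(void)
{
        /* Both values are global, not per-interface. */
        printf("net.inet.ip.intr_queue_maxlen = %d\n",
            fetch_int("net.inet.ip.intr_queue_maxlen"));
        printf("net.inet.ip.intr_queue_drops  = %d\n",
            fetch_int("net.inet.ip.intr_queue_drops"));
        return (0);
}

Running it before and after a flood shows whether the larger limit set
above is actually being hit; the limit just needs to exceed the number
of packets that a single interrupt or poll can queue.

Bruce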