Date: Thu, 17 Apr 2008 12:53:24 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Sack
Cc: freebsd-net@FreeBSD.org, Dieter, Jung-uk Kim
Subject: Re: bge dropping packets issue
In-Reply-To: <3c0b01820804161402u3aac4425n41172294ad33a667@mail.gmail.com>
Message-ID: <20080417112329.G47027@delplex.bde.org>

On Wed, 16 Apr 2008, Alexander Sack wrote:

> On Wed, Apr 16, 2008 at 4:54 PM, Jung-uk Kim wrote:
>> On Wednesday 16 April 2008 04:28 pm, Alexander Sack wrote:
>> > On Wed, Apr 16, 2008 at 2:56 PM, Jung-uk Kim wrote:
>> >> [CC trimmed]
>> >>
>> >> On Wednesday 16 April 2008 02:20 pm, Alexander Sack wrote:
>> >> > Dieter: Thanks, at 20 Mbps! That's pretty awful.
>> >> >
>> >> > JK: Thanks again. Wow, I searched the list and didn't see much
>> >> > discussion with respect to bge and packet loss! I will try the
>> >> > rest of that patch, including pushing the TCP receive buffer up
>> >> > (though I don't think that's going to help in this case). The
>> >> > above is based on just looking at code....

First stop using the DEVICE_POLLING packet lossage service...

>> >> > I guess some follow-up questions would be:
>> >> >
>> >> > 1) Why isn't BGE_SSLOTS tunable (to a point)? Why can't that be
>> >> > added to the driver? I noticed that CURRENT has added a lot more
>> >> > SYSCTL information. Moreover, it seems the Linux driver can set
>> >> > it up to 1024.
>> >>
>> >> IIRC, Linux tg3 uses one ring for both standard and jumbo.
>> >
>> > I'm talking about the number of slots within the ring, not the
>> > number of RX queues.
>> >
>> > I believe the bnx4 driver (I thought the tg stuff was deprecated??)
>> > uses 4 rings (one for each port perhaps) and reads a hardware
>> > register at ISR time to flip between them.
>>
>> I guess you are reading the wrong source, i.e., bnx4(?) is the
>> NetXtreme II driver, which is a totally different family. We support
>> them with bce(4).
>> tg3 is still the official Linux driver.
>
> You are correct, I got the names confused (this problem really
> stinks)!
>
> However, my point still stands:
>
> #define TG3_RX_RCB_RING_SIZE(tp) \
>         ((tp->tg3_flags2 & TG3_FLG2_5705_PLUS) ? 512 : 1024)
>
> Even the Linux driver uses a higher number of RX descriptors than
> FreeBSD's static 256. I think minimally making this tunable is a fair
> approach.
>
> If not, no biggie, but I think it's worth it.

I use a fixed value of 512 (jkim gave a pointer to old mail containing
a fairly up-to-date version of my patches for this and more important
things).  This should only make a difference with DEVICE_POLLING.
Without DEVICE_POLLING, device interrupts normally occur every 150 usec
or even more frequently (too frequently if the average is much lower
than 150 usec), so 512 descriptors is more than enough for 1 Gbps
ethernet (the minimum possible inter-descriptor time for tiny packets
is about 0.6 usec, so the minimum number of descriptors is 150 / 0.6 =
250.  The minimum possible inter-descriptor time is rarely achieved, so
250 is enough even with some additional latency.  See below for more
numbers).

With DEVICE_POLLING at HZ = 1000, the poll interval is 1000 usec, so
the corresponding minimum number of descriptors is 1000 / 0.6 = 1667.
No tuning can give this number.  You can increase HZ to 4000 and then
the fixed value of 512 is large enough.  However, HZ = 4000 is
wasteful, and might not actually deliver 4000 polls per second --
missed polls are more likely at higher frequencies.  For timeouts (as
opposed to device polls), at least on old systems it was quite common
for timeouts at a frequency of HZ to not actually be delivered, even
when HZ was only 100, because some timeouts run for too long (several
msec each, possibly combining to > 10 msec occasionally).  Device polls
are at a lower level, so they have a better chance of actually keeping
up with HZ.  Now the main source of timeouts that run for too long is
probably mii tick routines.  These won't combine, at least for MPSAFE
drivers, but they will block both interrupts and device polls for their
own device.  So the rx ring size needs to be large enough to cover
max(150 usec or whatever the interrupt moderation time is, mii tick
time) of latency, plus any other latencies due to interrupt handling or
polling for other devices.  Latencies due to interrupts on other
devices are only certain to be significant if the other devices have
the same or higher priority.

I don't understand how 1024 can be useful for 5705_PLUS.  PLUS really
means MINUS -- at least my plain 5705 device has about 1/4 of the
performance of my older 5701 device.  The data sheet gives a limit of
512 normal rx descriptors for both.  Non-MINUS devices also support
jumbo buffers.  These are in a separate ring and not affected by the
limit of 256 in FreeBSD.
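To make the tunable idea above concrete, here is a minimal sketch of
what a boot-time knob for the standard rx ring count could look like.
The hw.bge.rx_ring_cnt name and the clamp helper are hypothetical
(nothing like this exists in the stock driver, and it is not part of
the patches mentioned above); BGE_STD_RX_RING_CNT is the 512-entry
limit from if_bgereg.h, matching the data sheet limit:

#include <sys/param.h>
#include <sys/kernel.h>

/*
 * Hypothetical loader tunable: let loader.conf choose how many
 * standard rx descriptors to use instead of hard-coding 256.
 */
static int bge_rx_ring_cnt = 256;       /* current FreeBSD default */
TUNABLE_INT("hw.bge.rx_ring_cnt", &bge_rx_ring_cnt);

/*
 * Clamp the request to what the hardware supports.
 * BGE_STD_RX_RING_CNT (512, from if_bgereg.h) is the standard rx ring
 * limit for these chips.
 */
static int
bge_rx_ring_cnt_clamp(int n)
{

        if (n < 32)
                n = 32;
        if (n > BGE_STD_RX_RING_CNT)
                n = BGE_STD_RX_RING_CNT;
        return (n);
}

The invasive part is that the ring allocation and mbuf refill loops
would then have to iterate over bge_rx_ring_cnt instead of the
compile-time constant, which is presumably part of why it has never
been a simple sysctl.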
Some numbers for 1 Gbps ethernet:

    minimum frame size = 64 bytes = 512 bits
    minimum inter-frame gap = 96 bits
    minimum total frame time = 608 nsec (may be off by 64)
    bge descriptors per tiny frame = 1 (1 for mbuf)
    buffering provided by 256 descriptors = 256 * 608 nsec
        = 155.648 usec (marginal)

    normal frame size = 1518 bytes = 12144 bits
    normal total frame time = 12240 nsec
    bge descriptors per normal frame = 2 (1 for mbuf and 1 for mbuf cluster)
    buffering provided by 256 descriptors = 256/2 * 12240 nsec
        = 1566.720 usec (plenty)

The only tuning that I've found useful since writing my old mail (with
the patches) is:

    sysctl kern.random.sys.harvest.ethernet=0
    sysctl kern.random.sys.harvest.interrupt=0
    sysctl net.inet.ip.intr_queue_maxlen=1024

Killing the entropy harvester isn't very important, and IIRC the
default intr_queue_maxlen of 50 is only a problem with net.isr.direct=0
(not the default in -current).  With an rx ring size of just 256 and
small packets, so that there is only 1 descriptor per packet, a single
bge interrupt or poll can produce 5 times as many packets as will fit
in the default intr_queue.  IIRC, net.isr.direct=1 has the side effect
of completely bypassing the intr_queue_maxlen limit, so floods cause
extra damage before they are detected.  However, the old limit of 50 is
nonsense for any not-so-old NIC.

Another problem with the intr_queue_maxlen limit is that packets
dropped because of it don't show up in any networking statistics
utility.  They only show up in sysctl output and cannot be associated
with individual interfaces.
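For reference, those drops are counted in the global
net.inet.ip.intr_queue_drops sysctl (a single counter, so, as noted,
they cannot be attributed to a particular interface).  A small userland
sketch for watching it, using only sysctlbyname(3); the helper function
name is made up:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>

/* Fetch an integer-valued sysctl by name or die trying. */
static int
fetch_int(const char *name)
{
        int val;
        size_t len;

        len = sizeof(val);
        if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
                err(1, "%s", name);
        return (val);
}

int
main(void)
{
        /* Both values are global, not per-interface. */
        printf("net.inet.ip.intr_queue_maxlen = %d\n",
            fetch_int("net.inet.ip.intr_queue_maxlen"));
        printf("net.inet.ip.intr_queue_drops  = %d\n",
            fetch_int("net.inet.ip.intr_queue_drops"));
        return (0);
}

Running it before and after a flood shows whether the larger limit set
above is actually being hit; the limit just needs to exceed the number
of packets that a single interrupt or poll can queue.

Bruce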