Date: Fri, 18 Apr 2008 11:13:39 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Sack <pisymbol@gmail.com>
Cc: freebsd-net@freebsd.org, Dieter <freebsd@sopwith.solgatos.com>, Jung-uk Kim <jkim@freebsd.org>
Subject: Re: bge dropping packets issue
Message-ID: <20080418093328.B50187@delplex.bde.org>
In-Reply-To: <3c0b01820804170643w6b771ce9jdfc2dc5b240922b@mail.gmail.com>
References: <3c0b01820804160929i76cc04fdy975929e2a04c0368@mail.gmail.com> <200804161456.20823.jkim@FreeBSD.org> <3c0b01820804161328m77704ca0g43077a9718d446d4@mail.gmail.com> <200804161654.22452.jkim@FreeBSD.org> <3c0b01820804161402u3aac4425n41172294ad33a667@mail.gmail.com> <20080417112329.G47027@delplex.bde.org> <3c0b01820804170643w6b771ce9jdfc2dc5b240922b@mail.gmail.com>
On Thu, 17 Apr 2008, Alexander Sack wrote:

> On Wed, Apr 16, 2008 at 10:53 PM, Bruce Evans <brde@optusnet.com.au> wrote:
>> On Wed, 16 Apr 2008, Alexander Sack wrote:

>[DEVICE_POLLING]
> But why was it added to begin with if standard interrupt driven I/O is
> faster? (was it the fact that historically hardware didn't do
> interrupt coalescing initially)

See Robert's reply.

>>> However, my point still stands:
>>>
>>> #define TG3_RX_RCB_RING_SIZE(tp) ((tp->tg3_flags2 &
>>> TG3_FLG2_5705_PLUS) ? 512 : 1024)
>>>
>>> Even the Linux driver uses higher number of RX descriptors than
>>> FreeBSD's static 256. I think minimally making this tunable is a fair
>>> approach.
>>>
>>> If not, no biggie, but I think it's worth it.
>>
>> I use a fixed value of 512 (jkim gave a pointer to old mail containing
>> a fairly up to date version of my patches for this and more important
>> things). This should only make a difference with DEVICE_POLLING.
>
> Then minimally 512 could be used if DEVICE_POLLING is enabled and my
> point still stands. Though in light of the other statistics you cited
> I understand now that this may not make that big of an impact.

em uses only 256 too (I misread it as using 2048). Someone reported that
increasing this to 4096 reduced packet loss with polling.

>> Without DEVICE_POLLING, device interrupts normally occur every 150 usec
>
> Is that the coal ticks value you are referring to? Sorry this is my
> first time looking at this driver!

Yes, the driver normally configures coal ticks as 150. This is a good
default.

>> or even more frequently (too frequently if the average is much lower
>> than 150 usec), so 512 descriptors is more than enough for 1Gbps ethernet
>> (the minimum possible inter-descriptor time for tiny packets is about 0.6
>> usec,
>
> How do you measure this number?

0.6 usec is the theoretical minimum. I actually measure a minimum of
about 1.5 usec for my hardware (5701 PCI/X on plain PCI) by making
timestamps in bge_rxeof() and bge_txeof().
(1.5 usec is the average for a ring full of descriptors.)

> I'm assuming when you say "inter-descriptor time" you mean the time it
> takes the card to fill a RX descriptor on receipt of a packet (really
> the firmware latency?).

No, it is part of the Ethernet spec (96 bit times for all speeds of
Ethernet IIRC, so it is much shorter than it was for original Ethernet).
At least my hardware takes significantly longer than this (1.5 - 0.6 usec
= 900 nsec!). It is unclear where the extra time is spent, but presumably
the hardware implements the Ethernet spec and is limited mainly by the
bus speed (if the bus is plain PCI, otherwise DMA speed might be the
limit), so if packets arrived every 0.6 usec then it would buffer many of
them in fast device memory and then be forced to drop 9 in every 15 on
average.

>> For timeouts instead of device polls, at least on old systems it was
>> quite common for timeouts at a frequency of HZ not actually being
>> delivered, even when HZ was only 100, because some timeouts run for
>> too long (several msec each, possibly combining to > 10 msec occasionally).
>> Device polls are at a lower level, so they have a better chance of
>> actually keeping up with HZ. Now the main source of timeouts that run
>> for too long is probably mii tick routines. These won't combine, at
>> least for MPSAFE drivers, but they will block both interrupts and
>> device polls for their own device. So the rx ring size needs to be
>> large enough to cover max(150 usec or whatever interrupt moderation time,
>> mii tick time) of latency plus any other latencies due to interrupt
>> handling or polling for other devices. Latencies due to interrupts on
>> other devices are only certain to be significant if the other devices
>> have higher or the same priority.
>
> You described what I'm seeing. Couple this with the fact that the
> driver uses one mtx for everything doesn't help.
> I'm pretty sure I'm
> running into RX descriptor starvation despite the fact that
> statistically speaking, 256 descriptors is enough for 1Gbps (I'm
> talking 100MBps the firmware is dropping packets). The problem gets
> worse if I add some kind of I/O workload on the system (my load
> happens to be a gzip of a large log file in /tmp).

I haven't found the mii tick latency to be a problem in practice, though
I once suspected it. Oh, I just remembered that this requires working
PREEMPTION so that lower-priority interrupt handlers like ata and sc get
preempted. PREEMPTION wasn't the default and didn't work very well until
relatively recently. But I think it works in 7.0.

> I noticed that if I put ANY kind of debugging messages in bge_tick()
> the drop gets much worse (for example just printing out the number of
> dropped packets read from bge_stats_update() when a drop occurs causes
> EVEN more drops to occur and if I had to guess it's the printf just
> uses up more cycles which delays the drain of RX chain and causes a
> longer time to recover - this is a constant stream from a traffic
> generator).

Delays while holding the lock will cause problems of course. Hmm,
bge_tick() is a callout, so it may itself be delayed or preempted.
Delaying it shouldn't matter, and latency from preempting it is supposed
to be handled by priority propagation:

    callout ithread runs
    calls bge_tick()
    acquires device mutex
    ...
    preempted by unrelated ithread
    ...
    preempted by bge ithread
    tries to acquire device mutex; blocks
    bge ithread priority is propagated to callout ithread
    preempted by callout ithread
    ... // now it is high priority; should be more careful not to take long
    releases device mutex; loses its propagated priority
    preempted by bge ithread
    acquires device mutex
    ...
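The ring-sizing rule discussed earlier in this message (the rx ring has to
cover the longest stretch during which the driver is not draining it) and
the "drop 9 in every 15" estimate can be put into numbers. The following
is only a rough sketch: the function names are hypothetical, it is not
driver code, and it assumes back-to-back minimum-size frames at 1 Gbps
(608 nsec per frame, ignoring the preamble) as the worst case.

```python
# Rough sizing sketch; helper names are hypothetical, not from the bge driver.

def descriptors_needed(stall_ns, inter_frame_ns):
    """Descriptors filled while the host does not drain the rx ring.

    Integer ceiling division, so a partially elapsed frame time still
    counts as one descriptor.
    """
    return -(-stall_ns // inter_frame_ns)


def sustained_drop_fraction(arrival_ns, service_ns):
    """Fraction of frames lost when frames arrive every arrival_ns but the
    hardware can only move one frame every service_ns (long-run average)."""
    if service_ns <= arrival_ns:
        return 0.0
    return 1.0 - arrival_ns / service_ns


MIN_FRAME_NS = 608  # assumed worst case: 64-byte frame + 96-bit gap at 1 Gbps

# 150 usec of interrupt moderation alone nearly fills a 256-entry ring.
print(descriptors_needed(150_000, MIN_FRAME_NS))    # 247
# A hypothetical 1 msec stall (e.g. a slow tick routine) would overflow 512.
print(descriptors_needed(1_000_000, MIN_FRAME_NS))  # 1645
# Arrival every 0.6 usec against a measured 1.5 usec per descriptor.
print(f"{sustained_drop_fraction(600, 1500):.2f}")  # 0.60, i.e. 9 in every 15
```

The 1 msec stall is an arbitrary illustration, not a measured mii tick
time; substitute whatever latency is actually observed.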
>> Some numbers for [1 Gbps] ethernet:
>>
>> minimum frame size = 64 bytes = 512 bits
>> minimum inter-frame gap = 96 bits
>> minimum total frame time = 608 nsec (may be off by 64)
>> bge descriptors per tiny frame = 1 (1 for mbuf)
>> buffering provided by 256 descriptors = 256 * 608 = 155.648 usec (marginal)
>
> So as I read this, it takes 155 usec to fill up the entire RX chain of
> rx_bd's if it's just small packets, correct?

At least that long, depending on bus and DMA speeds.

>> normal frame size = 1518 bytes = 12144 bits
>> normal total frame time = 12240 nsec
>> bge descriptors per normal frame = 2 (1 for mbuf and 1 for mbuf cluster)
>> buffering provided by 256 descriptors = 256/2 * 12240 = 1566.720 usec
>> (plenty)
>
> Is this based again on your own instrumentation based on the last
> patch? (just curious, I believe you, I just wanted to know if this
> was an artifact of you doing some tuning research or something else)

This is a theoretical minimum too, but in practice even a PCI bus can
almost keep up with 1Gbps ethernet in 1 direction, so I've measured
average packet rates of > 81 kpps for normal frames (81 kpps = 12345 nsec
per packet). Timestamps made in bge_rxeof() at a rate of only 62.7 kpps
(since my em card can't go faster than this) look like this:

%%%
 97 1208479322.632804 13 0 7 1208479322.632804 6
104 1208479322.632908 11 0 6 1208479322.632908 5
105 1208479322.633013  9 1 5 1208479322.633014 4
 64 1208479322.633078 10 0 4 1208479322.633078 6
 95 1208479322.633173 11 1 6 1208479322.633174 5
%%%

Here the columns give:
1st: time in usec between bge_rxeof() calls
4th: time in usec taken by this call
5th: number of descriptors processed by this call
other: raw timestamps and ring indexes

The inter-rxeof time is ~100 usec since rx_coal_ticks is configured to
100. Thus there are only a few packets per interrupt at the "low" rate
of 62.7 kpps. There are no latency problems in sight in this truncated
output.
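The frame-time arithmetic quoted above is easy to reproduce. The sketch
below is illustrative only: the function names are made up for this
example, the link is assumed to run at exactly 1 Gbps (1 bit = 1 nsec),
and, like the quoted figures, it leaves the 64-bit preamble out of the
frame time.

```python
# Illustrative arithmetic only; helper names are hypothetical.

LINK_BPS = 1_000_000_000      # assumed 1 Gbps link
INTER_FRAME_GAP_BITS = 96     # minimum inter-frame gap


def frame_time_ns(frame_bytes, link_bps=LINK_BPS):
    """Wire time of one frame plus the inter-frame gap, in nanoseconds.

    The preamble is not counted, matching the "may be off by 64" caveat.
    """
    bits = frame_bytes * 8 + INTER_FRAME_GAP_BITS
    return bits * 1_000_000_000 // link_bps


def ring_buffering_usec(descriptors, frame_bytes, descriptors_per_frame=1):
    """Minimum time, in usec, for back-to-back frames to fill the ring."""
    frames = descriptors // descriptors_per_frame
    return frames * frame_time_ns(frame_bytes) / 1000


def frames_per_interrupt(packets_per_sec, coal_usec):
    """Average number of frames that arrive in one coalescing interval."""
    return packets_per_sec * coal_usec / 1_000_000


print(frame_time_ns(64))                 # 608
print(frame_time_ns(1518))               # 12240
print(ring_buffering_usec(256, 64))      # tiny frames, 1 descriptor each
print(ring_buffering_usec(256, 1518, descriptors_per_frame=2))
print(frames_per_interrupt(62_700, 100)) # about 6 frames per rxeof call
```

With 256 descriptors this gives 155.648 usec for tiny frames and
128 * 12240 nsec = 1566.720 usec for normal frames at 2 descriptors each.
About 6 frames per 100 usec interval agrees with the 4 to 7 descriptors
per call seen in the timestamp output.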
The timestamp output above is inconsistent with what I said earlier --
there is no sign of the factor of 2 for the mbuf+cluster split. I now
think that that split only affects output.

> So the million dollar question: Do you believe that if I disable
> DEVICE_POLLING and use interrupt driven I/O, I could achieve zero
> packet loss over a 1Gbps link? This is the main issue I need to solve
> (solve means either no it's not really achievable without a heavy
> rewrite of the driver OR yes it is with some tuning). If the answer
> is yes, then I have to understand the impact on the system in general.
> I just want to be sure I'm on a viable path through the BGE maze!

I think you can get close enough if the bus and memory and CPU(s) permit
and you don't need to get too close to the theoretical limits.

Bruce