Date: Mon, 19 Jun 2006 18:04:26 +1000 (EST)
From: Bruce Evans
To: John-Mark Gurney
Cc: freebsd-net@FreeBSD.org, Robert Watson, John Polstra
Subject: Re: IF_HANDOFF vs. IFQ_HANDOFF
Message-ID: <20060619162819.F44832@delplex.bde.org>
In-Reply-To: <20060618194044.GC1142@funkthat.com>

On Sun, 18 Jun 2006, John-Mark Gurney wrote:

> John Polstra wrote this message on Thu, Jun 15, 2006 at 09:18 -0700:
>> in the HW but have not yet completed. When the completion interrupt
>> comes in, the driver is supposed to check the if_snd queue for more
>> mbufs and process them.
>> Only when the transmit side of the HW goes
>> totally idle should IFF_OACTIVE be cleared again. Most of our drivers
>> set the flag only when they run out of transmit descriptors (i.e.,
>> practically never), which is just plain wrong.
>
> But the problem is that for small packets, this can mean that there
> will be a delay in handling the ring if we wait to process packets
> once the tx ring is empty.. if we ever want to max out gige w/ 64byte
> packets, we have to clear OACTIVE whenever tx approaches running out
> of packets before we can send this.. In most cases we don't know how
> long that is (since we don't keep track of packet sizes, etc), so it's
> easiest/best to clear it whenever the tx ring is not full...

To max out the link without unmaxing CPU for other uses, you do have to
know when the tx approaches running out of packets. This is best done
using watermark stuff. There should be a nearly-complete interrupt at
low water, and (only after low water is reached and the interrupt
handler doesn't refill the tx ring to be above low water again) a
completion interrupt at actual completion.

My version of the sk driver does this. It arranges for the
nearly-complete interrupt at about 32 fragments (min 128 uS) before the
tx runs dry, and no other tx interrupts unless the queue length stays
below 32, while the -current driver gets an interrupt after every
packet. It does this mainly to reduce the tx interrupt load from 1 per
packet to (under load) 1 per 480 fragments. The correct handling of
OACTIVE is obtained as a side effect almost automatically.
It must be decided when to interrupt (sk hardware allows interrupting
or not interrupting after every fragment), and it would be obviously
wrong to interrupt only after the last fragment in the ring since the
tx might run dry then (even if the tx interrupt occurs when the last
fragment is removed by the hardware from the ring but before it is
sent, it only takes a few uS to send it so the tx would often run dry
due to software latency).

I'm not very familiar with NIC hardware and don't know how other NICs
support timing of tx interrupts, but watermark stuff like the above is
routine for serial devices/drivers. sk's support for interrupting on
any fragment is too flexible to be good (it is painful to program, and
there doesn't seem to be a good way to time out if there is no good
fragment to interrupt on or when you program the interruption on a
wrong fragment).

Related serial device programming: 8250-16650 UARTs interrupt when the
last character is removed from the tx "ring". This is not
programmable, but the delay is long enough at low speeds (87 uS at
115200 bps). The 16950 UART has a programmable tx interrupt trigger
level which defaults to 1 character time. The delay from this is too
short at higher speeds (11 uS at 921600 bps...). I use 16. The "tx"
ring size of a 16950 is 128 characters.

Timing for characters in a UART at 921600 bps is similar to timing for
normal packets in 1G bps ethernet (1G/921600 ~= 1K ~= 1500+ normal
ethernet packet size), so similar ring sizes and trigger levels are
good (smaller ones would be better for smaller packets). Strangely, at
921600 bps, the tx trigger levels become more critical for maxing out
the device than the rx trigger levels, since rx is forced to keep up by
the external device (provided that maxes out the connection and it is
possible to keep up), while poorly chosen tx trigger levels ensure
significant dead time when the tx runs dry.
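
The timing figures above can be checked with a little arithmetic. The
following is only a sketch in C, not code from any driver. It assumes
10 bit times per character on the serial line (8N1 framing, which is
not stated above) and ignores ethernet preamble and inter-frame gap.
The function names are made up for illustration.

```c
/*
 * Assumption: 8N1 framing, so each character occupies 10 bit times
 * on the wire (1 start + 8 data + 1 stop).
 */
#define	BITS_PER_CHAR	10.0

/* Time to shift out one character at the given bit rate, in uS. */
double
char_time_us(double bps)
{
	return (BITS_PER_CHAR * 1e6 / bps);
}

/*
 * Time between a tx interrupt that fires with `level' characters
 * still queued and the moment the transmitter runs dry, in uS.
 * This is how long software has to refill the tx "ring".
 */
double
refill_slack_us(double bps, int level)
{
	return (level * char_time_us(bps));
}

/*
 * Time to send `nbytes' bytes on an ethernet link at the given bit
 * rate, in uS.  Hypothetical simplification: preamble and
 * inter-frame gap are ignored.
 */
double
frame_time_us(double bps, int nbytes)
{
	return (nbytes * 8.0 * 1e6 / bps);
}
```

With these assumptions, char_time_us(115200) comes to about 87 uS and
char_time_us(921600) to about 11 uS, as quoted above. A trigger level
of 16 at 921600 bps gives refill_slack_us(921600, 16) of about 174 uS,
and frame_time_us(1e9, 1500) is 12 uS, which is why a character at
921600 bps and a full-size packet at 1G bps have similar timing.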

BTW, I can't see any significant effect (good or bad) from sk's
interrupt moderation, at least with tx changed as above. sk's
interrupt moderation is very primitive compared with that of some NICs
(it's just a single timer for tx and rx). Interrupting on every packet
gives too many interrupts, and my changes fix this much better than any
simple timeout-based moderation could do. My changes don't help at all
for rx, and interrupt moderation doesn't seem to help either. OTOH,
fxp's interrupt moderation works well in practice (I don't know how)
and em's interrupt moderation works well in theory (I understand its
documentation but haven't used any em devices). em has several
independent trigger levels and timeouts, and the problem of using them
effectively for rx is one of predicting future traffic. IIRC, em has
sysctls to move this problem to the user.

In the current sk driver, I think keeping IFF_OACTIVE set for longer
would work, and you can also keep track of the queue length, because of
the excessive interrupts -- you get an interrupt after every packet
(modulo interrupt moderation), not just on completion, and the
interrupt handler can both keep the h/w queue full while IFF_OACTIVE is
set and keep track of the queue length as needed for deciding when to
set IFF_OACTIVE. The CPU usage is thus large no matter whether
IFF_OACTIVE is set correctly.

Interrupt moderation complicates things and unmaxes the link. The
interrupt moderation timeout is normally set to 100 uS. This allows
significant tx-dry times (the worst case (if IFF_OACTIVE is not set
incorrectly) is sending a tinygram in 4 uS, idling for ~96 uS, ...) but
isn't very moderate since sending or receiving a normal packet takes
about 15 uS. I think the interrupt moderation timeout for sk is purely
periodic, while for better hardware (even 16550 UARTs!) at least rx
timeouts only occur after the device (in the relevant direction) has
been idle for some time.

Bruce