From: Luigi Rizzo
To: net@FreeBSD.ORG
Date: Sat, 27 Oct 2001 12:51:02 -0700
Subject: Polling vs Interrupts (was Re: NEW CODE: polling support...)

As the topic came out, I thought it would be useful to pass around a
short discussion on the performance implications of interrupts and
polling.

In addition to the code that I already posted, which implements polling
as discussed in Sec.5 below, I have another set of patches that implement
device-independent interrupt mitigation as discussed near the end of
Sec.3, and I am finishing up some other code that lets you say "I want to
reserve XX percent of the CPU for device drivers" and dynamically adapts
the polling "counts" to match the requirements.

	cheers
	luigi

[DISCLAIMER: what follows reflects my view of the world, and it is mostly
well-known stuff, documented for a long time in the literature. I am sure
others have already implemented some or all or more of this; I am not
claiming paternity of any of these ideas.]

	=== Polling versus Interrupts in (network) device drivers ===

1. BASIC OPERATION OF INTERRUPTS

Typical network device drivers use interrupts to let the network
controller (NIC) communicate with the operating system: when some
interesting event occurs, the NIC generates an interrupt request, which
is dispatched to the processor through the interrupt controller. This in
turn causes the execution of a short piece of code (typically written in
assembler) which in the end invokes the interrupt service routines
(ISRs) of all the devices sharing the same interrupt line.

The ISR, which is typically a C function in the device driver, say
XX_intr(), normally does the following:

    for (;;) {
        if (no interrupts pending)
            break;
        acknowledge and process the pending events;
    }

The cost of getting and servicing an interrupt is made of the following
components:

 + the hardware and software context switch;
 + the software overhead to access the interrupt controller and dispatch
   the request to the appropriate ISR;
 + the time spent in each of the XX_intr() routines associated with the
   physical interrupt line.

The first two components can take anywhere from 1 to 20us depending on
your system. These times are relatively constant, irrespective of the
system load. The cost of the last component is largely variable and
unpredictable, as we will see below.

At low load levels (incoming packet rates), a good fraction of the
interrupt cost is actually the overhead for context switch and dispatch,
but this is not something to worry about, because there is enough CPU to
serve all the interrupts you get. Even at 10000 packets per second, a
state-of-the-art CPU may spend only some 5-10% of its time in overhead.
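To make the pattern above concrete, here is a minimal C sketch of such an
ISR for a hypothetical driver "xx". The softc layout, register names
(XX_ISR, XX_ISR_RX_OK, ...) and helpers (xx_rxeof(), xx_txeof(),
xx_error()) are invented for illustration and do not correspond to any
real driver:

    static void
    xx_intr(void *arg)
    {
        struct xx_softc *sc = arg;
        u_int32_t status;

        for (;;) {
            /* Reading the status register also acknowledges the
             * pending interrupts on this (hypothetical) chip. */
            status = CSR_READ_4(sc, XX_ISR);
            if ((status & XX_INTRS) == 0)
                break;                  /* nothing left pending */
            if (status & XX_ISR_RX_OK)
                xx_rxeof(sc);           /* drain the receive ring */
            if (status & XX_ISR_TX_OK)
                xx_txeof(sc);           /* reclaim transmit buffers */
            if (status & (XX_ISR_RX_ERR | XX_ISR_TX_ERR))
                xx_error(sc, status);   /* handle error conditions */
        }
        /* The loop runs until the NIC reports no more pending events,
         * which is why its duration is unbounded under load. */
    }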
2. THE CPU BOTTLENECK

As the load level increases, your interrupt rate also grows, and so does
the work that you perform in each call to XX_intr(). The increase in
interrupt rate is dangerous because it can make you waste a large
fraction of your CPU cycles just in overhead and context switches,
leading to a steep drop in the forwarding rate as the offered load
increases.

The growth in interrupt rate is typically sublinear in the incoming
packet rate, because as the latter increases, it becomes more and more
likely that you get to process a second packet before you can complete
the interrupt service routine. This might seem a good thing; but
unfortunately, there might be a third packet, and a fourth, and a
fifth... and so you might remain within your interrupt service routine
for a very long (potentially unbounded) amount of time. This is what
makes the system unresponsive under load, and prevents a fair sharing of
resources among devices.

This regime, where you spend all of your time servicing interrupts and
never get to do work outside the interrupt context, is often referred to
as livelock. Depending on how processing is done, some work might still
get done in this situation, but it is far from ideal.

It would seem that a way to make the system more responsive (or at
least, to achieve a better sharing of CPU time among devices) is to
limit the amount of time you spend in each invocation of the interrupt
service routine. Unfortunately, in an interrupt-based system, before you
return from the handler you are forced to acknowledge the interrupt
(typically by writing into the status register) and to process all
events that might have triggered it -- if you stop early, there might
not be another interrupt to wake you up and continue the leftover work,
so you have to set a timeout by independent means, or resort to polling.
(There are workarounds, but they involve not acknowledging interrupts in
the ISR, which translates into new requests being generated immediately
after the ISR returns. So these can solve the fairness problem, but not
the responsiveness problem.)

So, to summarise, interrupt-based systems can suffer, under load, from
high overhead, low responsiveness, and unfairness.

3. INTERRUPT MITIGATION

A technique that can be used to reduce the overhead (but not to improve
responsiveness or fairness) is to implement some form of "interrupt
coalescing" or "interrupt mitigation": generate an interrupt only after
a sufficient delay from the previous one, and/or after a minimum number
of packets have arrived. The tty drivers in Unix have done this for a
long time.

Some NICs can do interrupt mitigation by themselves -- e.g. the 21143
has a register where you can program a timeout and a count; the i82558
apparently can download microcode to do the same; and from my
experiments, some Tulip clones also appear to have a minimum delay
between interrupts which is larger than the minimum packet interarrival
time. In general, interrupt mitigation is almost a must for high speed
devices.

If the NIC cannot do interrupt mitigation, it is possible to simulate it
in software as follows: the low-level interrupt service routines (those
which ultimately invoke XX_intr()) keep track of how many times a given
interrupt has been generated in the last tick. When the number is above
a threshold, that interrupt line is kept masked on the interrupt
controller until the next clock tick. This does not require any support
in the hardware or in the driver, but as a drawback it might add some
additional latency (up to 1 tick) in the response to the interrupt, for
all the devices sharing the same line.
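A rough C sketch of this software scheme is shown below. It is only
meant to illustrate the idea and is not the patch set mentioned at the
top of this message; all the names (NIRQ, intr_count[],
intr_storm_threshold, mask_irq()/unmask_irq()) are made up:

    #define NIRQ 16                         /* number of interrupt lines */

    void mask_irq(int irq);                 /* assumed to be provided by  */
    void unmask_irq(int irq);               /* the platform interrupt code */

    static int intr_count[NIRQ];            /* interrupts seen this tick */
    static int intr_masked[NIRQ];           /* is the line masked now?   */
    static int intr_storm_threshold = 500;  /* max interrupts per tick   */

    /*
     * Called by the low-level interrupt dispatcher for each request,
     * before running the ISRs attached to the line.
     */
    void
    mitigate_intr(int irq)
    {
        if (++intr_count[irq] > intr_storm_threshold && !intr_masked[irq]) {
            mask_irq(irq);          /* keep the line off until next tick */
            intr_masked[irq] = 1;
        }
    }

    /*
     * Called once per clock tick: reset the counters and re-enable
     * whatever was masked during the previous tick.
     */
    void
    mitigate_tick(void)
    {
        int irq;

        for (irq = 0; irq < NIRQ; irq++) {
            intr_count[irq] = 0;
            if (intr_masked[irq]) {
                intr_masked[irq] = 0;
                unmask_irq(irq);    /* any pending request fires now */
            }
        }
    }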
4. THE BUS BOTTLENECK

As the incoming packet rate increases even further, you are likely to
hit another bottleneck, which may even kick in before the CPU
bottleneck: the bus bandwidth. The raw bandwidth of a PCI bus (33MHz,
32bit) is slightly above 1Gbit/s, and 4Gbit/s for a fast PCI bus (66MHz,
64bit). However, this is only the raw bandwidth -- each transaction on
the bus consumes a minimum of 4 (or is it 3?) cycles, plus an additional
cycle per word. A typical network controller acting as a bus master has
to perform the following operations for each received packet:

 + read the descriptor for the memory buffer;
 + write the packet to memory;
 + write the descriptor to memory (this might be a read-modify-write
   cycle).

For short packets (64 bytes) the overhead can easily reduce the useful
bandwidth to half of the theoretical maximum, or even less. On top of
this, you have to consider that forwarding a packet from one interface
to another requires the payload to cross the bus twice.

In order to offload some work from the CPU, some controllers
autonomously poll descriptors in memory to check whether there are more
packets to transmit, or whether new receive descriptors are available
for incoming packets. This helps the CPU, but generates even more
traffic on the bus, further reducing its capacity to transfer data.

Once the PCI bus becomes congested, it behaves exactly like a congested
network link: those units (CPU, NICs or other devices on the bus) trying
to access the bus might experience long delays waiting for it to become
available. The actual wait times depend on the load of the bus, on the
number of devices contending for access, and on the arbitration
algorithm used to grant access to the bus itself. As an example, for a
simple I/O read from a register on a PCI card, I have often measured
times as high as 10,000 clock cycles (for our 750MHz host, this is over
10us), during which the CPU cannot do anything. Unfortunately, this time
is totally unpredictable, and the CPU has no way to abort the
transaction once started. To understand the impact of this, consider
that an interrupt service routine typically needs to access the status
register on the NIC at least 3 times (the first to fetch the status, the
second to clear it, the third to make sure that there are no other
pending interrupts), so you can easily see how bad this can become.

5. POLLING

Polling works by disabling interrupts on the card and, in their place,
letting the operating system poll the status of the card whenever it
finds it convenient. This can happen e.g. once per timer tick, in the
idle loop, and when the kernel gets control because of a software
interrupt. This helps for two reasons.

First of all, the amount of work performed in each poll call can be
easily controlled. There is no need to process all events to completion,
because further events can be processed in the next poll call. This also
means that, when resources are scarce, the system can postpone
processing, and possibly let the hardware drop excess incoming traffic
(which would be dropped anyway by the software) with no overhead.

Secondly, the poll routine does not always need to check the status
registers for all events. Some information (e.g. a packet received or
transmitted) is already available in main memory, within the buffer
descriptors, so the polling does not cause further contention on the PCI
bus and has a more predictable cost. Of course, every now and then you
have to access the status registers to test for events that are only
reported there (e.g. running out of buffers, or error conditions), but
that can be done sufficiently rarely not to cause significant additional
overhead and/or contention.
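As an illustration of the last two paragraphs, a poll handler for the
same hypothetical "xx" driver might look roughly as follows. The
interface (a "count" budget passed in by the caller) and all the names
are assumptions made for this sketch, not the interface of the posted
patches:

    /*
     * Hypothetical poll handler, called from the clock tick, the idle
     * loop or a soft interrupt, with a budget of at most "count" packets.
     */
    static void
    xx_poll(struct xx_softc *sc, int count)
    {
        int done = 0;

        /*
         * Only look at the receive ring in main memory: the NIC has
         * already updated the descriptors via DMA, so no (slow, possibly
         * contended) PCI read of the status register is needed here.
         */
        while (done < count && xx_rxdesc_ready(sc))
            done += xx_rxeof_one(sc);       /* process one packet */

        xx_txeof(sc);                       /* reclaim completed transmits */

        /*
         * Rare events (errors, running out of buffers) are only reported
         * in the status registers; check them only once in a while.
         */
        if (++sc->xx_poll_calls >= XX_STATUS_CHECK_INTERVAL) {
            sc->xx_poll_calls = 0;
            xx_check_status(sc);            /* touches the PCI bus */
        }

        /* Any leftover packets will be picked up at the next poll. */
    }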
The drawbacks of polling are quite obvious: because you do not have an
instantaneous notification of the events you are interested in, there
might be some delay before you get a chance to process them. This delay
is normally limited to one clock tick (so it can easily be as low as 1ms
or less), and in practice can be even smaller when your system is not
heavily loaded and you poll the devices more frequently than once per
tick.

On the plus side, by using polling you can easily control how much time
you want to dedicate to the processing of I/O events, both globally and
on an individual basis. This translates into more predictable,
responsive and fair system behaviour.

	[THE END]

----------------------------------+-----------------------------------------
 Luigi RIZZO, luigi@iet.unipi.it  . ACIRI/ICSI (on leave from Univ. di Pisa)
 http://www.iet.unipi.it/~luigi/  . 1947 Center St, Berkeley CA 94704
                                  . Phone: (510) 666 2927
----------------------------------+-----------------------------------------