Date:      Sat, 27 Oct 2001 12:51:02 -0700
From:      Luigi Rizzo <rizzo@aciri.org>
To:        net@FreeBSD.ORG
Subject:   Polling vs Interrupts (was Re: NEW CODE: polling support...)
Message-ID:  <20011027125102.E80606@iguana.aciri.org>
In-Reply-To: <3BDA73C0.73DBCAE9@soekris.com>
References:  <20011027005905.A72758@iguana.aciri.org> <3BDA73C0.73DBCAE9@soekris.com>

Since the topic came up, I thought it would be useful to pass around
a short discussion on the performance implications of interrupts
and polling.

In addition to the code that I already posted, which implements
polling as discussed in Sec.5 below, I have another set of patches
to implement device-independent interrupt mitigation as discussed
near the end of Sec.3, and am finishing up some other code
that lets you say "I want to reserve XX percent of the CPU for
device drivers" and dynamically adapt the polling "counts" to
match the requirements.

	cheers
	luigi


[DISCLAIMER: what follows reflects my view of the world, and it is
mostly well-known stuff, documented for a long time in the literature.
I am sure others have already implemented some or all (or more) of
this; I am not claiming paternity of any of these ideas.]

    === Polling versus Interrupts in (network) device drivers ===

1. BASIC OPERATION OF INTERRUPTS

Typical network device drivers use interrupts to let the network
controller (NIC) communicate with the operating system:  when some
interesting event occurs, the NIC generates an interrupt request
which is dispatched to the processor through the interrupt controller.
This in turn causes the execution of a short piece of code (typically
written in assembler) which in the end invokes the interrupt service
routines (ISRs) for all the devices sharing the same interrupt line.

The ISR, which is typically a C function in the device driver,
say XX_intr(), normally does the following:

    for (;;) {
	<fetch status register from the NIC;>
	if (no interrupts pending)
		break;
	<write to the status reg. to reset pending interrupts>
	<process reported events>
    }
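
To make this more concrete, here is a minimal sketch of what such an
XX_intr() might look like in C for a hypothetical NIC. The CSR_*
macros, register bits and xx_* helpers are placeholders I made up for
illustration, not any particular driver's API:

    /* hypothetical ISR; all names below are invented placeholders */
    void
    xx_intr(void *arg)
    {
        struct xx_softc *sc = arg;
        u_int32_t status;

        for (;;) {
            /* fetch the status register from the NIC */
            status = CSR_READ_4(sc, XX_CSR_STATUS);
            if ((status & XX_INTR_MASK) == 0)
                break;              /* no interrupts pending */
            /* writing the bits back resets the pending interrupts */
            CSR_WRITE_4(sc, XX_CSR_STATUS, status);

            /* process the reported events */
            if (status & XX_INTR_RX)
                xx_rxeof(sc);       /* pass received packets up */
            if (status & XX_INTR_TX)
                xx_txeof(sc);       /* reclaim transmitted buffers */
            if (status & XX_INTR_ERR)
                xx_init(sc);        /* reinitialise on serious errors */
        }
    }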

The cost of getting and servicing an interrupt is made up of the
following components:

 + hardware and software context switch
 + software overhead to access the interrupt controller and
   dispatch the request to the appropriate ISR
 + time spent in each of the XX_intr() routines associated with the
   physical interrupt line.

The first two components can take anywhere from 1 to 20us depending
on your system. These times are relatively constant, irrespective
of the system load.  The cost of the last component is largely
variable and unpredictable, as we will see below.

At low load levels (incoming packet rates), a good fraction of the
interrupt cost is actually the overhead for context switch and
dispatch, but this is not something to worry about, because there
is enough CPU to serve all the interrupts you get. Even at 10000
packets per second, a state-of-the-art CPU may spend some 5-10% of
its time in overhead.
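
A quick back-of-the-envelope check (my arithmetic, assuming the fixed
per-interrupt cost sits around the middle of the 1-20us range above):

    10,000 interrupts/s x 5..10 us of fixed cost each
        = 50,000..100,000 us of overhead per second
        = 5..10% of one CPU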

2. THE CPU BOTTLENECK

As the load level increases, your interrupt rate also grows, and
so does the work that you perform in each call to XX_intr().

The increase in interrupt rate is dangerous because it can make
you waste a large fraction of your CPU cycles just in overhead and
context switches, leading to a steep drop of the forwarding rate
as the offered load increases.

The growth in interrupt rate is typically sublinear in the incoming
packet rate, because as the latter increases, it becomes more and
more likely that you get to process a second packet before you can
complete the interrupt service routine.

This might seem like a good thing; but unfortunately, there might be a
third packet, and a fourth, and a fifth... and so you might remain
within your interrupt service routine for a very long (potentially
unbounded) amount of time. This is what makes the system
unresponsive under load, and prevents a fair sharing of resources
among devices.

This regime, where you spend all of your time servicing interrupts
and have no time left to do work outside the interrupt context, is
often referred to as livelock. Depending on how processing is
done, you might still get some work done even in this situation,
but it is far from ideal.

It would seem that a way to make the system more responsive (or at
least, achieve better sharing of CPU time among devices) is to
limit the amount of time you spend in each invocation of the
interrupt service routine.

Unfortunately, in an interrupt-based system, before you return from
the handler you are forced to acknowledge the interrupt (typically by
writing into the status register) and to process all events that might
have triggered it -- if you stop early, there might not be another
interrupt to wake you up to continue the leftover work, so you
have to set a timeout by independent means, or resort to polling.
(There are workarounds, but they involve not acknowledging interrupts
in the ISR, which translates into new requests being generated
immediately after the ISR returns. So, these can solve the fairness
problem but not the responsiveness problem).

So, to summarise: interrupt-based systems can be subject, under
load, to high overhead, low responsiveness, and unfairness.

3. INTERRUPT MITIGATION

A technique that can be used to reduce the overhead (but not to
improve responsiveness or fairness) is to implement some form of
"interrupt coalescing" or "interrupt mitigation": generate an
interrupt only after a sufficient delay from the previous one,
and/or after a minimum number of packets have arrived.  The tty
drivers in unix have done this for a long time.

Some NICs can do interrupt mitigation by themselves -- e.g. the
21143 has a register where you can program timeout and count; the
i82558 apparently can download microcode to do the same, and from
my experiments, some Tulip-clones also appear to have a minimum
delay between interrupts which is larger than the minimum packet
interarrival time. In general, interrupt mitigation is
almost a must for high speed devices.

If the NIC cannot do interrupt mitigation, it is possible to simulate
it in software as follows: the low-level interrupt service routines
(those which ultimately invoke XX_intr()) keep track of how many
times a given interrupt has been generated in the last tick. When
the number is above a threshold, that interrupt line is kept masked
on the interrupt controller until the next clock tick.

This does not require any support in the hardware or in the driver,
but as a drawback, it might add some additional latency (up to 1 tick)
in the response to the interrupt, for all the devices sharing the
same line.
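
The following is a rough sketch of the idea (this is not the actual
patch; NIRQ, mask_irq(), unmask_irq() and run_handlers() are
placeholders for whatever the platform's low-level interrupt code
provides):

    #define INTR_MAX_PER_TICK   100     /* tunable threshold */

    static int intr_count[NIRQ];        /* interrupts seen this tick */
    static int intr_masked[NIRQ];       /* lines we masked ourselves */

    /* called by the low-level dispatch code on every interrupt */
    void
    intr_dispatch(int irq)
    {
        if (++intr_count[irq] > INTR_MAX_PER_TICK) {
            mask_irq(irq);              /* keep it masked on the PIC */
            intr_masked[irq] = 1;
            return;                     /* service resumes next tick */
        }
        run_handlers(irq);              /* run the XX_intr() chain */
    }

    /* called once per clock tick */
    void
    intr_mitigation_tick(void)
    {
        int irq;

        for (irq = 0; irq < NIRQ; irq++) {
            intr_count[irq] = 0;
            if (intr_masked[irq]) {
                intr_masked[irq] = 0;
                unmask_irq(irq);        /* pending work fires at once */
            }
        }
    }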

4. THE BUS BOTTLENECK

As the incoming packet rate increases even further, you are likely
to hit another bottleneck, which may even kick in before the CPU
bottleneck: the bus bandwidth.

The raw bandwidth of a PCI bus (33MHz, 32bit) is slightly above
1Gbit/s, and 4Gbit/s for a fast PCI bus (66MHz, 64bit).  However
this is only the raw bandwidth -- each transaction on the bus
consumes a minimum of 4 (or is it 3?) cycles, plus an additional
cycle per word. A typical network controller acting as a bus master
has to perform the following operations on each received packet:

 + read the descriptor for the memory buffer
 + write the packet to memory
 + write the descriptor to memory (this might be a read-modify-write
   cycle);

For short packets (64 bytes) the overhead can easily reduce the
useful bandwidth to half of the theoretical maximum, or even less.
On top of this you have to consider that forwarding a packet from
one interface to another requires the payload to cross the bus
twice.
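
To get a feel for the numbers (rough arithmetic, assuming 4 cycles of
overhead per transaction plus 1 cycle per 32-bit word, and descriptor
transfers of a couple of words each -- my assumptions, not the figures
for a specific chip):

    64-byte packet = 16 words of payload
    descriptor read   (2 words):  4 + 2  =  6 cycles
    packet write     (16 words):  4 + 16 = 20 cycles
    descriptor write  (1 word):   4 + 1  =  5 cycles
                                  total  = 31 cycles

Only 16 of those 31 cycles carry payload, i.e. roughly half the bus
capacity is gone before you even consider that a forwarded packet pays
this price on both the receive and the transmit side.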

In order to offload some work from the CPU, some controllers
autonomously poll descriptors in memory to check whether there are
more packets to transmit, or new receive descriptors available for
incoming packets. This helps the CPU, but generates even more traffic
on the bus, further reducing its capacity to transfer data.

Once the PCI bus becomes congested, it behaves exactly like
a congested network link: the units (CPU, NICs or other devices
on the bus) wanting to access the bus may experience long delays
waiting for it to become available. The actual wait times depend on
the load on the bus, on the number of devices contending for
access, and on the arbitration algorithm used to grant access to
the bus itself.

As an example, for a simple I/O read from a register on a PCI card,
I have often measured times as high as 10,000 clock cycles (for
our 750MHz host, this is over 10us) during which the CPU cannot do
anything. Unfortunately, this time is totally unpredictable, and
the CPU has no way to abort the transaction once started.

To understand the impact of this, consider that an interrupt service
routine typically needs to access the status register on the NIC
at least 3 times (once to fetch the status, once to
clear it, and once to make sure that there are no other pending
interrupts): at up to 10us per access, a single interrupt can burn
tens of microseconds in register I/O alone.

5. POLLING

Polling works by disabling interrupts on the card, and instead
letting the operating system poll the status of the card whenever
it finds it convenient. This can happen e.g. once per timer tick,
in the idle loop, and when the kernel gets control because of a
software interrupt.

This helps for two reasons.
First of all, the amount of work performed in each poll
call can be easily controlled. There is no need to process all
events to completion, because further events can be processed in
the next poll call.  This also means that, when resources are
scarce, the system can postpone processing, and possibly let the
hardware drop the excess incoming traffic (which would be dropped
anyway by the software) with no overhead.
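
As a sketch of how the work bound can look inside a driver (the
interface and names here are illustrative, not the API of the code I
posted), the caller hands the poll routine a budget and the routine
simply stops when the budget is exhausted:

    /* illustrative only: process at most 'budget' packets per call */
    int
    xx_poll(struct xx_softc *sc, int budget)
    {
        int done = 0;

        while (done < budget && xx_rx_ready(sc)) {
            xx_rx_one(sc);              /* hand one packet to the stack */
            done++;
        }
        xx_txeof(sc);                   /* reclaim completed transmits */

        /*
         * Leftover packets simply wait for the next poll; under
         * overload the receive ring fills up and the NIC drops the
         * excess in hardware, at no CPU cost.
         */
        return (done);                  /* the caller can adapt the budget */
    }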

Secondly, the poll routine does not always need to check status
registers for all events. Some information (e.g. packet received
or transmitted) is already available in main memory, within the
buffer descriptors, so polling for it does not cause further contention
on the PCI bus and has a more predictable cost.
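
For instance, the xx_rx_ready() check used in the sketch above can be
a plain memory read, assuming a tulip-like descriptor format where the
NIC clears an "owned" bit in the in-memory status word once it has
filled a receive buffer (again, names and layout are illustrative):

    static int
    xx_rx_ready(struct xx_softc *sc)
    {
        struct xx_desc *d = &sc->rx_ring[sc->rx_next];

        /* a read from host memory, not from a NIC register, so it
         * causes no extra PCI register transaction */
        return ((d->status & XX_DESC_OWN) == 0);
    }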

Of course, every now and then you have to access the status registers
to test for events that are only reported there (e.g. running out
of buffers, error conditions, etc.), but that can be done
rarely enough not to cause significant additional overhead
or contention.

The drawbacks of polling are quite obvious: because you do not have
an instantaneous notification of the events you are interested in,
there might be some delay before you get a chance to process them.
This delay is normally limited to one clock tick (so it
can easily be as low as 1ms or less), and in practice is often
smaller, because when your system is not heavily loaded you poll
the devices more frequently than once per tick.

On the plus side, by using polling you can easily control how much
time you want to dedicate to the processing of I/O events, both
globally and on a per-device basis. This translates into more
predictable, responsive and fair system behaviour.

[THE END]
----------------------------------+-----------------------------------------
 Luigi RIZZO, luigi@iet.unipi.it  . ACIRI/ICSI (on leave from Univ. di Pisa)
 http://www.iet.unipi.it/~luigi/  . 1947 Center St, Berkeley CA 94704
 Phone: (510) 666 2927
----------------------------------+-----------------------------------------




