Date:      Thu, 09 Aug 2001 01:39:57 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Weiguang SHI <weiguang_shi@hotmail.com>
Cc:        grog@FreeBSD.org, bmilekic@technokratis.com, dillon@earth.backplane.com, zzhang@cs.binghamton.edu, freebsd-hackers@FreeBSD.org
Subject:   Re: Allocate a page at interrupt time
Message-ID:  <3B724C5D.6CDD0063@mindspring.com>
References:  <F122JziWwSoRKbpG1Ki00002460@hotmail.com>

Weiguang SHI wrote:
> 
> I found an article on livelock at
> 
> http://www.research.compaq.com/wrl/people/mogul/mogulpubsextern.html
> 
> Just go there and search for "livelock".
> 
> But I don't agree with Terry about the interrupt-thread-is-bad
> thing, because, if I read it correctly, the authors themself
> implemented their ideas in interrupt thread of the Digital Unix.

Not quite.  These days, we are not necessarily talking about
just interrupt load limitations.

Feel free to take the following with a grain of salt; but
realize, I have personally achieved more simultaneous connections
on a FreeBSD box than anyone else out there without my code in
hand, and this was using gigabit ethernet controllers on modern
hardware, and further, this code is in shipping product today.

--

The number one way of dealing with excess load is to load-shed
it before the load causes problems.

In an interrupt threads implementation, you can't really do
this, since the only option you have is when to schedule a
polling operation.  This leads to several inefficiencies,
all of which negatively impact the top-end performance you
are going to be able to achieve.
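The point about shedding load before it costs you anything can be
sketched in a few lines of C.  This is a hypothetical illustration
(all names invented, not actual FreeBSD driver code): the drop
decision is made at the earliest possible point, before the packet
consumes any further CPU or queue memory.

```c
#include <assert.h>

#define RX_QUEUE_LIMIT 256      /* illustrative backlog limit */

struct rx_queue {
    int depth;                  /* packets queued, not yet processed */
};

/* Decide, per packet, whether to accept or shed it.  Returns 1 if
 * the packet is dropped.  The cheapest disposal is the one that
 * happens before any work is queued on the packet's behalf. */
static int
rx_maybe_shed(struct rx_queue *q)
{
    if (q->depth >= RX_QUEUE_LIMIT)
        return 1;               /* shed: no further work done */
    q->depth++;                 /* accept: enqueue for processing */
    return 0;
}
```

With an interrupt-threads design you never get to make this per-packet
decision at the right time; the only knob is when the polling thread
runs.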

Use of interrupt threads suffers from drastically increased
latency in re-enabling interrupts, and can generally perform
only a single polling cycle without running into the problem
of failing to make forward progress at the application level
(they run at IPL 0, which is effectively the same time at which
NETISR currently runs).  This leads to a tradeoff between
increased interrupt handling latency (e.g. the Tigon II Gigabit
Ethernet driver in FreeBSD sets the Tigon II card firmware to
coalesce at most 32 interrupts) and the transmit starvation
problem noted in section 4.4 of the paper.
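The latency-vs-starvation tradeoff amounts to bounding the work done
per polling cycle.  A minimal sketch of the idea (invented names; the
budget of 32 echoes the Tigon II coalescing limit mentioned above):

```c
#include <assert.h>

#define POLL_BUDGET 32          /* cf. the Tigon II coalescing limit */

struct rx_ring {
    int pending;                /* packets DMA'd in, not yet handled */
    int processed;              /* running total, for illustration */
};

/* Process at most `budget` packets per cycle, then return so that
 * transmit work and the application can make forward progress.
 * Returns the number actually handled; the caller re-enables
 * interrupts once the ring is drained. */
static int
rx_poll(struct rx_ring *r, int budget)
{
    int done = 0;

    while (r->pending > 0 && done < budget) {
        r->pending--;           /* consume one receive descriptor */
        r->processed++;
        done++;
    }
    return done;
}
```

A larger budget means fewer interrupts but longer stretches during
which nothing else runs; a smaller one means the reverse.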

It should also be noted that, even if you have not re-enabled
interrupts, the DMA engine on the card will still be DMA'ing
data into your receive ring buffer.  The burst data rate on
a 66MHz, 64-bit PCI bus is just over 4Gbit/s, and the sustainable
data rate is much lower than that.
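The bus figure is back-of-the-envelope arithmetic: clock rate times
bus width.  A sketch (function name invented):

```c
#include <assert.h>

/* Theoretical burst rate of a PCI bus: clock rate times bus width.
 * For 66 MHz x 64 bits this is 4.224e9 bits/s, i.e. just over
 * 4 Gbit/s; sustained rates are considerably lower because of
 * arbitration, setup overhead, and competing bus masters. */
static double
pci_burst_bits_per_sec(double clock_hz, int width_bits)
{
    return clock_hz * (double)width_bits;
}
```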

This means a machine acting as a switch or firewall with two
of these cards on board will not really have much time for
doing anything at all, except DMA transfers, if they are run
at full burst speed all the time (not possible).  Running an
application which requires disk activity will further eat into
the available bandwidth.

So this raises the spectre of DMA-based bus transfer livelock:
not just interrupt based livelock, if one is scheduling interrupt
threads to do event polling, instead of using one of the other
approaches outlined in the paper.

In the DEC UNIX case, they mitigated the problem by getting rid
of the IP input queue, and getting rid of NETISR (I agree that
these are required of any code with these goals).  The use of
the polling thread is really just their way of implementing the
polling approach, from section 5.3.  This does not address the
problems I noted above, and in particular, does not address the
latency vs. bus livelock tradeoff problem with modern hardware
(they were using an AMD LANCE Ethernet chip; this was a 10Mb
chip, and it doesn't support interrupt coalescing).  They also
assumed the use of a user space forwarding agent ("screend"):
a single process.

Further, I think that the feedback mechanism selected is not
really workable, without rewriting the card firmware, and
having a significant memory buffer on the card, something which
is not yet available on the market.  This is because,
in practice, you can't stop all incoming packet processing just
because one user space program out of dozens has a full input
queue that the user space program has not processed yet.  It's
not reasonable to ignore new incoming requests to a web server,
or to disable card interrupts, or to (for example) drop all ARP
packets until TCP processing for that one application is complete:
their basic assumption -- which they admit, in section 6.6.1 --
is that screend is the only application running on the system.
This is simply not the case with a high traffic web server, a
database system, or any other work-to-do-engine model with several
processes (or threads) of identical capability to service the
incoming requests.
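The alternative to a single global feedback signal is to demultiplex
early and charge each packet to its destination endpoint, dropping
only for the endpoint that is actually backlogged.  A hypothetical
sketch of that idea (invented names, not the paper's mechanism):

```c
#include <assert.h>

#define SOCKQ_LIMIT 64          /* illustrative per-socket backlog */

struct sockq {
    int depth;                  /* packets awaiting this endpoint */
};

/* Early demultiplex: the packet is charged to its destination
 * socket's queue.  Only packets for an overloaded endpoint are
 * dropped; ARP, new connections, and other sockets proceed
 * normally.  Returns 1 if the packet was dropped. */
static int
deliver_or_drop(struct sockq *dst)
{
    if (dst->depth >= SOCKQ_LIMIT)
        return 1;               /* shed for this endpoint only */
    dst->depth++;
    return 0;
}
```

Contrast this with disabling card interrupts outright, which punishes
every flow on the machine for one slow consumer.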

Further, these applications use TCP, and thus have explicitly
application bound socket endpoints, and there is no way to
guarantee client load.  We could trivially DOS attack an Apache
server running SSL via mod_proxy, for example, by sending a flood
of intentionally bad packets.  The computation expense would keep
its input queue full, and therefore, the feedback mechanism noted
would starve the other Apache processes of legitimate input.
There are other obvious attacks, which are no less damaging in
their results, which attack other points in the assumption of a
single process queue feedback mechanism.

Their scheduler in section 7 is in effect identical to the
"fixed" scheduling class in SVR4 (which was used by USL to avoid
the "move mouse, wiggle cursor" problem when using the UnixWare
linker, which mmap'ed all object files and then seeked all over
in them, thrashing all other pages out of the buffer cache; the
fix was to give the X server a fixed portion of the available
CPU with which to thrash those pages back in -- a highly
non-optimal fix).  It really only addresses CPU contention, at
the cost of pessimizing the amount of CPU available for packet
processing when that's all the work there is to do, and at the
expense of "chatty" network applications: precisely the sort of
thing FreeBSD is being used for these days: NFS servers, HTTP
servers, etc.  Clearly, the quota was set from the perspective
of needing to support an interactive user load, which is neither
common nor expected in the applications where FreeBSD is
typically deployed these days.  Section 7.1 notes this
tangentially, in a discussion of NFS.

Section 10 notes the differential rates of memory vs. CPU speed
increase, and that their approach may have some problems; my
opinion is that the polling approach is correct, that there is
probably room for something like a weighted fair-share method
for doing scheduling, and that using the 5.1 or 5.2 method,
rather than the interrupt thread method in 5.3, would yield
significantly better results on SMP and high-speed-interface
systems that are limited not by interfaces or interrupts (per
se), but by bus bandwidth (recent work at various universities
seems to confirm this opinion, FWIW).
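To make "weighted fair share" concrete, here is a rough sketch of the
general idea as applied to polling multiple sources -- invented names,
and not the mechanism of section 5.1 or 5.2 itself: each source gets a
quota proportional to its weight per round, so no one interface can
monopolize the CPU.

```c
#include <assert.h>

struct source {
    int weight;                 /* relative share per polling round */
    int pending;                /* work items waiting */
    int serviced;               /* running total, for illustration */
};

/* One round of weighted round-robin polling: each source may
 * process up to `weight` units of work, then the round moves on.
 * Over many rounds, CPU divides in proportion to the weights. */
static void
fair_poll_round(struct source *srcs, int n)
{
    for (int i = 0; i < n; i++) {
        int quota = srcs[i].weight;

        while (srcs[i].pending > 0 && quota-- > 0) {
            srcs[i].pending--;
            srcs[i].serviced++;
        }
    }
}
```

A real implementation would carry unused quota forward and adapt
weights to load; this only shows the shape of the scheduling decision.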

Overall, this is a seminal paper, and identifies almost every
one of the important issues; and it admits where it may be wrong
if things change in the future -- I would argue that things have
changed.

-- Terry
