Date:      Wed, 10 Oct 2001 01:59:48 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        "Kenneth D. Merry" <ken@kdm.org>
Cc:        current@FreeBSD.ORG
Subject:   Re: Why do soft interrupt coelescing?
Message-ID:  <3BC40E04.D89ECB05@mindspring.com>
References:  <3BBF5E49.65AF9D8E@mindspring.com> <20011006144418.A6779@panzer.kdm.org> <3BC00ABC.20ECAAD8@mindspring.com> <20011008231046.A10472@panzer.kdm.org> <3BC34FC2.6AF8C872@mindspring.com> <20011010000604.A19388@panzer.kdm.org>

"Kenneth D. Merry" wrote:
> eh?  The card won't write past the point that has been acked by the kernel.
> If the kernel hasn't acked the packets and one of the receive rings fills
> up, the card will hold off on sending packets up to the kernel.

Uh, eh?

You mean the card will hold off on DMA and interrupts?  This
has not been my experience.  Is this with firmware other than
the default and/or the 4.3-RELEASE FreeBSD driver?


> I agree that you can end up spending large portions of your time doing
> interrupt processing, but I haven't seen instances of "buffer overrun", at
> least not in the kernel.  The case where you'll see a "buffer overrun", at
> least with the ti(4) driver, is when you have a sender that's faster than
> the receiver.
> 
> So the receiver can't process the data in time and the card just drops
> packets.

OK, assuming you meant that the copies would stall, and the
data not be copied (which is technically the right thing to
do, assuming a source quench style livelock avoidance, which
doesn't currently exist)...

The problem is still that you end up doing interrupt processing
until you run out of mbufs, and then you have the problem of
not being able to transmit responses, for lack of mbufs.

In the ti driver case, the inability to get another mbuf to
replace the one that will be taken out of the ring means that
the mbuf gets reused for more data -- NOT that the data flow
in the form of DMA from the card ends up being halted until
mbufs become available.
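
To make that concrete, here is a compressed sketch of that
refill path.  It is not the literal ti(4) source; the variable
and field names are made up, but the policy is the one described
above: if no replacement cluster can be allocated, the old mbuf
stays in the ring and the packet is counted as an input error,
so DMA from the card never stops.

	struct mbuf *m, *m_new;

	m = sc->rx_ring[idx].mbuf;	/* buffer the card just filled */

	MGETHDR(m_new, M_DONTWAIT, MT_DATA);
	if (m_new != NULL) {
		MCLGET(m_new, M_DONTWAIT);
		if ((m_new->m_flags & M_EXT) == 0) {
			m_freem(m_new);
			m_new = NULL;
		}
	}

	if (m_new == NULL) {
		/* No replacement: recycle the old mbuf, lose the data. */
		ifp->if_ierrors++;
	} else {
		sc->rx_ring[idx].mbuf = m_new;	/* fresh buffer for the card */
		m->m_pkthdr.rcvif = ifp;
		m->m_pkthdr.len = m->m_len = rx_len; /* length from descriptor */
		/* ... hand m up the stack (ether_input() and friends) ... */
	}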

The real problem here is that most received packets want a
response; for things like web servers, where the average
request is ~0.5k and the average response is ~11k, this means
that you would need to establish use-based watermarking, to
separate the mbuf pool into transmit and receive resources;
in practice, this doesn't really work, if you are getting
your content from a separate server (e.g. an NFS server that
provides content for a web farm, etc.).


> That's a different situation from the card spamming the receive
> ring over and over again, which is what you're describing.  I've
> never seen that happen, and if it does actually happen, I'd be
> interested in seeing evidence.

Please look at what happens in the case of an allocation
failure, for any driver that does not allow shrinking the
ring of receive mbufs (the ti is one example).


> > Without hacking firmware, the best you can do is to ensure
> > that you process as much of all the traffic as you possibly
> > can, and that means avoiding livelock.
> 
> Uhh, the Tigon firmware *does* drop packets when there is no
> more room in the proper receive ring on the host side.  It
> doesn't spam things.
> 
> What gives you that idea?  You've really got some strange ideas
> about what goes on with that board.  Why would someone design
> firmware so obviously broken?

The driver does it on purpose, by not giving away the mbuf
in the receive ring, until it has an mbuf to replace it.

Maybe this should be rewritten to not mark the packet as
received, and thus allow the card to overwrite it.  There
are two problems with that approach, however.  The first is
what happens when you reach mbuf exhaustion and the only way
to clear out received mbufs is to process the data in a user
space application which never gets to run, and which, when
it does get to run, can't write a response for a request and
response protocol, so it can't free up any mbufs.  The
second is that, in the face of a denial of service attack,
the correct approach (according to Van Jacobson) is to do a
random drop, and rely on the fact that the attack packets,
being proportionally more of the queue contents, get dropped
with a higher probability... so while you _can_ do this, it
is really a bad idea if you are trying to make your stack
robust against attacks.
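
Here is a sketch of the random drop idea, with invented names
(the real ipintrq is an ifqueue, not an array): when the queue
is full, evict a random entry instead of refusing the new
arrival, so whoever owns most of the queue eats most of the
drops.

	if (q->len < QMAX) {
		q->slot[q->len++] = m;
	} else {
		int victim = arc4random() % QMAX;

		m_freem(q->slot[victim]);	/* random victim, not tail drop */
		q->slot[victim] = m;		/* new arrival takes its place */
	}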

The other thing that you appear to be missing is that the
most common case is that you have plenty of mbufs, and you
keep getting interrupts, replacing the mbufs in the receive
ring, and pushing the data into the ether input by giving
away the full mbufs.

The problem occurs when you are receiving at such a high rate
that you don't have any free cycles to run NETISR, and thus
you cannot empty the queue that ipintr drains.

In other words, it's not really the card's fault that the OS
didn't run the stack at hardware interrupt.
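
Boiled down to the classic 4.4BSD shape (simplified, counters
and details omitted), the split looks like this.  At hardware
interrupt time, ether_input() just queues the packet and asks
for a soft interrupt:

	s = splimp();
	if (IF_QFULL(&ipintrq)) {
		IF_DROP(&ipintrq);		/* queue full: drop */
		m_freem(m);
	} else {
		IF_ENQUEUE(&ipintrq, m);
		schednetisr(NETISR_IP);		/* run ipintr() "later" */
	}
	splx(s);

At NETISR time, ipintr() is what actually drains the queue:

	for (;;) {
		s = splimp();
		IF_DEQUEUE(&ipintrq, m);
		splx(s);
		if (m == NULL)
			break;
		ip_input(m);
	}

If hardware interrupts never leave any cycles for the soft
interrupt, ipintrq stays full and everything behind it gets
dropped: that is the livelock.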

> > What this means is that you get more benefit in the soft
> > interrupt coalescing than you otherwise would get, when
> > you are doing LRP.
> >
> > But, you do get *some* benefit from doing it anyway, even
> > if your ether input processing is light: so long as it is
> > non-zero, you get benefit.
> >
> > Note that LRP itself is not a panacea for livelock, since
> > it just moves the scheduling problem from the IRQ<->NETISR
> > scheduling into the NETISR<->process scheduling.  You end
> > up needing to implement weighted fair share or other code
> > to ensure that the user space process is permitted to run,
> > so you end up monitoring queue depth or something else,
> > and deciding not to reenable interrupts on the card until
> > you hit a low water mark, indicating processing has taken
> > place (see the papers by Druschel et. al. and Floyd et. al.).
> 
> It sounds like it might be better handled with better scheduling in
> combination with interrupt coalescing in the hardware.

Hardware interrupt coalescing stalls your processing until
the coalescing window becomes full, or the coalescing idle
timer fires.
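
Seen from the card's side, it amounts to something like this
(the parameter names are invented; on the Tigon they are
roughly the "coalescing ticks" and "max coalesced BDs"
tunables):

	static int coal_max_pkts = 32;		/* window size */
	static int coal_max_usecs = 100;	/* idle/holdoff timer */

	static int
	should_interrupt_host(int pkts_pending, int usecs_pending)
	{
		if (pkts_pending >= coal_max_pkts)
			return (1);		/* window full */
		if (pkts_pending > 0 && usecs_pending >= coal_max_usecs)
			return (1);		/* idle timer fired */
		return (0);			/* keep the host waiting */
	}

Until one of those two conditions is met, the packets already
in host memory just sit there.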

Really, the best way to handle it is polling, but when you
do that, there is nothing left over for useful work, other
than the work of moving packets (e.g. it's OK for L2 or L4
switching, but can't handle L7 switching, and can't handle
user processes not involved with actual networking).  I've
seen this same hardware, with rewritten firmware, do about
100,000 connections a second, with polling, but that number
at that point is no longer a figure of merit, since in
order to actually have a product, you have to be able to
get real work done (Alteon quotes ~128,000/second for their
most recent product offering).


> Is burdening the hardware interrupt handler even more the
> best solution?

I don't really understand the question: you have to do the
work sometime, in response to an external, real-world event.
Polling reduces latency in handling the event closer to when
it actually occurs, but means that you can only do a limited
set of work.  Handling the event in response to a signal
that it has taken place (i.e. an interrupt) is the best
approach to making it so that you can do other work until
your attention is required.  Polling multiple times until
there is no more work (e.g. coalescing the work into a
single interrupt) is as close as you'll get to pure polling
without actually doing pure polling.
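
As a sketch, that "loop until there is no more work" handler
looks something like the following; the function names are
placeholders, not real driver entry points:

	static void
	my_intr(void *arg)
	{
		struct my_softc *sc = arg;
		int pass, did_work;

		mask_card_interrupts(sc);
		for (pass = 0; pass < 100; pass++) {
			did_work = drain_rx_ring(sc);	/* packets handled */
			did_work += drain_tx_ring(sc);	/* completions reaped */
			if (did_work == 0)
				break;		/* rings empty: stop looping */
		}
		unmask_card_interrupts(sc);
	}

A burst of packets costs one interrupt instead of one apiece,
and the handler gives the CPU back as soon as the rings go
empty.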

I don't really see why doing as much work as possible by
directly coupling to the "work to do" event -- the interrupt --
is being considered a bad thing.  It goes against all the
high speed networking research of the last decade, so with
no contrary evidence, it's really hard to support that view.


> > I was chewed on before for context diffs.  As I said before,
> > I can provide them, if that's the current coin of the realm; it
> > doesn't matter to me.
> 
> That should be the default mode -- as Peter pointed out, non-context diffs
> will work far less often and are far more difficult to decipher than
> context diffs.

Is this a request for me to resend the diffs?


> > The problem I have is that you simply can't use jumbograms
> > in a commercial product, if they can't be autonegotiated,
> > or you will burn all your profit in technical support calls
> > very quickly.
> 
> The MTU isn't something that gets negotiated -- it's something that is set
> per-network-segment.
> 
> Properly functioning routers will either fragment packets (that don't have
> the DF bit set) or send back ICMP responses telling the sender their packet
> should be smaller.

Sure.  So you set the DF bit, and then start with honking big
packets, sending progressively smaller ones, until you get
a response.
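
As a sketch, with send_probe_df() standing in for "send a
datagram with DF set and see whether it gets through or draws
an ICMP fragmentation-needed" (it is not a real API):

	static int send_probe_df(int size);	/* hypothetical helper */

	static int
	discover_path_mtu(int link_mtu)
	{
		int size;

		for (size = link_mtu; size > 576; size -= 256)
			if (send_probe_df(size) == 0)	/* 0 == got through */
				return (size);
		return (576);		/* conventional IPv4 floor */
	}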


> > Intel can autonegotiate with several manufacturers, and the
> > Tigon III.  It can interoperate with the Tigon II, if you
> > statically configure it for jumbograms.
> 
> This sounds screwed up; you don't "autonegotiate" jumbo frames, at least on
> the NIC/ethernet level.  You set the MTU to whatever the maximum is for
> that network, and then you let your routing equipment handle the rest.
> 
> If you're complaining that the Intel board won't accept jumbo frames from
> the Alteon board by default (i.e. without setting the MTU to 9000) that's
> no surprise at all.  Why should a board accept a large packet when its MTU
> is set to something smaller?

It's the Alteon board that craps out.  Intel to Intel works
fine, when you "just start using them" on one end.


> I don't think there's any requirement that a card accept packets larger
> than its configured MTU.

You're right that it's not a requirement: it's just a customer
expectation that your hardware will try to operate in the most
efficient mode possible, and all other modes will be fallback
positions (which you will then complain about in your logs).


> > Another interesting thing is that it is often a much better
> > idea to negotiate an 8k MTU for jumbograms.  The reason for
> > this is that it fits evenly into 4 mbuf clusters.
> 
> Yeah, that's nice.  FreeBSD's TCP stack will send 8KB payloads
> automatically if your MTU is configured for something more than that.
> 
> That comes in handy for zero copy implementations.

Ugh.  I was unaware that FreeBSD's stack would not honor a 9k MTU.


> > There are actually some good arguments in there for having
> > non-fixed-sized mbufs...
> 
> Perhaps, but not that many.

1)	mbuf cluster headers waste most of their space.

2)	max MTU sized mbufs have no wasted space.

3)	(obsolete, sadly) the tcptempl structure is 60
	bytes, not 256, but uses a full mbuf, wasting
	more than 2/3rds of the space.

4)	9k packets
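
For the arithmetic behind (2) and (4): an mbuf cluster is 2k
(MCLBYTES is 2048), so an 8k payload fills exactly four
clusters (4 x 2048 = 8192), while a 9000 byte jumbogram needs
a fifth cluster (10240 bytes total) and wastes about 1.2k of
it.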

-- Terry
