Date:      Mon, 15 Oct 2001 00:24:07 -0600
From:      "Kenneth D. Merry" <ken@kdm.org>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        current@FreeBSD.ORG
Subject:   Re: Why do soft interrupt coalescing?
Message-ID:  <20011015002407.A59917@panzer.kdm.org>
In-Reply-To: <3BC55201.EC273414@mindspring.com>; from tlambert2@mindspring.com on Thu, Oct 11, 2001 at 01:02:09AM -0700
References:  <3BBF5E49.65AF9D8E@mindspring.com> <20011006144418.A6779@panzer.kdm.org> <3BC00ABC.20ECAAD8@mindspring.com> <20011008231046.A10472@panzer.kdm.org> <3BC34FC2.6AF8C872@mindspring.com> <20011010000604.A19388@panzer.kdm.org> <3BC40E04.D89ECB05@mindspring.com> <20011010232020.A27019@panzer.kdm.org> <3BC55201.EC273414@mindspring.com>

On Thu, Oct 11, 2001 at 01:02:09 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> > If the receive ring for that packet size is full, it will hold off on
> > DMAs.  If all receive rings are full, there's no reason to send more
> > interrupts.
> 
> I think that this does nothing, in the FreeBSD case, since the
> data from the card will generally be drained much faster than
> it accrues, into the input queue.  Whether it gets processed
> out of there before you run out of mbufs is another matter.
> 
> [ ... ]
> 
> > Anyway, if all three rings fill up, then yes, there won't be a reason to
> > send receive interrupts.
> 
> I think this can't really happen, since interrupt processing
> has the highest priority, compared to stack processing or
> application level processing.  8-(.

Yep, it doesn't happen very often in the default case.

> > > OK, assuming you meant that the copies would stall, and the
> > > data not be copied (which is technically the right thing to
> > > do, assuming a source quench style livelock avoidance, which
> > > doesn't currently exist)...
> > 
> > The data isn't copied, it's DMAed from the card to host memory.  The card
> > will save incoming packets to a point, but once it runs out of memory to
> > store them it starts dropping packets altogether.
> 
> I think that the DMA will not be stalled, at least as the driver
> currently exists; you and I agreed on that already (see below).
> My concern in this case is that, if the card is using the bus to
> copy packets from card memory to the receive ring, then the bus
> isn't available for other work, which is bad.  It's better to
> drop the packets before putting them in card memory (FIFO drop
> fails to avoid the case where a continuous attack pushes all
> good packets out).

Dropping packets before they get into card memory would only be possible
with some sort of traffic shaper/dropping mechanism on the wire to drop
things before they get to the card at all.

> > > The problem is still that you end up doing interrupt processing
> > > until you run out of mbufs, and then you have the problem of
> > > not being able to transmit responses, for lack of mbufs.
> > 
> > In theory you would have configured your system with enough mbufs
> > to handle the situation, and the slowness of the system would cause
> > the windows on the sender to fill up, so they'll stop sending data
> > until the receiver starts responding again.  That's the whole purpose
> > of backoff and slow start -- to find a happy medium for the
> > transmitter and receiver so that data flows at a constant rate.
> 
> In practice, mbuf memory is just as overcommitted as all other
> memory, and given a connection count target, you are talking a
> full transmit and full receive window worth of data at 16k a
> pop -- 32k per connection.
> 
> Even a modest maximum connection count of ~30,000 connections --
> something even an unpatched 4.3 FreeBSD could handle -- means
> that you need 1G of RAM for the connections alone, if you disallow
> overcommit.  In practice, that would mean ~20,000 connections,
> when you count page table entries, open file table entries, vnodes,
> inpcb's, tcpcb's, etc..  And that's a generous estimate, which
> assumes that you tweak your kernel properly.

You could always just put 4G of RAM in the machine, since memory is so
cheap now. :)

At some point you'll hit a limit in the number of connections the processor
can actually handle.
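
(For reference, the arithmetic above works out as follows: a full 16k send
window plus a full 16k receive window is 32k of buffer space per
connection, and 30,000 connections at 32k apiece is roughly 960M, call it
1G, before you count page table entries, file table entries, vnodes,
inpcbs or tcpcbs.)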

> One approach to this is to control the window sizes based on
> the amount of free reserve you have available, but this will
> actually damage overall throughput, particularly on links
> with a higher latency.

Yep.

> > > In the ti driver case, the inability to get another mbuf to
> > > replace the one that will be taken out of the ring means that
> > > the mbuf gets reused for more data -- NOT that the data flow
> > > in the form of DMA from the card ends up being halted until
> > > mbufs become available.
> > 
> > True.
> 
> This is actually very bad: you want to drop packets before you
> insert them into the queue, rather than after they are in the
> queue.  This is because you want the probability of the drop
> (assuming the queue is not maxed out: otherwise, the probability
> should be 100%) to be proportional to the exponential moving
> average of the queue depth, after that depth exceeds a drop
> threshold.  In other words, you want to use RED.

Which queue?  The packets are dropped before they get to ether_input().

Dropping random packets would be difficult.
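
For reference, the RED calculation described above boils down to something
like the sketch below.  This is just the textbook Floyd/Jacobson algorithm;
the names and constants are made up for illustration, and none of it is in
the ti(4) driver today:

#include <stdlib.h>

#define RED_WQ_SHIFT    9       /* averaging weight = 1/512 */
#define RED_MIN_TH      32      /* start dropping above this average depth */
#define RED_MAX_TH      128     /* drop everything above this average depth */

static long red_avg;            /* average queue depth << RED_WQ_SHIFT */

/*
 * Called at enqueue time with the instantaneous queue depth; returns
 * nonzero if this packet should be dropped.
 */
static int
red_should_drop(int qlen)
{
        long avg;

        /* avg += w * (qlen - avg), with w = 2^-RED_WQ_SHIFT */
        red_avg += (((long)qlen << RED_WQ_SHIFT) - red_avg) /
            (1L << RED_WQ_SHIFT);
        avg = red_avg >> RED_WQ_SHIFT;

        if (avg < RED_MIN_TH)
                return (0);     /* queue is short, keep everything */
        if (avg >= RED_MAX_TH)
                return (1);     /* queue is maxed out, drop everything */

        /*
         * In between, drop with probability proportional to how far the
         * average depth is above the drop threshold.
         */
        return ((random() % (RED_MAX_TH - RED_MIN_TH)) < (avg - RED_MIN_TH));
}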

> > > Please look at what happens in the case of an allocation
> > > failure, for any driver that does not allow shrinking the
> > > ring of receive mbufs (the ti is one example).
> > 
> > It doesn't spam things, which is what you were suggesting before, but
> > as you pointed out, it will effectively drop packets if it can't get new
> > mbufs.
> 
> Maybe I'm being harsh in calling it "spam'ming".  It does the
> wrong thing, by dropping the oldest unprocessed packets first.
> A FIFO drop is absolutely the wrong thing to do in an attack
> or overload case, when you want to shed load.  I consider the
> packet that is being dropped to have been "spam'med" by the
> card replacing it with another packet, rather than dropping
> the replacement packet instead.
> 
> The real place for this drop is "before it gets to card memory",
> not "after it is in host memory"; Floyd, Jacobsen, Mogul, etc.,
> all agree on that.

As I mentioned above, how would you do that without some sort of traffic
shaper on the wire?

> > Yes, it could shrink the pool, by just not replacing those mbufs in the
> > ring (and therefore not notifying the card that that slot is available
> > again), but then it would likely need some mechanism to allow it to be
> > notified that another buffer is available for it, so it can then allocate
> > receive buffers again.
> > 
> > In practice I haven't found the number of mbufs in the system to be a
> > problem that would really make this a big issue.  I generally configure
> > the number of mbufs to be high enough that it isn't a problem in the
> > first place.
> 
> I have a nice test that I would be happy to run for you in
> the lab.  It loads a server up with 100,000 simultaneous
> long duration downloads, replacing each client with a new
> one when that client's download is complete.
> 
> In the case that you run out of mbufs, FreeBSD just locks
> up solid, unless you make modifications to the way that
> packets are processed (or maintain a "transmit mbufs free
> reserve" to endure against the deadly embrace deadlock).
> 
> If you have another approach to resolving the deadlock,
> I'm all ears... I'll repeat it for you in the lab, if you
> are willing to stare at it with me... we would probably be
> willing to put you on full time, if you were able to do
> something about the problem, other than what I've done.  8-).

My focus with gigabit ethernet was to get maximal throughput out of a small
number of connections.  Dealing with a large number of connections is a
different problem, and I'm sure it uncovers lots of nifty bugs.

> > > The driver does it on purpose, by not giving away the mbuf
> > > in the receive ring, until it has an mbuf to replace it.
> > 
> > The driver does effectively drop packets, but it doesn't spam
> > over another packet that has already gone up the stack.
> 
> It wastes a DMA, by DMA'ing over the already DMA'ed packet,
> so we eat unnecessary bus bandwidth lossage as a result (see
> above, as to why I called it "spam").

Yep.

> > > Maybe this should be rewritten to not mark the packet as
> > > received, and thus allow the card to overwrite it.
> > 
> > That wouldn't really work, since the card knows it has DMAed into that
> > slot, and won't DMA another packet in its place.  The current approach is
> > the equivalent, really.  The driver tells the card the packet is received,
> > and if it can't allocate another mbuf to replace that mbuf, it just puts
> > that mbuf back in place.  So the card will end up overwriting that packet.
> 
> I'd actually prefer to avoid the other DMA; I'd also like
> to avoid the packet receipt order change that results from
> DMA'ing over the previous contents, in the case that an mbuf
> can't be allocated.  I'd rather just let good packets in with
> a low (but non-zero) statistical probability, relative to a
> slew of bad packets, rather than letting a lot of bad packets
> from a persistent attacker push my good data out with the bad.

Easier said than done -- dropping random packets would be difficult with a
ring-based structure.  Probably what you'd have to do is keep an extra pool
of mbufs lying around and throw them in at random times when mbufs run out,
so that some packets still get through.

The problem is, once you exhaust that pool, you're back to the same old
problem if you're completely out of mbufs.

You could probably also start shrinking the number of buffers in the ring,
but as I said before, you'd need a method for the kernel to notify the
driver that more mbufs are available.
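
To make the current behavior concrete, the receive path does roughly the
following.  The xx_* names, the softc layout and the helper functions are
simplified stand-ins, not the actual ti(4) code; the replace-or-recycle
logic is the part that matters, and the failure branch is where a reserve
pool or a ring-shrinking scheme would plug in:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>

struct xx_softc {
        struct ifnet    *ifp;           /* stand-in for the real softc */
        struct mbuf     *rx_chain[256]; /* mbuf for each receive ring slot */
};

void    xx_newbuf(struct xx_softc *, int, struct mbuf *);  /* re-arm a slot */
void    xx_input(struct xx_softc *, struct mbuf *);        /* to ether_input() */

static void
xx_rxeof(struct xx_softc *sc, int idx)
{
        struct mbuf *m, *m_new;

        m = sc->rx_chain[idx];

        /* Try to allocate a replacement mbuf cluster for this ring slot. */
        MGETHDR(m_new, M_DONTWAIT, MT_DATA);
        if (m_new != NULL) {
                MCLGET(m_new, M_DONTWAIT);
                if ((m_new->m_flags & M_EXT) == 0) {
                        m_freem(m_new);
                        m_new = NULL;
                }
        }

        if (m_new == NULL) {
                /*
                 * No replacement: put the old mbuf back in the ring, so
                 * the card DMAs over it.  The packet it holds is the one
                 * that effectively gets dropped.
                 */
                sc->ifp->if_ierrors++;
                xx_newbuf(sc, idx, m);
                return;
        }

        /* Re-arm the slot with the new mbuf and pass the old one up. */
        xx_newbuf(sc, idx, m_new);
        xx_input(sc, m);
}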

> [ ... ]
> 
> > The main thing I would see is that when an interrupt handler takes
> > a long time to complete, it's going to hold off other devices
> > sharing that interrupt.  (Or interrupt mask, perhaps.)
> 
> If you are sharing interrupts at this network load, then
> you are doing the wrong thing in your system setup.  If
> you don't have such a high load, then it's not a problem;
> either way, you can avoid the problem, for the most part.

It's easier with an SMP system (i.e., you've got an APIC).

> > This may have changed in -current with interrupt threads, though.
> 
> It hasn't, as far as I can tell; in fact, the move to a
> separate kernel thread in order to process the NETISR
> makes things worse, from what I can see.
> 
> The thing you have to do to reintroduce fairness is to make
> a decision to _not_ reenable the interrupts.  The most
> simplistic way to do this is to maintain queue depth counts
> for amount of data on its way to user space, and then make
> a conscious decision at the high watermark to _not_ reenable
> interrupts.  The draining of the queue then looks for the
> low watermark, and when you hit it, reenables the interrupts.
> 
> This is a really crude form of what's called "Weighted Fair
> Share Queueing".  There's actually a good paper on it off
> the main page of the QLinux project (second hit on a Google
> search for "QLinux").
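
The watermark scheme is simple enough to sketch, reusing the made-up
xx_softc from the earlier sketch; hw_rxint_enable() stands in for whatever
register write a particular card needs to mask or unmask its receive
interrupt:

#define RX_HIWAT        (512 * 1024)    /* stop taking interrupts here */
#define RX_LOWAT        (128 * 1024)    /* start taking them again here */

struct xx_softc;
void    hw_rxint_enable(struct xx_softc *, int);

static long     rx_queued;              /* bytes queued toward user space */
static int      rx_throttled;

/* Called when the driver queues received data up toward user space. */
static void
rx_account_enqueue(struct xx_softc *sc, int len)
{
        rx_queued += len;
        if (!rx_throttled && rx_queued >= RX_HIWAT) {
                hw_rxint_enable(sc, 0);         /* leave interrupts masked */
                rx_throttled = 1;
        }
}

/* Called as the upper layers drain that data. */
static void
rx_account_dequeue(struct xx_softc *sc, int len)
{
        rx_queued -= len;
        if (rx_throttled && rx_queued <= RX_LOWAT) {
                hw_rxint_enable(sc, 1);         /* take interrupts again */
                rx_throttled = 0;
        }
}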
> 
> > > Is this a request for me to resend the diffs?
> > 
> > Yes.
> 
> OK, I will rediff and generate context diffs; expect them to
> be sent in 24 hours or so from now.

It's been longer than that...

> > > Sure.  So you set the DF bit, and then start with honking big
> > > packets, sending progressively smaller ones, until you get
> > > a response.
> > 
> > Generally the ICMP response tells you how big the maximum MTU is, so you
> > don't have to guess.
> 
> Maybe it's the ICMP response; I still haven't had a chance to
> hold Michael down and drag the information out of him.  8-).

Maybe what's the ICMP response?
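
For reference, the ICMP response I was talking about is the RFC 1191
"fragmentation needed and DF set" error (type 3, code 4).  The router that
can't forward the packet sends it back with the MTU of the constricting
hop in it, so the sender doesn't have to probe by trial and error.  The
wire layout is roughly this (an illustrative struct, not the one from
<netinet/ip_icmp.h>):

#include <sys/types.h>

struct icmp_frag_needed {
        u_int8_t        icmp_type;      /* 3: destination unreachable */
        u_int8_t        icmp_code;      /* 4: frag needed and DF set */
        u_int16_t       icmp_cksum;
        u_int16_t       icmp_unused;
        u_int16_t       icmp_nextmtu;   /* MTU of the constricting hop */
        /* followed by the offending IP header plus 8 bytes of its data */
};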

> [ ... ]
> 
> > The two Product X boxes use TCP connections between each other, and happily
> > negotiate a MSS of 8960 or so.  They start sending data packets, but
> > nothing gets through.
> > 
> > How would product X detect this situation?  Most switches I've seen don't
> > send back ICMP packets to tell the sender to change his route MTU.  They
> > just drop the packets.  In that situation, though, you can't tell the
> > difference between the other end being down, the cable getting pulled,
> > switch getting powered off or the MTU on the switch being too small.
> 
> Cisco boxes detect "black hole" routes; I'd have to read the
> white paper, rather than just its abstract, to tell you how,
> though...

It depends on what they're trying to do with the information.  If they're
just trying to route around a problem, that's one thing.  If they're trying
to diagnose MTU problems, that's quite another.

In general, it's going to be pretty easy for routers to detect when a
packet exceeds the MTU for one of their interfaces and send back an ICMP
packet.
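
(As an aside, the 8960-byte MSS in the example above is just the usual
9000-byte jumbo frame MTU minus 20 bytes of IP header and 20 bytes of TCP
header.)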

> > It's a lot easier to just have the user configure the MTU.
> 
> Not for the user.

Configuring the MTU is a standard part of configuring IP networks.  If your
users aren't smart enough to do it, you'll pretty much have to default to
1500 bytes for ethernet.

You can let the more clueful users increase the MTU.
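It's a one-line change for them anyway; something along the lines of
"ifconfig ti0 <address> netmask <netmask> mtu 9000" (with whatever
interface name and addresses apply) should be all it takes on the FreeBSD
side.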

If you're supplying enough of the equipment, you can make some assumptions
about the equipment setup.  This was the case with my former employer -- in
many cases we supplied the switch as well as the machines to go onto the
network, so we knew ahead of time that jumbo frames would work.  Otherwise,
we'd work with the customer to set things up with standard or jumbo frames
depending on their network architecture.

> Maybe making gigabit cards not need twisty cables to wire them
> together has just set expectations too high...  8-).

Well, they do use twisty cables, but that's hidden inside a sheath. :)

> > So, what if you did try to back off and retransmit at progressively smaller
> > sizes?  That won't work in all cases.  If you're the receiver, and the
> > sender isn't one of your boxes, you have no way of knowing whether the
> > sender is down or what, and you have no way of guessing that his problem is
> > that the switch doesn't support the large MSS you've negotiated.  There's
> > also no way for you to back off, since you're not the one transmitting the
> > data, and your acks get through just fine.
> 
> It's ugly, but possible.  You can always detect this by dicking
> with the hop count, etc..

As I said above, if you're the receiver, you don't know there is a problem,
because your acks get back just fine.

> Actually, given gigabit speeds, you should be able to sink an
> incredible amount of processing into it, and still get done
> relatively quickly after the carrier goes on.

At least on the receiving end, processing doesn't have a lot to do with it.

> In any case, Intel cards appear to do it, and so do Tigon III's.

That's nice, but there's no reason a card has to accept packets with a
higher MTU than it is configured for.

Ken
-- 
Kenneth Merry
ken@kdm.org
