Date:      Sun, 14 Oct 2001 02:33:57 -0500 (CDT)
From:      Mike Silbersack <silby@silby.com>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        <freebsd-current@freebsd.org>
Subject:   Re: Some interrupt coalescing tests
Message-ID:  <20011014014546.L36700-100000@achilles.silby.com>
In-Reply-To: <3BC93111.E0804A37@mindspring.com>


On Sat, 13 Oct 2001, Terry Lambert wrote:

> Mike Silbersack wrote:
>
> One issue to be careful of here is that the removal of the
> tcptmpl actually causes a performance hit that wasn't there
> in the 4.3 code.  My original complaint about tcptmpl taking
> up 256 instead of 60 bytes stands, but I'm more than half
> convinced that making it take up 60 bytes is OK... or at
> least is more OK than allocating and deallocating each time,
> and I don't yet have a better answer to the problem.  4.3
> doesn't have this change, but 4.4 does.

I need benchmarks to prove the slowdown, Terry.  The testing I performed
(which is limited, of course) showed no measurable speed difference.
Remember that the only time the tcptempl mbuf ever gets allocated now is
when a keepalive is sent, which is a rare event.  The rest of the time,
it's just copying the data from the preexisting structures over to the new
packet.  If you can show me that this method is slower, I will move it
over to a zone-allocated setup like you proposed.
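
To make sure we're talking about the same path, here's roughly what I
mean (sketch only - tcp_fill_keepalive_hdr() is a made-up name standing
in for whatever copies the headers out of the pcb, not the actual
tcp_timer.c code):

/*
 * The mbuf exists only for this one probe; nothing is cached
 * per-connection between probes.
 */
static void
tcp_keepalive_probe(struct tcpcb *tp)
{
        struct mbuf *m;

        MGETHDR(m, M_DONTWAIT, MT_HEADER);
        if (m == NULL)
                return;         /* skip this probe, retry next timeout */
        m->m_len = sizeof(struct ip) + sizeof(struct tcphdr);

        /* copy addrs/ports/seq from the existing inpcb/tcpcb state */
        tcp_fill_keepalive_hdr(tp, mtod(m, struct ip *));

        /* ... checksum it and hand it to ip_output(), which consumes
           the mbuf; nothing to free or cache afterwards ... */
}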

> > I'm not sure if the number was lower because the celeron couldn't run the
> > flooder as quickly, or if the -current box was dropping packets.  I
> > suspect the latter, as the -current box was NOTICEABLY slowed down; I
> > could watch systat refresh the screen.
>
> This is unfortunate; it's an effect that I expected with
> the -current code, because of the change to the interrupt
> processing path.
>
> To clarify here, the slowdown occurred both with and without
> the patch, right?
>
> The problem here is that when you hit livelock (full CPU
> utilization), then you are pretty much unable to do anything
> at all, unless the code path goes all the way to the top of
> the stack.

Yep, the -current box livelocked with and without the patch.  I'm not sure
if -current is solely to blame, though.  My -current box is using a PNIC,
which incurs additional overhead relative to other tulip clones, according
to the driver's comments.  And the 3com in that box hasn't worked in a
while... maybe I should try debugging that so I have an additional test
point.

> > The conclusion?  I think that the dc driver does a good enough job of
> > grabbing multiple packets at once, and won't be helped by Terry's patch
> > except in a very few cases.
>
> 10% is a good improvement; my gut feeling is that it would
> have been less than that.  This is actually good news for
> me, since it means that my 30% number is bounded by the
> user space program not being run (in other words, I should
> be able to get considerably better performance, using a
> weighted fair share scheduler).  As long as it doesn't
> damage performance, I think that it's proven itself.

Hm, true, I guess the improvement is respectable.  My concern is mostly
that I'm not sure how much it extends the performance range of a
system; testing with more varied packet loads, as Alfred suggested,
would help answer that.

> > In fact, I have a sneaky suspicion that Terry's patch may
> > increase bus traffic slightly.  I'm not sure how much of
> > an issue this is, perhaps Bill or Luigi could comment.
>
> This would be interesting to me, as well.  I gave Luigi an
> early copy of the patch to play with a while ago, and also
> copied Bill.
>
> I'm interested in how you think it could increase traffic;
> the only credible reason I've been able to come up with is
> the ability to push more packets through, when they would
> otherwise end up being dropped because of the queue full
> condition -- if this is the case, the bus traffic is real
> work, and not additional overhead.

What I was wondering about is the extra polling of the bus when there
are no additional packets to grab.  Compared to the quantity of packet
data going by, though, I guess it's not a real issue.

> > In short, if we're going to try to tackle high interrupt load,
> > it should be done by disabling interrupts and going to polling
> > under high load;
>
> I would agree with this, except that it's only really a
> useful observation if FreeBSD is being used as purely a
> network processor.  Without interrupts, the polling will
> take a significant portion of the available CPU to do, and
> you can't burn that CPU if, for example, you have an SSL
> card that does your handshakes, but you need to run the SSL
> sessions themselves up in user space.

Straight polling isn't necessarily the solution I was thinking of, but
rather some form of interrupt disabling at high rates.  For example, if
the driver kept track of how many interrupts/second it was taking, it
could raise the number of receive buffers from 64 to something higher,
disable the card's interrupt, and set a callback to run a short time
later, at which point interrupts would be re-enabled and the interrupt
handler run.  Ideally, this could greatly reduce the number of
interrupts, increasing efficiency under load.  Paired with this could
be receive polling during transmit, something which does not seem to
be done currently, if I'm reading the code right.

I'm not sure of the feasibility of the above, unfortunately - it would
seem highly dependent on how short a timeout we can realistically get,
along with how many mbufs we can spare for receive buffers.
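
Something along these lines is what I'm picturing, hooked into
dc_intr() (sketch only - DC_INTR_THRESHOLD, dc_intr_rate and dc_poll_ch
are made up, and I'm ignoring the spl/locking details):

        /* in dc_intr(), once the measured rate crosses the threshold: */
        if (sc->dc_intr_rate > DC_INTR_THRESHOLD) {
                CSR_WRITE_4(sc, DC_IMR, 0x00000000);   /* mask chip intrs */
                sc->dc_poll_ch = timeout(dc_reenable_intr, sc, 1);
                return;
        }

static void
dc_reenable_intr(void *arg)
{
        struct dc_softc *sc = arg;

        dc_rxeof(sc);           /* drain whatever queued up meanwhile */
        dc_txeof(sc);
        CSR_WRITE_4(sc, DC_IMR, DC_INTRS);      /* unmask again */
}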

> I'd argue that the complexity is coming, no matter what.  If
> you separate out the tx_eof and rx_eof entry points, and
> externalize them into the ethernet driver interface, in order
> to enable polling, you are going to need to have a return
> value on them, as well.
>
> To implement scheduling, this return value is going to need
> to give a packet count, so that you can forego polling every
> N packets (N > fair share threshold), or else you are not
> going to be able to do any additional processing.

True.
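
To make it concrete, I'm picturing something like this on the caller
side (dc_rxeof_poll()/dc_txeof_poll() are made-up names, just to show
where the packet count would come back):

        int budget, handled;

        budget = FAIR_SHARE_PKTS;       /* per-pass packet budget */
        while (budget > 0) {
                handled  = dc_rxeof_poll(sc, budget);
                handled += dc_txeof_poll(sc, budget);
                if (handled == 0)
                        break;          /* rings drained, stop polling */
                budget -= handled;
        }
        /* whatever is left of the timeslice goes back to user space */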

> NB: If you are interested in pure connection rate, and you
> want to protect against SYN flood, then your best bet is
> actually to simply put a SYN-cookie implementation into the
> firmware for the card, and deal with connection setup that
> way.  With that approach, you should be able to easily
> support a quarter million connections a second.

True.  Note that in my test I wasn't actually simulating a SYN flood,
even though I was using a SYN flooder: I just pointed it at a closed
port.  With most NICs, offloading SYN cookies to them isn't an option.
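
(For reference, the trick is just that all the state for the half-open
connection gets packed into the ISN the card would send back, so
nothing has to be remembered until the final ACK shows up;
schematically, and not the real code - cookie_hash() stands in for a
keyed hash over the 4-tuple plus a secret:)

u_int32_t
syn_cookie(struct in_addr src, struct in_addr dst,
    u_int16_t sport, u_int16_t dport, int mss_idx, u_int32_t t)
{
        u_int32_t h = cookie_hash(src, dst, sport, dport, t);

        return (((t & 0x1f) << 27) |            /* 5 bits of time  */
            ((mss_idx & 0x7) << 24) |           /* 3-bit MSS index */
            (h & 0x00ffffff));                  /* 24-bit hash     */
}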

> > I suppose this would all change if we were using LRP and doing lots of
> > processing in the interrupt handler... but we aren't.
>
> This is probably a back-handed poke at me not making the
> code available.  I have to either clear it with my employer,
> or reimplement it from scratch (not hard; I could probably
> improve it significantly, were I to do this, knowing what I
> now know).  I'm in the process of getting an approval list
> together.

It's a statement of reality more than anything.  Even if you were able
to release the LRP code, I'm not sure getting it integrated would be
possible unless it's -current compatible.  Of course, one could argue
that if it increases performance then -current should be made
compatible with it... but let's stick to discussing interrupts for now.

Mike "Silby" Silbersack

