Date:      Mon, 18 Nov 2002 17:12:56 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Luigi Rizzo <rizzo@icir.org>
Cc:        David Gilbert <dgilbert@velocet.ca>, dolemite@wuli.nu, freebsd-hackers@FreeBSD.ORG, freebsd-net@FreeBSD.ORG
Subject:   Re: Small initial LRP processing patch vs. -current
Message-ID:  <3DD99018.73B703A@mindspring.com>
References:  <20021109180321.GA559@unknown.nycap.rr.com> <3DCD8761.5763AAB2@mindspring.com> <15823.51640.68022.555852@canoe.velocet.net> <3DD1865E.B9C72DF5@mindspring.com> <15826.24074.605709.966155@canoe.velocet.net> <3DD2F33E.BE136568@mindspring.com> <3DD96FC0.B77331A1@mindspring.com> <20021118151109.B19767@xorpc.icir.org>

Luigi Rizzo wrote:
> Strictly speaking it is not necessary to get rid of the ipintr()
> code, it will just remain unused, so the relevant part of the patch is
> the direct call of ip_input() instead of schednetisr().
> 
> This patch will not make any difference if you have device_polling
> enabled, because polling already does this -- queues a small number
> of packets (default is max 5 per card) and calls ip_input on them
> right away.

The problem with this is that it introduces a livelock point at
the queue.  I understand that this is tunable, but it is still a
problem.

> The increase on the peak performance will be, i guess, largely
> dependent on the load and on caching effects -- one expects that
> processing packets right away will cause the cache to be warm
> and thus the overall processing be faster.

Actually, the increase in peak performance should come from calling
the ip_input routine immediately, with interrupts disabled, and then
calling it again while packets are still pending.

As the name of this patch implies, there is more to a complete
implementation than the patch I have posted, which only gets
rid of the NETISR latency.
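
For reference, this is roughly what the relevant change looks like
inside ether_demux() -- just a sketch, with locking and error handling
simplified, and DIRECT_IP_INPUT is a made-up name for this message,
not something the actual patch uses:

    case ETHERTYPE_IP:
    #ifdef DIRECT_IP_INPUT
            /*
             * Patched path: run the IP stack right away, in the
             * receive interrupt, while the mbuf is still cache-hot.
             */
            ip_input(m);
    #else
            /*
             * Stock path: queue on ipintrq and schedule NETISR_IP;
             * ip_input() runs later, from the soft interrupt.
             */
            if (IF_QFULL(&ipintrq)) {
                    IF_DROP(&ipintrq);
                    m_freem(m);
            } else {
                    IF_ENQUEUE(&ipintrq, m);
                    schednetisr(NETISR_IP);
            }
    #endif
            break;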

The polling code runs from hardclock, etc., when ether_poll() is
called, and so it still suffers that latency in the DEVICE_POLLING
case.

> I do not understand this claim:
> 
> > The basic theory here is that ipintr processing can be delayed
> > indefinitely, if interrupt load is high enough, and there will
> > be a maximum latency of 10ms for IP processing after ether_input(),
> > in the normal stack case, without the patches.
> 
> because netisr are not timer driven to the best of my knowledge --
> they just fire right after the cards' interrupts are complete.

That's almost right.  The soft interrupt handlers run when you
splx() out of a raised priority level.  In fact, this happens at
the end of clockintr, so NETISR *is* timer driven: with the default
hz of 100, a clock tick comes every 10ms, which is where that 10ms
figure comes from.  That holds until you hit an interrupt saturation
point, where no soft interrupts run at all.
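
To make the "timer driven" part concrete, the relationship looks
roughly like this (a simplified sketch, not the real i386 spl
implementation; cpl, ipending and splz() are the kernel's existing
names):

    void
    splx(int old_cpl)
    {
            cpl = old_cpl;
            /*
             * Dropping the priority level is what lets pending soft
             * interrupts run.  hardclock() fires every 1/hz seconds
             * and exits through this path, so a queued packet waits
             * at most one tick for NETISR_IP -- unless back-to-back
             * hardware interrupts keep the IPL raised, in which case
             * the soft interrupts never get a chance to run.
             */
            if ((ipending & ~cpl) != 0)
                    splz();         /* run pending soft interrupts */
    }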

If your interrupt load is high enough that you receive another
interrupt before you are done processing the previous interrupt,
then you never get a chance to run soft interrupts, at least until
you run out of mbufs.  That's why the normal packet processing
throughput curve looks like:

|
|       ...  <- Receiver livelock point
|     .    .
|   .      .
|  .       .
| .        .
| .        .
|.          .
|.           .
|.            .........................
+--------------------------------------------

Polling changes this somewhat.  The top end is reduced, in exchange
for not dropping off as badly:

|
|       ...  <- Receiver livelock point
|     * +++*
|   *      .++ ++++ ++++ ++++ ++++ +++
|  *       .  +    +    +    +    +
| *        .
| *        .
|*          .
|*           .
|*            .........................
+--------------------------------------------

The reason the top end is depressed is that the polling is not load
adaptive.  Instead, it imposes a fixed overhead (e.g. the use of
the hardclock to effect calls to ether_poll(), which then calls into
all drivers, whether they have data available or not, fetching state
across the PCI bus just to ask the cards if there is data available).
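
The shape of that loop is roughly this (simplified from the
DEVICE_POLLING code in kern_poll.c; pr[] and poll_handlers are its
driver registration table, and the handler arguments here are
approximate):

    static void
    ether_poll(int count)
    {
            int i;

            for (i = 0; i < poll_handlers; i++) {
                    /*
                     * Every registered driver is visited on every
                     * pass, whether or not it has packets pending;
                     * each handler has to read device state across
                     * the PCI bus just to find out.  That per-tick
                     * cost is paid even for an idle card, which is
                     * what caps the top end.
                     */
                    pr[i].handler(pr[i].ifp, POLL_ONLY, count);
            }
    }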

In the LRP case (or at least the case of the patch I posted), there
is the same disabling of the interrupts so that the stack can run,
and the poll occurs under high load, but only the cards that have
data are polled (through hardware interrupt handlers plus soft
interrupt coalescing).  The net effect is that you gain better
efficiency under extreme load, because the system switches from
interrupts to polling, and then back again, based on the load.
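
Purely as an illustration of that switch (this is NOT the LRP
patches; under_heavy_load(), mark_for_poll() and the xx_* names are
invented for the example), the idea in a driver's interrupt handler
is roughly:

    static void
    xx_intr(void *arg)
    {
            struct xx_softc *sc = arg;

            if (under_heavy_load()) {
                    /*
                     * Shed to polling: mask this card's interrupt and
                     * put it on the list of cards with pending work;
                     * the stack polls only the cards on that list,
                     * and the interrupt is unmasked once the card
                     * drains.
                     */
                    xx_disable_intr(sc);
                    mark_for_poll(sc);
                    return;
            }
            /* Light load: normal interrupt-driven receive path. */
            xx_rxeof(sc);
    }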

The assumption here is that network processing is as important as
the interrupts which result in network processing.

There's also a cache line win, as long as the stack code you end up
running, plus the driver, all fits into L1.

The thing that polling buys you that this patch does not is the
ability to run to completion up to the socket layer.  As I've
already pointed out, I have patches for that, too; I just want a
measured performance delta for running the IP stack at hardware
interrupt vs. NETISR.

Basically, this means that LRP will have two fewer latency barriers,
and one fewer stall barrier, than polling.

Since the overall measure of networking equipment is how many
packets you can pump through the thing in time "T", anything you
do to increase that number is good.

In any case, I'd like to see someone who can load to livelock try
(the polling knobs are noted after the list):

1)	Unmodified FreeBSD-current without polling
2)	"" with polling
3)	"" without polling, plus this patch
4)	"" with polling, plus this patch
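
For the "with polling" cases, the usual knobs on -current are the
DEVICE_POLLING option (plus the recommended HZ bump) in the kernel
config, and the runtime sysctl:

    options         DEVICE_POLLING
    options         HZ=1000

    sysctl kern.polling.enable=1    # set to 0 to go back to interrupts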

FWIW, I expect to still hit livelock at the high end, since the
processing up through the socket layer still happens at NETISR
(the patch only affects IP stack operation).  But I expect the
removal of the latency in processing to push up the high end
before the livelock occurs.

With polling enabled, I expect only a minor change.

-- Terry
