Date:      Mon, 30 Jan 2012 10:28:24 +0000
From:      "Robert N. M. Watson" <rwatson@FreeBSD.org>
To:        Коньков Евгений <kes-kes@yandex.ru>
Cc:        freebsd-bugs@FreeBSD.org, bz@FreeBSD.org
Subject:   Re: misc/164130: broken netisr initialization
Message-ID:  <A8A57BF5-3EF7-43A3-8106-ED93A82C71F1@FreeBSD.org>
In-Reply-To: <154594163.20120117194113@yandex.ru>
References:  <201201142126.q0ELQVbZ087496@freefall.freebsd.org> <68477246.20120115000025@yandex.ru> <737885D7-5DC2-4A0D-A5DF-4A380D035648@FreeBSD.org> <154594163.20120117194113@yandex.ru>


On 17 Jan 2012, at 17:41, Коньков Евгений wrote:

> Loads only netisr3.
> and question: ip works over ethernet. How you can distinguish ip and ether???

netstat -Q is showing you per-protocol (per-layer) processing statistics.
An IP packet arriving via ethernet will typically be counted twice: once
for ethernet input/decapsulation, and once for IP-layer processing.
Netisr dispatch serves a number of purposes, not least preventing
excessive stack depth and recursion, and balancing load across CPUs.
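
To make the double counting concrete, here is a rough sketch, from memory
rather than verbatim kernel source, of how one received frame passes
through two netisr protocols. The function names are invented for the
sake of illustration; only netisr_dispatch(), NETISR_ETHER and NETISR_IP
are real:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/netisr.h>

/*
 * Illustrative only: the ethernet layer is one netisr protocol and IP
 * is another, and the second is dispatched from inside the first, so a
 * single wire packet increments both the "ether" and "ip" counters.
 */
static void
example_driver_input(struct mbuf *m)
{
        netisr_dispatch(NETISR_ETHER, m);       /* counted against "ether" */
}

static void
example_ether_demux(struct mbuf *m)
{
        /* Ethernet header already stripped; hand the payload to IP. */
        netisr_dispatch(NETISR_IP, m);          /* counted against "ip" */
}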

There has been a historic tension between deferred (queued) dispatch to a
separate worker and direct dispatch ("process to completion"). The former
offers more opportunities for parallelism and reduces the time spent in
interrupt-layer processing. However, where parallelism is already
available at a higher level, the latter reduces overhead and overall
packet latency by avoiding queueing/scheduling costs, and avoids
migrating packets between CPU caches, which reduces cache coherency
traffic. Our general experience is that many common configurations,
especially lower-end systems *and* systems with multi-queue 10gbps cards,
prefer direct dispatch. However, there are forwarding scenarios, or ones
in which the CPU count significantly outnumbers the NIC input queue
count, where queueing to additional workers can markedly improve
performance.

In FreeBSD 9.0 we've attempted to improve the vocabulary of expressible
policies in netisr so that we can explore which work best in various
scenarios, giving users more flexibility but also attempting to determine
a better longer-term model. Ideally, as with the VM system, these
features would be to some extent self-tuning, but we don't have enough
information and experience to decide how best to do that yet.
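
For reference, the user-visible knobs in 9.0 look roughly like the below.
This is a sketch from memory, so verify the exact names and defaults
against netisr(9) and sysctl(8) on your system:

# /boot/loader.conf (tunables read at boot; illustrative values)
net.isr.maxthreads=4      # cap on the number of netisr worker threads
net.isr.bindthreads=1     # bind each worker thread to a CPU

# global dispatch policy, adjustable at runtime:
# sysctl net.isr.dispatch=deferred   # one of: direct, hybrid, deferred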

>     NETISR_POLICY_FLOW    netisr should maintain flow ordering as defined by
>                           the mbuf header flow ID field.  If the protocol
>                           implements nh_m2flow, then netisr will query the
>                           protocol in the event that the mbuf doesn't have a
>                           flow ID, falling back on source ordering.
>
>     NETISR_POLICY_CPU     netisr will entirely delegate all work placement
>                           decisions to the protocol, querying nh_m2cpuid for
>                           each packet.
>
> _FLOW: description says that the cpuid is discovered by flow.
> _CPU: here the decision to choose a CPU is delegated to the protocol.
> maybe it would be clearer to name it: NETISR_POLICY_PROTO ???

The name has to do with the nature of the information returned by the
netisr protocol handler -- in the former case, the protocol returns a
flow identifier, which is used by netisr to calculate an affinity. In the
latter case, the protocol returns a CPU affinity directly.
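
A rough sketch of what the two hooks hand back may make this concrete.
The callback shapes below are approximated from net/netisr.h as I
remember them, and the bodies are placeholders, so treat this as
illustrative rather than authoritative:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/netisr.h>

/*
 * NETISR_POLICY_FLOW: nh_m2flow only annotates the mbuf with a flow ID;
 * netisr itself then maps that flow ID onto a CPU.
 */
static struct mbuf *
example_m2flow(struct mbuf *m, uintptr_t source)
{
        m->m_pkthdr.flowid = 0;         /* placeholder: hash of addrs/ports */
        m->m_flags |= M_FLOWID;
        return (m);
}

/*
 * NETISR_POLICY_CPU: nh_m2cpuid hands back a CPU number directly and
 * netisr uses it as-is.
 */
static struct mbuf *
example_m2cpuid(struct mbuf *m, uintptr_t source, u_int *cpuid)
{
        *cpuid = 0;                     /* placeholder: protocol-chosen CPU */
        return (m);
}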

> and BIG QUESTION: why do you allow somebody (flow, proto) to make any
> decisions??? That is wrong: a bad implementation/decision on their part
> may cause packets to be scheduled only to some CPU. So one CPU will be
> overloaded (0% idle) while another will be free (100% idle).

I think you're confusing policy and mechanism. The above KPIs are about
providing the mechanism to implement a variety of policies. Many of the
policies we are interested in are not yet implemented, or available only
as patches. Keep in mind that workloads and systems are highly variable,
with variable costs for work dispatch, etc. We run on high-end Intel
servers, where individual CPUs tend to be very powerful but not all that
plentiful, but also on embedded multi-threaded MIPS devices with many
hardware threads, each individually quite weak. Deferred dispatch is a
better choice for the latter, where there are optimised handoff
primitives to help avoid queueing overhead, whereas in the former case
you really want NIC-backed work dispatch, which will generally mean you
want direct dispatch with multiple ithreads (one per queue) rather than
multiple netisr threads. Using deferred dispatch in Intel-style
environments is generally unproductive, since high-end configurations
will support multi-queue input already, and CPUs are quite powerful.


>> * Enforcing ordering limits the opportunity for concurrency, but maintains
>> * the strong ordering requirements found in some protocols, such as TCP.
> TCP does not require strong ordering requirements!!! Maybe you mean UDP?

I think most people would disagree with this. Reordering TCP segments
leads to extremely poor TCP behaviour: the receiver generates duplicate
ACKs for out-of-order segments, which can push the sender into spurious
fast retransmit and congestion-window reduction. There is an extensive
research literature on this, and maintaining ordering for TCP flows is a
critical network stack design goal.

> To get full concurrency you must put a new flowid on a free CPU and
> remember the cpuid for that flow.

Stateful assignment of flows to CPUs is of significant interest to us,
although currently we only support hash-based assignment without state.
In large part, that decision is a good one, as multi-queue network cards
are highly variable in terms of the size of the state tables they offer
for offloading flow-specific affinity policies. For example, lower-end
10gbps cards may support state tables with 32 entries; high-end cards
may support state tables with tens of thousands of entries.
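
As a sketch of what the stateless approach amounts to (the nws_array and
nws_count names are the ones visible in sys/net/netisr.c, but the exact
expression here is written from memory):

#include <sys/types.h>

static u_int    nws_count;      /* number of active netisr workers */
static u_int    nws_array[64];  /* worker slot -> CPU id */

/*
 * Deterministic, stateless placement: the same flow ID always selects
 * the same worker/CPU, so per-flow ordering is preserved without any
 * per-flow table in software.
 */
static u_int
example_flow_to_cpu(uint32_t flowid)
{
        return (nws_array[flowid % nws_count]);
}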

> Just hash the packet flow to the number of threads: net.isr.numthreads
> nws_array[flowid] = hash( flowid, sourceid, ifp->if_index, source )
> if( cpuload( nws_array[flowid] ) > 99 )
>     nws_array[flowid]++;  // queue packet to another CPU
>
> that will be just ten lines of code instead of 50 in your case.

We support a more complex KPI because we need to support future policies
that are more complex. For example, there are out-of-tree changes that
align TCP-level and netisr-level per-CPU data structures and affinity
with NIC RSS support. The algorithm you've suggested above explicitly
introduces reordering: if a flow is bumped to another CPU mid-stream
while earlier packets of that flow are still queued on the original CPU,
the later packets can be processed first. That would significantly damage
network performance, even though it appears to balance CPU load better.

> Also notice you have:
> /*
>  * Utility routines for protocols that implement their own mapping of
>  * flows to CPUs.
>  */
> u_int
> netisr_get_cpucount(void)
> {
>
>        return (nws_count);
> }
>
> but you do not use it! that breaks encapsulation.

This is a public symbol for use outside of the netisr framework -- for
example, in the uncommitted RSS code.
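
A hypothetical out-of-tree consumer might use it along these lines;
everything below other than netisr_get_cpucount() itself is invented for
the sake of the example:

#include <sys/param.h>
#include <sys/malloc.h>
#include <net/netisr.h>

/*
 * Hypothetical: size a per-worker state table to match the number of
 * netisr workers actually running, without poking at netisr internals.
 */
static void *
example_alloc_per_worker_state(size_t entry_size)
{
        u_int nworkers = netisr_get_cpucount();

        return (malloc(nworkers * entry_size, M_DEVBUF, M_WAITOK | M_ZERO));
}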

> Also I want to ask you: please help me find documentation about netisr
> scheduling and the full packet flow through the kernel:
> packet input -> kernel -> packet output
> but with more description of what is going on with a packet while it is
> passing through the router.

Unfortunately, this code is currently largely self-documenting. The
Stevens books are getting quite outdated, as are McKusick/Neville-Neil --
however, they at least offer structural guides which may be of use to
you. Refreshes of these books would be extremely helpful.

Robert


