Date: Mon, 30 Jan 2012 10:28:24 +0000
From: "Robert N. M. Watson" <rwatson@FreeBSD.org>
To: Коньков Евгений <kes-kes@yandex.ru>
Cc: freebsd-bugs@FreeBSD.org, bz@FreeBSD.org
Subject: Re: misc/164130: broken netisr initialization
Message-ID: <A8A57BF5-3EF7-43A3-8106-ED93A82C71F1@FreeBSD.org>
In-Reply-To: <154594163.20120117194113@yandex.ru>
References: <201201142126.q0ELQVbZ087496@freefall.freebsd.org> <68477246.20120115000025@yandex.ru> <737885D7-5DC2-4A0D-A5DF-4A380D035648@FreeBSD.org> <154594163.20120117194113@yandex.ru>
On 17 Jan 2012, at 17:41, Коньков Евгений wrote:

> Loads only netisr3.
> and question: ip works over ethernet. How can you distinguish ip and ether???

netstat -Q shows you per-protocol (layer) processing statistics. An IP packet arriving via ethernet will typically be counted twice: once for ethernet input/decapsulation, and once for IP-layer processing. Netisr dispatch serves a number of purposes, not least preventing excessive stack depth/recursion and load balancing.

There has been a historic tension between deferred (queued) dispatch to a separate worker and direct dispatch ("process to completion"). The former offers more opportunities for parallelism and reduces latency during interrupt-layer processing. The latter, however, reduces overhead and overall packet latency when parallelism is already available at a higher level (e.g. multiple input queues and ithreads), by avoiding queueing/scheduling overheads and by avoiding packet migration between caches, which reduces cache coherency traffic. Our general experience is that many common configurations, especially lower-end systems *and* systems with multi-queue 10gbps cards, prefer direct dispatch. However, there are forwarding scenarios, or ones in which CPU count significantly outnumbers NIC input queue count, where queueing to additional workers can markedly improve performance.

In FreeBSD 9.0 we've attempted to improve the vocabulary of expressible policies in netisr so that we can explore which work best in various scenarios, giving users more flexibility but also attempting to determine a better longer-term model. Ideally, as with the VM system, these features would be to some extent self-tuning, but we don't have enough information and experience to decide how best to do that yet.

> NETISR_POLICY_FLOW   netisr should maintain flow ordering as defined by
>                      the mbuf header flow ID field.  If the protocol
>                      implements nh_m2flow, then netisr will query the
>                      protocol in the event that the mbuf doesn't have a
>                      flow ID, falling back on source ordering.
>
> NETISR_POLICY_CPU    netisr will entirely delegate all work placement
>                      decisions to the protocol, querying nh_m2cpuid for
>                      each packet.
>
> _FLOW: the description says that the cpuid is discovered by flow.
> _CPU: here the decision to choose a CPU is delegated to the protocol. Maybe
> it would be clearer to name it NETISR_POLICY_PROTO ???

The name has to do with the nature of the information returned by the netisr protocol handler -- in the former case, the protocol returns a flow identifier, which is used by netisr to calculate an affinity. In the latter case, the protocol returns a CPU affinity directly.
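To make the shape of the KPI concrete, registering a protocol against the 9.x netisr(9) interface looks roughly like the sketch below. This is purely illustrative ("foo", foo_input() and NETISR_FOO are placeholders, not a real protocol), but nh_policy is where the constants discussed above apply:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/netisr.h>

/* Placeholder handler: real protocol input processing would go here. */
static void
foo_input(struct mbuf *m)
{

        m_freem(m);
}

static struct netisr_handler foo_nh = {
        .nh_name = "foo",
        .nh_handler = foo_input,
        .nh_proto = NETISR_FOO,         /* placeholder protocol number */
        .nh_qlimit = 256,               /* per-workstream queue limit */
        /*
         * POLICY_FLOW: netisr orders work by the mbuf flow ID, asking
         * nh_m2flow (if implemented) when a packet has none.
         * POLICY_CPU would instead ask nh_m2cpuid for a CPU directly.
         */
        .nh_policy = NETISR_POLICY_FLOW,
};

static void
foo_init(void)
{

        netisr_register(&foo_nh);
}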
> and BIG QUESTION: why do you allow somebody (flow, proto) to make any
> decisions??? That is wrong: a bad implementation/decision on their part
> may cause packets to be scheduled only to some CPU.
> So one CPU will be overloaded (0% idle) while others are free (100% idle).

I think you're confusing policy and mechanism. The above KPIs are about providing the mechanism to implement a variety of policies. Many of the policies we are interested in are not yet implemented, or are available only as patches. Keep in mind that workloads and systems are highly variable, with variable costs for work dispatch, etc. We run on high-end Intel servers, where individual CPUs tend to be very powerful but not all that plentiful, but also on embedded multi-threaded MIPS devices with many threads, each individually quite weak. Deferred dispatch is a better choice for the latter, where there are optimised handoff primitives to help avoid queueing overhead, whereas in the former case you really want NIC-backed work dispatch, which will generally mean direct dispatch with multiple ithreads (one per queue) rather than multiple netisr threads. Using deferred dispatch in Intel-style environments is generally unproductive, since high-end configurations will support multi-queue input already, and CPUs are quite powerful.

>> * Enforcing ordering limits the opportunity for concurrency, but maintains
>> * the strong ordering requirements found in some protocols, such as TCP.
> TCP does not require strong ordering!!! Maybe you mean UDP?

I think most people would disagree with this. Reordering TCP segments leads to extremely poor TCP behaviour -- there is an extensive research literature on this, and maintaining ordering for TCP flows is a critical network stack design goal.

> To get full concurrency you must put a new flowid to a free CPU and
> remember the cpuid for that flow.

Stateful assignment of flows to CPUs is of significant interest to us, although currently we only support hash-based assignment without state. In large part, that decision is a good one, as multi-queue network cards are highly variable in terms of the size of their state tables for offloading flow-specific affinity policies. For example, lower-end 10gbps cards may support state tables with 32 entries. High-end cards may support state tables with tens of thousands of entries.

> Just hash the packet flow to the number of threads: net.isr.numthreads
>     nws_array[flowid] = hash( flowid, sourceid, ifp->if_index, source )
>     if( cpuload( nws_array[flowid] ) > 99 )
>         nws_array[flowid]++; // queue packet to other CPU
>
> that would be just ten lines of code instead of 50 in your case.

We support a more complex KPI because we need to support future policies that are more complex. For example, there are out-of-tree changes that align TCP-level and netisr-level per-CPU data structures and affinity with NIC RSS support. The algorithm you've suggested above explicitly introduces reordering, which would significantly damage network performance, even though it appears to balance CPU load better.
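To make the ordering concern concrete, here is a rough sketch (illustrative only: NWORKERS, cpu_load() and enqueue() are invented stand-ins, not kernel interfaces) contrasting pure hash placement with the load-based variant suggested above:

#include <stdint.h>

#define NWORKERS        4
#define BUSY_THRESHOLD  99

unsigned cpu_load(unsigned worker);         /* hypothetical load metric */
void enqueue(unsigned worker, void *pkt);   /* hypothetical per-worker queue */

/*
 * Hash placement: every packet of a flow maps to the same worker, so
 * per-flow ordering is preserved.
 */
static void
dispatch_by_flow(uint32_t flowid, void *pkt)
{

        enqueue(flowid % NWORKERS, pkt);
}

/*
 * Load-based variant: when the chosen worker looks busy, shift the flow
 * to the next one.  Packets of that flow still queued on the old worker
 * can now be processed after packets handed to the new worker, i.e. the
 * flow is reordered.
 */
static void
dispatch_by_load(uint32_t flowid, void *pkt)
{
        unsigned worker = flowid % NWORKERS;

        if (cpu_load(worker) > BUSY_THRESHOLD)
                worker = (worker + 1) % NWORKERS;
        enqueue(worker, pkt);
}

Even stateful pinning (remembering the new CPU for that flow) only narrows the window: packets already queued on the old CPU can still be overtaken while the flow migrates.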
> Also notice you have:
> /*
>  * Utility routines for protocols that implement their own mapping of flows
>  * to CPUs.
>  */
> u_int
> netisr_get_cpucount(void)
> {
>
>         return (nws_count);
> }
>
> but you do not use it! that breaks encapsulation.

This is a public symbol for use outside of the netisr framework -- for example, in the uncommitted RSS code.

> Also I want to ask you: please help me find documentation
> about netisr scheduling and the full packet flow through the kernel:
> packetinput->kernel->packetoutput
> but with more description of what is going on with a packet while it is
> passing through a router.

Unfortunately, this code is currently largely self-documenting. The Stevens books are getting quite outdated, as are McKusick/Neville-Neil -- however, they at least offer structural guides which may be of use to you. Refreshes of these books would be extremely helpful.

Robert