Date: Wed, 28 Aug 2013 12:37:10 -0700
From: Jack Vogel <jfvogel@gmail.com>
To: "Alexander V. Chernikov" <melifaro@yandex-team.ru>
Cc: Adrian Chadd <adrian@freebsd.org>, Andre Oppermann <andre@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>, FreeBSD Net <net@freebsd.org>, Luigi Rizzo <luigi@freebsd.org>, "Andrey V. Elsukov" <ae@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: Network stack changes
Message-ID: <CAFOYbcnbcp4z60SeDXTQ%2BacPGC55DCYfhZZuRvHvu7HhyWTang@mail.gmail.com>
In-Reply-To: <521E41CB.30700@yandex-team.ru>
References: <521E41CB.30700@yandex-team.ru>
Very interesting material Alexander, only had time to glance at it now, will look in more depth later, thanks!

Jack

On Wed, Aug 28, 2013 at 11:30 AM, Alexander V. Chernikov <melifaro@yandex-team.ru> wrote:
> Hello list!
>
> There are constantly recurring discussions about networking stack performance and possible changes.
>
> I'll try to summarize the current problems and possible solutions from my point of view.
> (Generally this is one problem: the stack is slooooooooooooooooooooooooooow, but we need to know why and what to do about it.)
>
> Let's start with the current IPv4 packet flow on a typical router:
> http://static.ipfw.ru/images/freebsd_ipv4_flow.png
>
> (I'm sorry I can't provide this as text since Visio doesn't have any 'ascii-art' exporter.)
>
> Note that we use a process-to-completion model, i.e. any packet is processed in the ISR until it is either consumed by the L4+ stack, dropped, or put on an egress NIC queue.
>
> (There is also a deferred ISR model implemented inside netisr, but it does not change much: it can help to do more fine-grained hashing (for GRE or other similar traffic), but
> 1) it uses per-packet mutex locking, which kills all performance, and
> 2) it currently does not have _any_ hashing functions (see the absence of flags in `netstat -Q`).
> People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or a modified PPPoE/GRE version) report some benefit, but without fixing (1) it can't help much.)
>
> So, let's start:
>
> 1) ixgbe uses a mutex to protect each RX ring, which is perfectly fine since there is nearly no contention (the only thing that can happen is driver reconfiguration, which is rare; more significantly, we take the lock once for the whole batch of packets received in a given interrupt). However, due to some (im)possible deadlocks the current code does a per-packet ring unlock/lock (see ixgbe_rx_input()).
> There was a discussion that ended with nothing: http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
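> To illustrate the per-packet unlock issue (a simplified pseudo-driver sketch, not the actual ixgbe code; struct rx_ring, rx_mtx, next_rx_mbuf(), if_input_pkt() and MAX_BATCH are all made-up names, only mtx_lock()/mtx_unlock() are the real mutex(9) calls):
>
>     /* current shape: the ring lock is dropped and retaken around every
>      * single packet handed to the stack */
>     static void
>     rx_per_packet(struct rx_ring *rxr, struct ifnet *ifp)
>     {
>             struct mbuf *m;
>
>             mtx_lock(&rxr->rx_mtx);
>             while ((m = next_rx_mbuf(rxr)) != NULL) {
>                     mtx_unlock(&rxr->rx_mtx);       /* per-packet unlock */
>                     if_input_pkt(ifp, m);           /* runs the whole stack */
>                     mtx_lock(&rxr->rx_mtx);         /* per-packet lock */
>             }
>             mtx_unlock(&rxr->rx_mtx);
>     }
>
>     /* alternative: collect the batch under one lock hold, then push it
>      * up the stack with the ring lock already released */
>     static void
>     rx_batched(struct rx_ring *rxr, struct ifnet *ifp)
>     {
>             struct mbuf *batch[MAX_BATCH], *m;
>             int i, n;
>
>             mtx_lock(&rxr->rx_mtx);
>             for (n = 0; n < MAX_BATCH && (m = next_rx_mbuf(rxr)) != NULL; n++)
>                     batch[n] = m;
>             mtx_unlock(&rxr->rx_mtx);
>
>             for (i = 0; i < n; i++)
>                     if_input_pkt(ifp, batch[i]);
>     }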
> 1*) Possible BPF users. Here we take one rlock if there are any readers present (and a mutex for any matching packets), but this is more or less OK. Additionally, there is WIP to implement multiqueue BPF, and there is a chance that we can reduce lock contention there. There is also an "optimize_writers" hack permitting applications like CDP to use BPF as writers without registering them as receivers (which would imply the rlock).
>
> 2/3) Virtual interfaces (lagg, vlans over lagg, and other similar constructions).
> Currently we simply take an rlock to do s/ix0/lagg0/, and, which is even funnier, we use a complex vlan_hash with another rlock to get the vlan interface from the underlying one.
>
> This is definitely not how things should be done, and it can be changed more or less easily.
>
> There are some useful terms/techniques in the world of software/hardware routing: those systems have a clear 'control plane' / 'data plane' separation. The former deals with control traffic (IGP, MLD, IGMP snooping, lagg hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, with options, destined to hosts without an ARP/NDP record, and similar). The latter is done in hardware (or in an efficient software implementation).
> The control plane is responsible for providing the data needed for efficient data-plane operation. This is the point we are missing nearly everywhere.
>
> What I want to say is: lagg is pure control-plane stuff, and vlan is nearly the same. We can't apply this approach to complex cases like lagg-over-vlans-over-vlans-over-(pppoe_ng0 and wifi0), but we definitely can do it for the most common setups (igb* or ix* in a lagg, with or without vlans on top of the lagg).
>
> We already have some capabilities like VLANHWFILTER/VLANHWTAG, and we can add some more. We even have per-driver hooks to program HW filtering.
>
> One small step is to deliver the packet to the vlan interface directly (P1); proof of concept (working in production):
> http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
>
> Another is to change lagg packet accounting:
> http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
> Again, this is more like what HW boxes do (aggregate all counters, including errors), and I can't imagine what real error we could get from _lagg_ itself.
>
> 4) If we are a router, we can either do the slow ip_input() -> ip_forward() -> ip_output() cycle or use the optimized ip_fastfwd(), which falls back to the 'slow' path for multicast/options/local traffic (i.e. it works exactly like the 'data plane' part).
> (Btw, we could consider turning net.inet.ip.fastforwarding on by default, at least for non-IPSEC kernels.)
>
> Here we have to determine whether a packet is local or not, i.e. an F(dst_ip) returning 1 or 0. Currently we simply use the standard rlock plus a hash of interface addresses. (And some consumers like ipfw(4) do the same, but without the lock.)
> We don't need to do this! We can build a sorted array of IPv4 addresses (or some other efficient structure) on every address change and use it unlocked, with delayed garbage collection (proof of concept attached).
> (There is another thing to discuss: maybe we can do this once somewhere in ip_input() and mark the mbuf as 'local/non-local'?)
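> A minimal sketch of the lookup side of the sorted-array idea (userland-style C just to illustrate; struct laddr_table, laddr_cur and in_localip_fast() are made-up names, the real proof of concept is in the attached patch):
>
>     #include <stdint.h>
>     #include <stddef.h>
>
>     struct laddr_table {
>             uint32_t        *addrs;         /* sorted, host byte order */
>             size_t           count;
>     };
>
>     /*
>      * Published pointer. On every address change the control plane builds
>      * a new sorted table and swaps this pointer; the old table is freed
>      * only after a grace period (delayed GC), so readers never take a lock.
>      */
>     static struct laddr_table *laddr_cur;
>
>     static int
>     in_localip_fast(uint32_t dst)
>     {
>             const struct laddr_table *t = laddr_cur;  /* one unlocked read */
>             size_t lo = 0, hi = t->count;
>
>             while (lo < hi) {
>                     size_t mid = lo + (hi - lo) / 2;
>
>                     if (t->addrs[mid] == dst)
>                             return (1);
>                     if (t->addrs[mid] < dst)
>                             lo = mid + 1;
>                     else
>                             hi = mid;
>             }
>             return (0);
>     }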
> 5, 9) Currently we have the L3 ingress/egress PFIL hooks protected by rmlocks. This is OK.
>
> However, 6) and 7) are not.
> The firewall can use the same pfil lock for reader protection without imposing its own lock; the current pfil and ipfw code is ready to do this.
>
> 8) The radix/rt* API. This is probably the worst place in the entire stack. It is too generic, too slow, and buggy (do you use IPv6? then you definitely know what I'm talking about).
> A) It really is too generic, and the assumption that it can be used (efficiently) for every address family is wrong. Two examples:
> We don't need to look up all 128 bits of an IPv6 address. Subnets with masks longer than /64 are not widely used (actually the only reason to use them is p2p links, due to potential ND problems). One common solution is to look up 64 bits and build another trie (or other structure) for the collision case.
> Another example is MPLS, where we can simply do a direct array lookup based on the ingress label.
>
> B) It is terribly slow (AFAIR luigi@ did some performance measurements; the numbers are available in one of the netmap PDFs).
> C) It is not multipath-capable. Stateful (and non-working) multipath is definitely not the right way.
>
> 8*) rtentry.
> We are doing it wrong.
> Currently _every_ lookup locks/unlocks the given rte twice.
> The first lock is related to an old story about trusting IP redirects (and auto-adding host routes for them). Fortunately, this is now disabled automatically when you turn forwarding on.
> The second one is much more complicated: we assume that an rte with a non-zero refcount can stop the egress interface from being destroyed. This is a wrong (but widely relied-upon) assumption.
>
> We can use delayed GC instead of locking for rte's, and this won't break things any more than they are broken now (patch attached).
> We can't do the same for ifp structures since
> a) virtual ones can assume some state in the underlying physical NIC, and
> b) physical ones just _can_ go away (whether the user wants this or not, e.g. an SFP being unplugged from the NIC) or otherwise lead to a kernel crash due to SW/HW inconsistency.
>
> One possible solution is to implement stable refcounts based on PCPU counters and apply those counters to ifp, but this seems to be non-trivial.
>
> Another rtalloc(9) problem is the fact that the radix is used as both the 'control plane' and the 'data plane' structure/API. Some users always want to put more information into the rte, while others want to make the rte more compact. We just need _different_ structures for that:
> a feature-rich, data-heavy control-plane structure (to store everything we want to store, including, for example, the PID of the process originating the route) - the current radix can be modified to do this -
> and another, address-family-dependent structure (array, trie, or anything else) which contains _only_ the data necessary to put the packet on the wire.
>
> 11) arpresolve. Currently (this was decoupled in 8.x) we take
> a) the ifaddr rlock and
> b) the lle rlock.
>
> We don't need those locks.
> We need to
> a) make the lle layer per-interface instead of global (this can also solve the issue of multiple FIBs having their L2 mappings done in fib 0),
> b) use the rtalloc(9)-provided lock instead of separate locking, and
> c) actually, rewrite this layer entirely, because
> d) lle is actually the place to do real multipath:
>
> Briefly, you have an rte pointing to a special nexthop structure pointing to the lle, which holds the following data:
> number of egress interfaces: [ifindex1, ifindex2, ifindex3] | L2 data to prepend to the header
> A separate post will follow; a rough sketch of the structure is below.
>
> With this, we can achieve lagg traffic distribution without actually using lagg_transmit() and similar stuff (at least in the most common scenarios).
> (For example, TCP output can definitely benefit from this, since we can compute the flowid once per TCP session and use it in every mbuf.)
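> Roughly the shape of data I have in mind for that nexthop structure (a sketch only; the struct and field names are made up, not a final design):
>
>     #include <stdint.h>
>
>     #define NH_MAX_PATHS    4
>     #define NH_MAX_L2LEN    32      /* ethernet header + vlan tags */
>
>     struct nexthop {
>             uint8_t         nh_npaths;              /* number of egress ifaces */
>             uint16_t        nh_ifidx[NH_MAX_PATHS]; /* ifindex1, ifindex2, ... */
>             uint8_t         nh_l2len;               /* bytes to prepend */
>             uint8_t         nh_l2data[NH_MAX_L2LEN];/* prebuilt L2 header */
>     };
>
> The data path would pick nh_ifidx[flowid % nh_npaths] and prepend nh_l2data, so neither lagg_transmit() nor a separate lle lookup is needed for the common cases.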
> So. Imagine we have done all this. How can we estimate the difference?
>
> There was a thread, started a year ago, describing 'stock' performance and the difference various modifications make.
> It was done on 8.x, but I've got similar results on recent 9.x:
>
> http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html
>
> Briefly:
>
> 2 x E5645, Intel 82599 NIC.
> Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, no firewall.
> Ixia XM2 (traffic generator) <-> ix0 (FreeBSD). Ixia sends 64-byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations in vlan11 (10.100.1.128 - 10.100.1.192). Static ARP entries are configured for all destination addresses. The traffic level is slightly above or slightly below what the system can forward.
>
> We start at 1.4 MPPS (if we use several routes to minimize mutex contention).
>
> My 'current' result for the same test, on the same HW, with the following modifications:
>
> * 1) ixgbe per-packet ring unlock removed
> * P1) ixgbe modified to do direct vlan input (so 2 and 3 are not used)
> * 4) separate lockless in_localip() version
> * 6) using the existing pfil lock
> * 7) using the lockless version
> * 8) radix converted to use an rmlock instead of an rlock; delayed GC used instead of mutexes
> * 10) using the existing pfil lock
> * 11) using the radix lock to do arpresolve(); not using the lle rlock
>
> (so the rmlocks are the only locks used on the data path).
>
> Additionally, ipstat counters are converted to PCPU (no real performance implications).
> ixgbe does not do per-packet accounting (as it does in head).
> if_vlan counters are converted to PCPU.
> lagg is converted to rmlock; per-packet accounting is removed (using the stats from the underlying interfaces).
> The lle hash size is bumped to 1024 instead of 32 (not relevant for this test, but the small default slows things down for large L2 domains).
>
> The result is 5.6 MPPS for a single port (11 cores) and 6.5 MPPS for lagg (16 cores), and nearly the same with HT on and 22 cores.
>
> ... while Intel DPDK claims 80 MPPS (and 6WINDGate talks about 160 or so) on the same class of hardware, with _userland_ forwarding.
>
> One of the key features making all such products possible (DPDK, netmap, PacketShader, Cisco SW forwarding) is the use of batching instead of a process-to-completion model. Batching mitigates the locking cost, batching does not wash out the CPU cache, and so on.
>
> So maybe we can consider passing batches from the NIC to at least the L2 layer with netisr? Or even up to ip_input()?
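> Something along these lines (a sketch with made-up names, just to show that the unit passed between layers becomes a batch of mbufs instead of a single packet):
>
>     struct mbuf;
>     struct ifnet;
>
>     struct pkt_batch {
>             int              pb_count;
>             struct mbuf     *pb_pkts[32];   /* filled by the driver RX loop */
>     };
>
>     /* the driver hands the whole batch over once per interrupt */
>     void    ether_input_batch(struct ifnet *ifp, struct pkt_batch *pb);
>     /* ... and the batch could be pushed further, up to ip_input() */
>     void    ip_input_batch(struct ifnet *ifp, struct pkt_batch *pb);
>
> That way the per-call costs (locks, cache misses on shared state) are paid once per batch instead of once per packet.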
> Another question is about making some sort of reliable GC (something like "passive serialization" or the other hard-to-pronounce techniques used for lockless objects in Linux).
>
> P.S. The attached patches are 1) for 8.x and 2) mostly 'hacks' showing roughly how this can be done and what benefit can be achieved.
>
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"