From owner-freebsd-arch@FreeBSD.ORG Wed Aug 28 19:37:12 2013
Date: Wed, 28 Aug 2013 12:37:10 -0700
From: Jack Vogel <jfvogel@gmail.com>
To: "Alexander V. Chernikov" <melifaro@yandex-team.ru>
Cc: Adrian Chadd, Andre Oppermann, FreeBSD Hackers, FreeBSD Net,
    Luigi Rizzo, "Andrey V. Elsukov", Gleb Smirnoff, freebsd-arch@freebsd.org
Subject: Re: Network stack changes

Very interesting material Alexander, only had time to glance at it now,
will look in more depth later, thanks!

Jack


On Wed, Aug 28, 2013 at 11:30 AM, Alexander V. Chernikov <
melifaro@yandex-team.ru> wrote:

> Hello list!
>
> There are a lot of constantly arising discussions related to networking
> stack performance/changes.
>
> I'll try to summarize the current problems and possible solutions from my
> point of view.
> (Generally this is one problem: the stack is slooooooooooooooooooooooooooow,
> but we need to know why and what to do about it.)
>
> Let's start with the current IPv4 packet flow on a typical router:
> http://static.ipfw.ru/images/freebsd_ipv4_flow.png
>
> (I'm sorry I can't provide this as text since Visio doesn't have any
> 'ascii-art' exporter.)
>
> Note that we are using a process-to-completion model, i.e. we process any
> packet in the ISR until it is either consumed by the L4+ stack, dropped,
> or put on an egress NIC queue.
>
> (There is also a deferred ISR model implemented inside netisr, but it does
> not change much:
> it can help to do more fine-grained hashing (for GRE or other similar
> traffic), but
> 1) it uses per-packet mutex locking, which kills all performance
> 2) it currently does not have _any_ hashing functions (see the absence of
> flags in `netstat -Q`).
> People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or a
> modified PPPoE/GRE version) report some profit, but without fixing (1) it
> can't help much.)
>
> So, let's start:
>
> 1) ixgbe uses a mutex to protect each RX ring, which is perfectly fine
> since there is nearly no contention
> (the only thing that can happen is driver reconfiguration, which is rare
> and, more significantly, is done once for the whole batch of packets
> received in a given interrupt). However, due to some (im)possible deadlocks
> the current code does a per-packet ring unlock/lock (see ixgbe_rx_input()).
> There was a discussion that ended with nothing:
> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
>
> 1*) Possible BPF users. Here we have one rlock if there are any readers
> present
> (and a mutex for any matching packets, but this is more or less OK.
> Additionally, there is WIP to implement multiqueue BPF, and there is a
> chance that we can reduce lock contention there). There is also an
> "optimize_writers" hack permitting applications like CDP to use BPF as
> writers without registering them as receivers (which would imply the
> rlock).
>
> 2/3) Virtual interfaces (laggs/vlans over lagg and other similar
> constructions).
> Currently we simply use an rlock to do s/ix0/lagg0/ and, what is much
> funnier, we use a complex vlan_hash with another rlock to get the vlan
> interface from the underlying one.
>
> This is definitely not how things should be done, and it can be changed
> more or less easily.
>
> There are some useful terms/techniques in the world of software/hardware
> routing: they have a clear 'control plane' and 'data plane' separation.
> The former deals with control traffic (IGP, MLD, IGMP snooping, lagg
> hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, with
> options, destined to hosts without an ARP/NDP record, and similar). The
> latter is done in hardware (or in an efficient software implementation).
> The control plane is responsible for providing the data needed for
> efficient data plane operation. This is the point we are missing nearly
> everywhere.
>
> What I want to say is: lagg is pure control-plane stuff, and vlan is
> nearly the same. We can't apply this approach to complex cases like
> lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0),
> but we definitely can do it for the most common setups like (igb* or ix*
> in a lagg, with or without vlans on top of the lagg).
>
> We already have some capabilities like VLANHWFILTER/VLANHWTAG; we can add
> some more. We even have per-driver hooks to program HW filtering.
>
> One small step is to throw the packet to the vlan interface directly (P1),
> proof-of-concept (working in production), with an illustrative sketch of
> the idea further below:
> http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
>
> Another is to change lagg packet accounting:
> http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
> Again, this is more like what HW boxes do (aggregate all counters,
> including errors) (and I can't imagine what real error we could get from
> _lagg_).
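>
> To make the control-plane/data-plane split concrete for the vlan case,
> here is a minimal userland-style sketch (hypothetical names, not the P1
> patch): the control plane fills a per-parent table indexed by vlan id
> whenever a vlan interface is created or destroyed, and the data path
> resolves the destination interface with a single unlocked array read
> instead of a vlan_hash lookup under an rlock. In the kernel the store/load
> would need proper memory ordering and the table would be reclaimed via
> deferred GC when the parent goes away.
>
> #include <stddef.h>
> #include <stdio.h>
>
> #define VLAN_MAX 4096
>
> /* Stand-in for struct ifnet; illustrative only. */
> struct iface {
>     const char *name;
> };
>
> /* Per-parent demux table, written only by the control plane. */
> struct vlan_demux {
>     struct iface *vlans[VLAN_MAX];  /* indexed by vlan id */
> };
>
> /*
>  * Control plane: called on vlan creation/destruction (serialized by its
>  * own lock in a real implementation).
>  */
> static void
> vlan_demux_set(struct vlan_demux *d, int vid, struct iface *ifp)
> {
>     d->vlans[vid] = ifp;            /* single pointer store */
> }
>
> /* Data plane: one unlocked array read per packet. */
> static struct iface *
> vlan_demux_lookup(const struct vlan_demux *d, int vid)
> {
>     return (d->vlans[vid]);
> }
>
> int
> main(void)
> {
>     static struct vlan_demux ix0_demux;
>     struct iface vlan10 = { "vlan10" };
>     struct iface *dst;
>
>     vlan_demux_set(&ix0_demux, 10, &vlan10);
>     dst = vlan_demux_lookup(&ix0_demux, 10);
>     printf("vid 10 -> %s\n", dst != NULL ? dst->name : "(parent)");
>     return (0);
> }
>
> The same 'prepare on change, read without locks' pattern is what the
> lagg/vlan cases above really need.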
>
> 4) If we are a router, we can either do the slooow ip_input() ->
> ip_forward() -> ip_output() cycle, or use the optimized ip_fastfwd(),
> which falls back to the 'slow' path for multicast/options/local traffic
> (i.e. it works exactly like the 'data plane' part).
> (Btw, we can consider turning net.inet.ip.fastforwarding on by default,
> at least for non-IPSEC kernels.)
>
> Here we have to determine whether a packet is local or not, i.e. F(dst_ip)
> returning 1 or 0. Currently we simply use the standard rlock + hash of
> iface addresses.
> (And some consumers like ipfw(4) do the same, but without the lock.)
> We don't need to do this! We can build a sorted array of IPv4 addresses,
> or some other efficient structure, on every address change and use it
> unlocked with delayed garbage collection (proof-of-concept attached; an
> illustrative sketch also appears further below).
> (There is another thing to discuss: maybe we can do this once somewhere
> in ip_input() and mark the mbuf as 'local/non-local'?)
>
> 5, 9) Currently we have L3 ingress/egress PFIL hooks protected by rmlocks.
> This is OK.
>
> However, 6) and 7) are not.
> The firewall can use the same pfil lock as reader protection without
> imposing its own lock. The current pfil & ipfw code is ready to do this.
>
> 8) The radix/rt* API. This is probably the worst place in the entire
> stack. It is toooo generic, tooo slow and buggy (do you use IPv6? then you
> definitely know what I'm talking about).
> A) It really is too generic, and the assumption that it can be
> (effectively) used for every family is wrong. Two examples:
> We don't need to look up all 128 bits of an IPv6 address. Subnets with
> masks longer than /64 are not widely used (actually the only reason to use
> them is p2p links, due to potential ND problems).
> One common solution is to look up 64 bits and build another trie (or other
> structure) for the collision case.
> Another example is MPLS, where we can simply do a direct array lookup
> based on the ingress label.
>
> B) It is terribly slow (AFAIR luigi@ did some performance measurements;
> the numbers are available in one of the netmap PDFs).
> C) It is not multipath-capable. Stateful (and non-working) multipath is
> definitely not the right way.
>
> 8*) rtentry
> We are doing it wrong.
> Currently _every_ lookup locks/unlocks the given rte twice.
> The first lock is related to an old-old story about trusting IP redirects
> (and auto-adding host routes for them). Fortunately, it is now disabled
> automatically when you turn forwarding on.
> The second one is much more complicated: we assume that an rte with a
> non-zero refcount can keep the egress interface from being destroyed.
> This is a wrong (but widely used) assumption.
>
> We can use delayed GC instead of locking for rte's, and this won't break
> things more than they are broken now (patch attached).
> We can't do the same for ifp structures since
> a) virtual ones can assume some state in the underlying physical NIC
> b) physical ones just _can_ be destroyed (regardless of whether the user
> wants this or not, e.g. an SFP being unplugged from the NIC) or can simply
> lead to a kernel crash due to SW/HW inconsistency.
>
> One possible solution is to implement stable refcounts based on per-CPU
> counters and apply those counters to ifp, but this seems to be
> non-trivial.
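>
> Just to show the basic shape of such per-CPU refcounting, here is a very
> rough userland sketch (array slots standing in for CPUs; it deliberately
> skips the hard part, namely deciding when a summed value of zero can be
> trusted while new references may still appear, which is exactly what makes
> this non-trivial for ifp):
>
> #include <stdatomic.h>
> #include <stdint.h>
> #include <stdio.h>
>
> #define NSLOTS 16   /* stand-in for MAXCPU */
>
> /*
>  * Per-slot counters: acquire/release touch only the local slot, so the
>  * hot path never bounces a shared counter between CPUs (a real
>  * implementation would also pad each slot to its own cache line).
>  */
> struct pcpu_ref {
>     _Atomic int64_t cnt[NSLOTS];
> };
>
> static void
> pcpu_ref_acquire(struct pcpu_ref *r, int slot)
> {
>     atomic_fetch_add_explicit(&r->cnt[slot], 1, memory_order_relaxed);
> }
>
> static void
> pcpu_ref_release(struct pcpu_ref *r, int slot)
> {
>     atomic_fetch_sub_explicit(&r->cnt[slot], 1, memory_order_relaxed);
> }
>
> /* Slow path (e.g. interface destruction): sum all slots. */
> static int64_t
> pcpu_ref_count(struct pcpu_ref *r)
> {
>     int64_t sum = 0;
>     int i;
>
>     for (i = 0; i < NSLOTS; i++)
>         sum += atomic_load_explicit(&r->cnt[i], memory_order_relaxed);
>     return (sum);
> }
>
> int
> main(void)
> {
>     static struct pcpu_ref ref;
>
>     pcpu_ref_acquire(&ref, 0);
>     pcpu_ref_acquire(&ref, 3);
>     pcpu_ref_release(&ref, 0);
>     printf("refs: %lld\n", (long long)pcpu_ref_count(&ref));
>     return (0);
> }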
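>
> And, going back to point 4 above, a minimal sketch of the unlocked
> local-address check (again userland-style, hypothetical names, not the
> attached patch): the control plane rebuilds a sorted array on every
> address change and publishes it with a single pointer swap; readers do a
> plain binary search; in the kernel the old snapshot would be freed by
> delayed GC once readers are known to have drained.
>
> #include <arpa/inet.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> /* Immutable snapshot of local IPv4 addresses (host byte order, sorted). */
> struct laddr_table {
>     size_t   count;
>     uint32_t addrs[];
> };
>
> /*
>  * Current snapshot; in the kernel this would be published with a release
>  * store and old snapshots would be freed via delayed GC.
>  */
> static struct laddr_table *laddr_snap;
>
> static int
> addr_cmp(const void *a, const void *b)
> {
>     uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
>
>     return ((x > y) - (x < y));
> }
>
> /* Control plane: rebuild and publish a new snapshot on address change. */
> static void
> laddr_rebuild(const uint32_t *addrs, size_t count)
> {
>     struct laddr_table *t;
>
>     t = malloc(sizeof(*t) + count * sizeof(uint32_t));
>     if (t == NULL)
>         return;
>     t->count = count;
>     memcpy(t->addrs, addrs, count * sizeof(uint32_t));
>     qsort(t->addrs, count, sizeof(uint32_t), addr_cmp);
>     laddr_snap = t;     /* the old snapshot would go to delayed GC */
> }
>
> /* Data plane: lockless check, just a binary search over the snapshot. */
> static int
> in_localip_fast(uint32_t dst)
> {
>     const struct laddr_table *t = laddr_snap;
>
>     if (t == NULL)
>         return (0);
>     return (bsearch(&dst, t->addrs, t->count, sizeof(uint32_t),
>         addr_cmp) != NULL);
> }
>
> int
> main(void)
> {
>     uint32_t locals[] = {
>         ntohl(inet_addr("10.100.0.1")),
>         ntohl(inet_addr("10.100.1.1")),
>     };
>
>     laddr_rebuild(locals, 2);
>     printf("10.100.1.1 local: %d\n",
>         in_localip_fast(ntohl(inet_addr("10.100.1.1"))));
>     printf("10.100.2.1 local: %d\n",
>         in_localip_fast(ntohl(inet_addr("10.100.2.1"))));
>     return (0);
> }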
>
> Another rtalloc(9) problem is the fact that the radix is used as both the
> 'control plane' and the 'data plane' structure/API. Some users always want
> to put more information into the rte, while others want to make the rte
> more compact. We simply need _different_ structures for that:
> a feature-rich, lots-of-data control plane one (to store everything we
> want to store, including, for example, the PID of the process that
> originated the route) - the current radix can be modified to do this -
> and a separate, address-family-dependent structure (array, trie, or
> anything else) which contains _only_ the data necessary to put the packet
> on the wire.
>
> 11) arpresolve. Currently (this was decoupled in 8.x) we have
> a) an ifaddr rlock
> b) an lle rlock.
>
> We don't need those locks.
> We need to
> a) make the lle layer per-interface instead of global (this can also solve
> the issue of multiple FIBs and L2 mappings being done in fib 0)
> b) use the rtalloc(9)-provided lock instead of separate locking
> c) actually, rewrite this layer, because
> d) the lle is actually the place to do real multipath:
>
> Briefly: you have an rte pointing to some special nexthop structure
> pointing to an lle, which holds the following data:
> num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to prepend
> to the header.
> A separate post will follow.
>
> With this, we can achieve lagg traffic distribution without actually using
> lagg_transmit and similar stuff (at least in the most common scenarios).
> (For example, TCP output can definitely benefit from this, since we can
> compute the flowid once per TCP session and use it in every mbuf.)
>
>
> So. Imagine we have done all this. How can we estimate the difference?
>
> There was a thread, started a year ago, describing 'stock' performance and
> the difference made by various modifications.
> It was done on 8.x; however, I've got similar results on recent 9.x:
>
> http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html
>
> Briefly:
>
> 2 x E5645, Intel 82599 NIC.
> Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
> no firewall.
> Ixia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends 64-byte IP
> packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations in vlan11
> (10.100.1.128 - 10.100.1.192). Static ARP entries are configured for all
> destination addresses. The traffic level is slightly above or slightly
> below system performance.
>
> We start from 1.4 MPPS (if we are using several routes to minimize mutex
> contention).
>
> My 'current' result for the same test, on the same HW, with the following
> modifications:
>
> * 1) ixgbe per-packet ring unlock removed
> * P1) ixgbe modified to do direct vlan input (so 2, 3 are not used)
> * 4) separate lockless in_localip() version
> * 6) using the existing pfil lock
> * 7) using a lockless version
> * 8) radix converted to use an rmlock instead of an rlock; delayed GC is
>   used instead of mutexes
> * 10) using the existing pfil lock
> * 11) using the radix lock to do arpresolve(); not using the lle rlock
>
> (so the rmlocks are the only locks used on the data path).
>
> Additionally:
> ipstat counters are converted to PCPU (no real performance implications);
> ixgbe does not do per-packet accounting (as in head);
> if_vlan counters are converted to PCPU;
> lagg is converted to rmlock, per-packet accounting is removed (using stats
> from the underlying interfaces);
> the lle hash size is bumped to 1024 instead of 32 (not applicable here,
> but the default of 32 slows things down for large L2 domains).
>
> The result is 5.6 MPPS for a single port (11 cores) and 6.5 MPPS for lagg
> (16 cores), nearly the same with HT on and 22 cores.
>
> ..
> while Intel DPDK claims 80 MPPS (and 6WINDGate talks about 160 or so) on
> the same class of hardware, with _userland_ forwarding.
>
> One of the key features making all such products possible (DPDK, netmap,
> PacketShader, Cisco SW forwarding) is the use of batching instead of the
> process-to-completion model.
> Batching mitigates locking cost, batching does not wash out the CPU cache,
> and so on. (A tiny sketch of the idea follows below.)
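>
> This is roughly what such a batched hand-off could look like (hypothetical
> names, not an existing FreeBSD interface): the driver collects a batch of
> packets from the RX ring and passes the whole array up in one call, so
> per-packet costs such as lock acquisition and shared-state cache misses
> are paid once per batch.
>
> #include <stdio.h>
>
> #define RX_BATCH 32
>
> /* Stand-in for an mbuf; illustrative only. */
> struct pkt {
>     int id;
> };
>
> /*
>  * Upper layer takes a whole batch: any lock it needs is taken once per
>  * (up to) RX_BATCH packets instead of once per packet.
>  */
> static void
> ether_input_batch(struct pkt **pkts, int count)
> {
>     int i;
>
>     /* lock once, process 'count' packets, unlock once */
>     for (i = 0; i < count; i++)
>         printf("processing packet %d\n", pkts[i]->id);
> }
>
> /* Driver RX loop: collect up to RX_BATCH descriptors, then pass them up. */
> static void
> rx_ring_drain(struct pkt *ring, int nready)
> {
>     struct pkt *batch[RX_BATCH];
>     int i, n = 0;
>
>     for (i = 0; i < nready; i++) {
>         batch[n++] = &ring[i];
>         if (n == RX_BATCH) {
>             ether_input_batch(batch, n);
>             n = 0;
>         }
>     }
>     if (n > 0)
>         ether_input_batch(batch, n);
> }
>
> int
> main(void)
> {
>     struct pkt ring[5] = { {1}, {2}, {3}, {4}, {5} };
>
>     rx_ring_drain(ring, 5);
>     return (0);
> }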
>
> So maybe we can consider passing batches from the NIC to at least the L2
> layer with netisr? Or even up to ip_input()?
>
> Another question is about making some sort of reliable GC, like "passive
> serialization" or other similarly hard-to-pronounce words about Linux and
> lockless objects.
>
>
> P.S. The attached patches are 1) for 8.x and 2) mostly 'hacks' showing
> roughly how this can be done and what benefit can be achieved.
>
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"