From owner-freebsd-arch@FreeBSD.ORG Wed Aug 28 18:30:46 2013
From: "Alexander V. Chernikov" <melifaro@yandex-team.ru>
Date: Wed, 28 Aug 2013 22:30:35 +0400
To: FreeBSD Net, freebsd-hackers@freebsd.org, freebsd-arch@freebsd.org
Cc: ae@FreeBSD.org, adrian@freebsd.org, Gleb Smirnoff, andre@freebsd.org, luigi@freebsd.org
Subject: Network stack changes
Message-ID: <521E41CB.30700@yandex-team.ru>

Hello list!

There are constantly recurring discussions about networking stack performance and changes. I'll try to summarize the current problems and possible solutions from my point of view. (Generally this is one problem: the stack is slow, but we need to know why and what to do about it.)

Let's start with the current IPv4 packet flow on a typical router: http://static.ipfw.ru/images/freebsd_ipv4_flow.png (I'm sorry I can't provide this as text, since Visio doesn't have an 'ascii-art' exporter).

Note that we are using a process-to-completion model, i.e. we process any packet in the ISR until it is either consumed by the L4+ stack, dropped, or put on an egress NIC queue.
(There is also a deferred ISR model implemented inside netisr, but it does not change much: it can help to do more fine-grained hashing (for GRE or other similar traffic), but 1) it uses per-packet mutex locking, which kills performance, and 2) it currently does not have _any_ hashing functions (see the absence of flags in `netstat -Q`). People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or a modified PPPoE/GRE version) report some profit, but without fixing (1) it can't help much.)

So, let's start:

1) ixgbe uses a mutex to protect each RX ring, which is perfectly fine since there is nearly no contention (the only thing that can happen is driver reconfiguration, which is rare; more significantly, we take the lock once for the whole batch of packets received in a given interrupt). However, due to some (im)possible deadlocks, the current code does a per-packet ring unlock/lock (see ixgbe_rx_input()). There was a discussion that ended with nothing: http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html

1*) Possible BPF users. Here we take one rlock if any readers are present (and a mutex for any matching packets, but this is more or less OK; additionally, there is WIP to implement multiqueue BPF, so there is a chance we can reduce lock contention there). There is also an "optimize_writers" hack permitting applications like CDP to use BPF as writers without registering them as receivers (which would imply the rlock).

2/3) Virtual interfaces (laggs, vlans over lagg, and other similar constructions). Currently we simply take an rlock to do s/ix0/lagg0/ and, what is much funnier, we use a complex vlan_hash with another rlock to get the vlan interface from the underlying one. This is definitely not how things should be done, and it can be changed more or less easily.

There are some useful terms/techniques in the world of software/hardware routing: they have a clear 'control plane' / 'data plane' separation.
The former deals with control traffic (IGP, MLD, IGMP snooping, lagg hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, packets with options, packets destined to hosts without an ARP/NDP record, and similar). The latter is done in hardware (or in an efficient software implementation). The control plane is responsible for providing the data needed for efficient data plane operation. This is the point we are missing nearly everywhere.

What I want to say is: lagg is pure control-plane stuff, and vlan is nearly the same. We can't apply this approach to complex cases like lagg-over-vlans-over-vlans-over-(pppoe_ng0-and-wifi0), but we definitely can do it for the most common setups, like igb* or ix* in a lagg, with or without vlans on top of the lagg. We already have some capabilities like VLANHWFILTER/VLANHWTAG, and we can add more. We even have per-driver hooks to program HW filtering.

One small step is to deliver packets to the vlan interface directly (P1); proof of concept (working in production): http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html

Another is to change lagg packet accounting: http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
Again, this is more like what HW boxes do (aggregate all counters, including errors), and I can't imagine what real error we could get from _lagg_ itself.

4) If we are a router, we can either run the slow ip_input() -> ip_forward() -> ip_output() cycle or use the optimized ip_fastfwd(), which falls back to the 'slow' path for multicast/options/local traffic (i.e. it works exactly like the 'data plane' part). (Btw, we could consider turning net.inet.ip.fastforwarding on by default, at least for non-IPSEC kernels.)

Here we have to determine whether the packet is local or not, i.e. a function F(dst_ip) returning 1 or 0. Currently we simply use the standard rlock plus a hash of interface addresses. (And some consumers like ipfw(4) do the same, but without the lock.) We don't need to do this!
We can build a sorted array of IPv4 addresses (or another efficient structure) on every address change and use it unlocked, with delayed garbage collection (proof of concept attached). (There is another thing to discuss: maybe we can do this once somewhere in ip_input() and mark the mbuf as 'local/non-local'?)

5, 9) Currently we have L3 ingress/egress PFIL hooks protected by rmlocks. This is OK. However, 6) and 7) are not: the firewall could use the same pfil lock for reader protection without imposing its own lock, and the pfil and ipfw code is already prepared to do this.

8) The radix/rt* API. This is probably the worst place in the entire stack. It is too generic, too slow, and buggy (do you use IPv6? then you definitely know what I'm talking about).

A) It really is too generic, and the assumption that it can be used effectively for every family is wrong. Two examples: we don't need to look up all 128 bits of an IPv6 address. Subnets with masks longer than /64 are not widely used (actually, the only reason to use them is p2p links, due to potential ND problems). One common solution is to look up 64 bits and build another trie (or other structure) for the collision case. Another example is MPLS, where we can simply do a direct array lookup based on the ingress label.

B) It is terribly slow (AFAIR luigi@ did some performance measurements; the numbers are available in one of the netmap PDFs).

C) It is not multipath-capable. Stateful (and non-working) multipath is definitely not the right way.

8*) rtentry. We are doing it wrong. Currently _every_ lookup locks/unlocks a given rte twice. The first lock is related to an old, old story about trusting IP redirects (and auto-adding host routes for them); fortunately, this is now disabled automatically when you turn forwarding on. The second one is more complicated: we assume that rtes with a non-zero refcount can keep the egress interface from being destroyed. This is a wrong (but widely relied-upon) assumption.
We can use delayed GC instead of locking for rtes, and this won't break things more than they are broken now (patch attached). We can't do the same for ifp structures, since a) virtual ones can assume some state in the underlying physical NIC, and b) physical ones just _can_ be destroyed (regardless of whether the user wants this, e.g. an SFP being unplugged from the NIC), or such use can simply lead to a kernel crash due to SW/HW inconsistency. One possible solution is to implement stable refcounts based on per-CPU counters and apply those counters to ifp, but this seems non-trivial.

Another rtalloc(9) problem is the fact that the radix tree is used as both the 'control plane' and the 'data plane' structure/API. Some users always want to put more information into the rte, while others want to make the rte more compact. We just need _different_ structures for that: a feature-rich, lots-of-data control plane structure (to store everything we want to store, including, for example, the PID of the process originating the route); the current radix can be modified to do this. And a separate, address-family-dependent structure (array, trie, or anything else) containing _only_ the data necessary to put the packet on the wire.

11) arpresolve. Currently (this was decoupled in 8.x) we take a) an ifaddr rlock and b) an lle rlock. We don't need those locks. We need to: a) make the lle layer per-interface instead of global (this can also solve the issue of multiple FIBs having their L2 mappings done in fib 0); b) use the rtalloc(9)-provided lock instead of separate locking; c) actually, rewrite this layer, because d) lle is the place to do real multipath. Briefly: you have an rte pointing to a special nexthop structure, which points to an lle holding the following data: num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to prepend to the header. A separate post will follow.
With this, we can achieve lagg traffic distribution without actually using lagg_transmit and similar code (at least in the most common scenarios). (For example, TCP output can definitely benefit from this, since we can compute the flowid once per TCP session and use it in every mbuf.)

So, imagine we have done all this. How can we estimate the difference?

There was a thread, started a year ago, describing 'stock' performance and the difference made by various modifications. It was done on 8.x; however, I've got similar results on recent 9.x: http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html

Briefly: 2xE5645 with an Intel 82599 NIC. Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, no firewall. Ixia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends 64-byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations in vlan11 (10.100.1.128 - 10.100.1.192). Static ARP entries are configured for all destination addresses. The traffic level is slightly above or slightly below system capacity.

We start from 1.4 MPPS (if we are using several routes to minimize mutex contention).

My 'current' result for the same test, on the same HW, with the following modifications:

* 1) ixgbe per-packet ring unlock removed
* P1) ixgbe modified to do direct vlan input (so 2, 3 are not used)
* 4) separate lockless in_localip() version
* 6) using the existing pfil lock
* 7) using the lockless version
* 8) radix converted to use an rmlock instead of an rlock; delayed GC is used instead of mutexes
* 10) using the existing pfil lock
* 11) using the radix lock to do arpresolve(); not using the lle rlock

(so the rmlocks are the only locks used on the data path).

Additionally: ipstat counters are converted to PCPU (no real performance implications); ixgbe does not do per-packet accounting (as in head);
if_vlan counters are converted to PCPU; lagg is converted to rmlock, with per-packet accounting removed (using stats from the underlying interfaces); the lle hash size is bumped to 1024 instead of 32 (not applicable here, but the small size slows things down for large L2 domains).

The result is 5.6 MPPS for a single port (11 cores) and 6.5 MPPS for lagg (16 cores), and nearly the same with HT on and 22 cores.

...while Intel DPDK claims 80 MPPS (and 6WINDGate talks about 160 or so) on the same class of hardware, with _userland_ forwarding.

One of the key features making all such products possible (DPDK, netmap, PacketShader, Cisco SW forwarding) is the use of batching instead of the process-to-completion model. Batching mitigates locking cost, does not wash out the CPU cache, and so on. So maybe we can consider passing batches from the NIC to at least the L2 layer via netisr? Or even up to ip_input()?

Another question is about making some sort of reliable GC, like "passive serialization" (or other similarly hard-to-pronounce words from the Linux world of lockless objects).

P.S. The attached patches are 1) for 8.x and 2) mostly 'hacks' showing roughly how this can be done and what benefit can be achieved.

--------------010308000904000207080306
Content-Type: text/plain; charset=UTF-8; name="1_ixgbe_unlock.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="1_ixgbe_unlock.diff"

commit 20a52503455c80cd149d2232bdc0d37e14381178
Author: Charlie Root
Date:   Tue Oct 23 21:20:13 2012 +0000

    Remove RX ring unlock/lock before calling if_input() from ixgbe drivers.
diff --git a/sys/dev/ixgbe/ixgbe.c b/sys/dev/ixgbe/ixgbe.c
index 5d8752b..fc1491e 100644
--- a/sys/dev/ixgbe/ixgbe.c
+++ b/sys/dev/ixgbe/ixgbe.c
@@ -4171,9 +4171,7 @@ ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *ifp, struct mbuf *m, u32 ptype
 		if (tcp_lro_rx(&rxr->lro, m, 0) == 0)
 			return;
 	}
-	IXGBE_RX_UNLOCK(rxr);
 	(*ifp->if_input)(ifp, m);
-	IXGBE_RX_LOCK(rxr);
 }
 
 static __inline void

--------------010308000904000207080306
Content-Type: text/plain; charset=UTF-8; name="2_ixgbe_vlans2.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="2_ixgbe_vlans2.diff"

Index: sys/dev/ixgbe/ixgbe.c
===================================================================
--- sys/dev/ixgbe/ixgbe.c	(revision 248704)
+++ sys/dev/ixgbe/ixgbe.c	(working copy)
@@ -2880,6 +2880,14 @@ ixgbe_allocate_queues(struct adapter *adapter)
 			error = ENOMEM;
 			goto err_rx_desc;
 		}
+
+		if ((rxr->vlans = malloc(sizeof(struct ifvlans), M_DEVBUF,
+		    M_NOWAIT | M_ZERO)) == NULL) {
+			device_printf(dev,
+			    "Critical Failure setting up vlan index\n");
+			error = ENOMEM;
+			goto err_rx_desc;
+		}
 	}
 
 	/*
@@ -4271,6 +4279,11 @@ ixgbe_free_receive_buffers(struct rx_ring *rxr)
 		rxr->ptag = NULL;
 	}
 
+	if (rxr->vlans != NULL) {
+		free(rxr->vlans, M_DEVBUF);
+		rxr->vlans = NULL;
+	}
+
 	return;
 }
 
@@ -4303,7 +4316,7 @@ ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *
 		return;
 	}
 	IXGBE_RX_UNLOCK(rxr);
-	(*ifp->if_input)(ifp, m);
+	(*ifp->if_input)(m->m_pkthdr.rcvif, m);
 	IXGBE_RX_LOCK(rxr);
 }
 
@@ -4360,6 +4373,7 @@ ixgbe_rxeof(struct ix_queue *que)
 	u16 count = rxr->process_limit;
 	union ixgbe_adv_rx_desc *cur;
 	struct ixgbe_rx_buf *rbuf, *nbuf;
+	struct ifnet *ifp_dst;
 
 	IXGBE_RX_LOCK(rxr);
 
@@ -4522,9 +4536,19 @@ ixgbe_rxeof(struct ix_queue *que)
 		    (staterr & IXGBE_RXD_STAT_VP))
 			vtag = le16toh(cur->wb.upper.vlan);
 		if (vtag) {
-			sendmp->m_pkthdr.ether_vtag = vtag;
-			sendmp->m_flags |= M_VLANTAG;
-		}
+			ifp_dst = rxr->vlans->idx[EVL_VLANOFTAG(vtag)];
+
+			if (ifp_dst != NULL) {
+				ifp_dst->if_ipackets++;
+
sendmp->m_pkthdr.rcvif = ifp_dst; + } else { + sendmp->m_pkthdr.ether_vtag = vtag; + sendmp->m_flags |= M_VLANTAG; + sendmp->m_pkthdr.rcvif = ifp; + } + } else + sendmp->m_pkthdr.rcvif = ifp; + if ((ifp->if_capenable & IFCAP_RXCSUM) != 0) ixgbe_rx_checksum(staterr, sendmp, ptype); #if __FreeBSD_version >= 800000 @@ -4625,7 +4649,32 @@ ixgbe_rx_checksum(u32 staterr, struct mbuf * mp, u return; } +/* + * This routine gets real vlan ifp based on + * underlying ifp and vlan tag. + */ +static struct ifnet * +ixgbe_get_vlan(struct ifnet *ifp, uint16_t vtag) +{ + /* XXX: IFF_MONITOR */ +#if 0 + struct lagg_port *lp = ifp->if_lagg; + struct lagg_softc *sc = lp->lp_softc; + + /* Skip lagg nesting */ + while (ifp->if_type == IFT_IEEE8023ADLAG) { + lp = ifp->if_lagg; + sc = lp->lp_softc; + ifp = sc->sc_ifp; + } +#endif + /* Get vlan interface based on tag */ + ifp = VLAN_DEVAT(ifp, vtag); + + return (ifp); +} + /* ** This routine is run via an vlan config EVENT, ** it enables us to use the HW Filter table since @@ -4637,7 +4686,9 @@ static void ixgbe_register_vlan(void *arg, struct ifnet *ifp, u16 vtag) { struct adapter *adapter = ifp->if_softc; - u16 index, bit; + u16 index, bit, j; + struct rx_ring *rxr; + struct ifnet *ifv; if (ifp->if_softc != arg) /* Not our event */ return; @@ -4645,7 +4696,20 @@ ixgbe_register_vlan(void *arg, struct ifnet *ifp, if ((vtag == 0) || (vtag > 4095)) /* Invalid */ return; + ifv = ixgbe_get_vlan(ifp, vtag); + IXGBE_CORE_LOCK(adapter); + + if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) { + rxr = adapter->rx_rings; + + for (j = 0; j < adapter->num_queues; j++, rxr++) { + IXGBE_RX_LOCK(rxr); + rxr->vlans->idx[vtag] = ifv; + IXGBE_RX_UNLOCK(rxr); + } + } + index = (vtag >> 5) & 0x7F; bit = vtag & 0x1F; adapter->shadow_vfta[index] |= (1 << bit); @@ -4663,7 +4727,8 @@ static void ixgbe_unregister_vlan(void *arg, struct ifnet *ifp, u16 vtag) { struct adapter *adapter = ifp->if_softc; - u16 index, bit; + u16 index, bit, j; + struct rx_ring *rxr; if 
(ifp->if_softc != arg) return; @@ -4672,6 +4737,15 @@ ixgbe_unregister_vlan(void *arg, struct ifnet *ifp return; IXGBE_CORE_LOCK(adapter); + + rxr = adapter->rx_rings; + + for (j = 0; j < adapter->num_queues; j++, rxr++) { + IXGBE_RX_LOCK(rxr); + rxr->vlans->idx[vtag] = NULL; + IXGBE_RX_UNLOCK(rxr); + } + index = (vtag >> 5) & 0x7F; bit = vtag & 0x1F; adapter->shadow_vfta[index] &= ~(1 << bit); @@ -4686,8 +4760,8 @@ ixgbe_setup_vlan_hw_support(struct adapter *adapte { struct ifnet *ifp = adapter->ifp; struct ixgbe_hw *hw = &adapter->hw; + u32 ctrl, j; struct rx_ring *rxr; - u32 ctrl; /* @@ -4713,6 +4787,15 @@ ixgbe_setup_vlan_hw_support(struct adapter *adapte if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) { ctrl &= ~IXGBE_VLNCTRL_CFIEN; ctrl |= IXGBE_VLNCTRL_VFE; + } else { + /* Zero vlan table */ + rxr = adapter->rx_rings; + + for (j = 0; j < adapter->num_queues; j++, rxr++) { + IXGBE_RX_LOCK(rxr); + memset(rxr->vlans->idx, 0, sizeof(struct ifvlans)); + IXGBE_RX_UNLOCK(rxr); + } } if (hw->mac.type == ixgbe_mac_82598EB) ctrl |= IXGBE_VLNCTRL_VME; Index: sys/dev/ixgbe/ixgbe.h =================================================================== --- sys/dev/ixgbe/ixgbe.h (revision 248704) +++ sys/dev/ixgbe/ixgbe.h (working copy) @@ -284,6 +284,11 @@ struct ix_queue { u64 irqs; }; +struct ifvlans { + struct ifnet *idx[4096]; +}; + + /* * The transmit ring, one per queue */ @@ -307,7 +312,6 @@ struct tx_ring { } queue_status; u32 txd_cmd; bus_dma_tag_t txtag; - char mtx_name[16]; #ifndef IXGBE_LEGACY_TX struct buf_ring *br; struct task txq_task; @@ -324,6 +328,7 @@ struct tx_ring { unsigned long no_tx_dma_setup; u64 no_desc_avail; u64 total_packets; + char mtx_name[16]; }; @@ -346,8 +351,8 @@ struct rx_ring { u16 num_desc; u16 mbuf_sz; u16 process_limit; - char mtx_name[16]; struct ixgbe_rx_buf *rx_buffers; + struct ifvlans *vlans; bus_dma_tag_t ptag; u32 bytes; /* Used for AIM calc */ @@ -363,6 +368,7 @@ struct rx_ring { #ifdef IXGBE_FDIR u64 flm; #endif + char 
mtx_name[16]; }; /* Our adapter structure */ --------------010308000904000207080306 Content-Type: text/plain; charset=UTF-8; name="3_in_localip_fast.diff" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="3_in_localip_fast.diff" commit 7f1103ac622881182642b2d3ae17b6ff484c1293 Author: Charlie Root Date: Sun Apr 7 23:50:26 2013 +0000 Use lockles in_localip_fast() function. diff --git a/sys/net/route.h b/sys/net/route.h index 4d9371b..f588f03 100644 --- a/sys/net/route.h +++ b/sys/net/route.h @@ -365,6 +365,7 @@ void rt_maskedcopy(struct sockaddr *, struct sockaddr *, struct sockaddr *); */ #define RTGC_ROUTE 1 #define RTGC_IF 3 +#define RTGC_IFADDR 4 int rtexpunge(struct rtentry *); diff --git a/sys/netinet/in.c b/sys/netinet/in.c index 5341918..a83b8a9 100644 --- a/sys/netinet/in.c +++ b/sys/netinet/in.c @@ -93,6 +93,20 @@ VNET_DECLARE(struct inpcbinfo, ripcbinfo); VNET_DECLARE(struct arpstat, arpstat); /* ARP statistics, see if_arp.h */ #define V_arpstat VNET(arpstat) +struct in_ifaddrf { + struct in_ifaddrf *next; + struct in_addr addr; +}; + +struct in_ifaddrhashf { + uint32_t hmask; + uint32_t count; + struct in_ifaddrf **hash; +}; + +VNET_DEFINE(struct in_ifaddrhashf *, in_ifaddrhashtblf) = NULL; /* inet addr fast hash table */ +#define V_in_ifaddrhashtblf VNET(in_ifaddrhashtblf) + /* * Return 1 if an internet address is for a ``local'' host * (one to which we have a connection). 
If subnetsarelocal @@ -145,6 +159,120 @@ in_localip(struct in_addr in) return (0); } +int +in_localip_fast(struct in_addr in) +{ + struct in_ifaddrf *rec; + struct in_ifaddrhashf *f; + + if ((f = V_in_ifaddrhashtblf) == NULL) + return (0); + + rec = f->hash[INADDR_HASHVAL(in) & f->hmask]; + + while (rec != NULL && rec->addr.s_addr != in.s_addr) + rec = rec->next; + + if (rec != NULL) + return (1); + + return (0); +} + +struct in_ifaddrhashf * +in_hash_alloc(int additional) +{ + int count, hsize, i; + struct in_ifaddr *ia; + struct in_ifaddrhashf *new; + + count = additional + 1; + + IN_IFADDR_RLOCK(); + for (i = 0; i < INADDR_NHASH; i++) { + LIST_FOREACH(ia, &V_in_ifaddrhashtbl[i], ia_hash) + count++; + } + IN_IFADDR_RUNLOCK(); + + /* roundup to the next power of 2 */ + hsize = (1UL << flsl(count - 1)); + + new = malloc(sizeof(struct in_ifaddrhashf) + + sizeof(void *) * hsize + + sizeof(struct in_ifaddrf) * count, M_IFADDR, + M_NOWAIT | M_ZERO); + + if (new == NULL) + return (NULL); + + new->count = count; + new->hmask = hsize - 1; + new->hash = (struct in_ifaddrf **)(new + 1); + + return (new); +} + +int +in_hash_build(struct in_ifaddrhashf *new) +{ + struct in_ifaddr *ia; + int i, j, count, hsize, r; + struct in_ifaddrhashf *old; + struct in_ifaddrf *rec, *tmp; + + count = new->count - 1; + hsize = new->hmask + 1; + rec = (struct in_ifaddrf *)&new->hash[hsize]; + + IN_IFADDR_RLOCK(); + for (i = 0; i < INADDR_NHASH; i++) { + LIST_FOREACH(ia, &V_in_ifaddrhashtbl[i], ia_hash) { + rec->addr.s_addr = IA_SIN(ia)->sin_addr.s_addr; + + j = INADDR_HASHVAL(rec->addr) & new->hmask; + if ((tmp = new->hash[j]) == NULL) + new->hash[j] = rec; + else { + while (tmp->next) + tmp = tmp->next; + tmp->next = rec; + } + + rec++; + count--; + + /* End of memory */ + if (count < 0) + break; + } + + /* End of memory */ + if (count < 0) + break; + } + IN_IFADDR_RUNLOCK(); + + /* If count >0 then we succeeded in building hash. 
Stop cycle */ + + if (count >= 0) { + old = V_in_ifaddrhashtblf; + V_in_ifaddrhashtblf = new; + + rtgc_free(RTGC_IFADDR, old, 0); + + return (1); + } + + /* Fail. */ + if (new) + free(new, M_IFADDR); + + return (0); +} + + + /* * Determine whether an IP address is in a reserved set of addresses * that may not be forwarded, or whether datagrams to that destination @@ -239,6 +367,7 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, struct sockaddr_in oldaddr; int error, hostIsNew, iaIsNew, maskIsNew; int iaIsFirst; + struct in_ifaddrhashf *new_hash; ia = NULL; iaIsFirst = 0; @@ -405,6 +534,11 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, goto out; } + if ((new_hash = in_hash_alloc(1)) == NULL) { + error = ENOBUFS; + goto out; + } + ifa = &ia->ia_ifa; ifa_init(ifa); ifa->ifa_addr = (struct sockaddr *)&ia->ia_addr; @@ -427,6 +561,8 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, IN_IFADDR_WLOCK(); TAILQ_INSERT_TAIL(&V_in_ifaddrhead, ia, ia_link); IN_IFADDR_WUNLOCK(); + + in_hash_build(new_hash); iaIsNew = 1; } break; @@ -649,6 +785,8 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, ifa_free(&if_ia->ia_ifa); } else IN_IFADDR_WUNLOCK(); + if ((new_hash = in_hash_alloc(0)) != NULL) + in_hash_build(new_hash); ifa_free(&ia->ia_ifa); /* in_ifaddrhead */ out: if (ia != NULL) @@ -852,6 +990,7 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin, register u_long i = ntohl(sin->sin_addr.s_addr); struct sockaddr_in oldaddr; int s = splimp(), flags = RTF_UP, error = 0; + struct in_ifaddrhashf *new_hash; oldaddr = ia->ia_addr; if (oldaddr.sin_family == AF_INET) @@ -862,6 +1001,9 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin, LIST_INSERT_HEAD(INADDR_HASH(ia->ia_addr.sin_addr.s_addr), ia, ia_hash); IN_IFADDR_WUNLOCK(); + + if ((new_hash = in_hash_alloc(1)) != NULL) + in_hash_build(new_hash); } /* * Give the 
interface a chance to initialize @@ -887,6 +1029,8 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin, */ LIST_REMOVE(ia, ia_hash); IN_IFADDR_WUNLOCK(); + if ((new_hash = in_hash_alloc(1)) != NULL) + in_hash_build(new_hash); return (error); } } diff --git a/sys/netinet/in.h b/sys/netinet/in.h index b03e74c..948938a 100644 --- a/sys/netinet/in.h +++ b/sys/netinet/in.h @@ -741,6 +741,7 @@ int in_broadcast(struct in_addr, struct ifnet *); int in_canforward(struct in_addr); int in_localaddr(struct in_addr); int in_localip(struct in_addr); +int in_localip_fast(struct in_addr); int inet_aton(const char *, struct in_addr *); /* in libkern */ char *inet_ntoa(struct in_addr); /* in libkern */ char *inet_ntoa_r(struct in_addr ina, char *buf); /* in libkern */ diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c index 692e3e5..f7734a9 100644 --- a/sys/netinet/ip_fastfwd.c +++ b/sys/netinet/ip_fastfwd.c @@ -347,7 +347,7 @@ ip_fastforward(struct mbuf *m) /* * Is it for a local address on this host? */ - if (in_localip(ip->ip_dst)) + if (in_localip_fast(ip->ip_dst)) return m; //IPSTAT_INC(ips_total); @@ -390,7 +390,7 @@ ip_fastforward(struct mbuf *m) /* * Is it now for a local address on this host? */ - if (in_localip(dest)) + if (in_localip_fast(dest)) goto forwardlocal; /* * Go on with new destination address @@ -479,7 +479,7 @@ passin: /* * Is it now for a local address on this host? */ - if (m->m_flags & M_FASTFWD_OURS || in_localip(dest)) { + if (m->m_flags & M_FASTFWD_OURS || in_localip_fast(dest)) { forwardlocal: /* * Return packet for processing by ip_input(). 
diff --git a/sys/netinet/ipfw/ip_fw2.c b/sys/netinet/ipfw/ip_fw2.c index b76a638..53f6e97 100644 --- a/sys/netinet/ipfw/ip_fw2.c +++ b/sys/netinet/ipfw/ip_fw2.c @@ -1450,10 +1450,7 @@ do { \ case O_IP_SRC_ME: if (is_ipv4) { - struct ifnet *tif; - - INADDR_TO_IFP(src_ip, tif); - match = (tif != NULL); + match = in_localip_fast(src_ip); break; } #ifdef INET6 @@ -1490,10 +1487,7 @@ do { \ case O_IP_DST_ME: if (is_ipv4) { - struct ifnet *tif; - - INADDR_TO_IFP(dst_ip, tif); - match = (tif != NULL); + match = in_localip_fast(dst_ip); break; } #ifdef INET6 diff --git a/sys/netinet/ipfw/ip_fw_pfil.c b/sys/netinet/ipfw/ip_fw_pfil.c index a21f501..bdf8beb 100644 --- a/sys/netinet/ipfw/ip_fw_pfil.c +++ b/sys/netinet/ipfw/ip_fw_pfil.c @@ -184,7 +184,7 @@ again: bcopy(args.next_hop, (fwd_tag+1), sizeof(struct sockaddr_in)); m_tag_prepend(*m0, fwd_tag); - if (in_localip(args.next_hop->sin_addr)) + if (in_localip_fast(args.next_hop->sin_addr)) (*m0)->m_flags |= M_FASTFWD_OURS; } #endif /* INET || INET6 */ --------------010308000904000207080306 Content-Type: text/plain; charset=UTF-8; name="80_use_rtgc.diff" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="80_use_rtgc.diff" commit 67a74d91a7b4a47a83fcfa5e79a6c6f0b4b1122d Author: Charlie Root Date: Fri Oct 26 17:10:52 2012 +0000 Remove rte locking for IPv4. Remove one of 2 locks from IPv6 rtes diff --git a/sys/net/if.c b/sys/net/if.c index a875326..eb6a723 100644 --- a/sys/net/if.c +++ b/sys/net/if.c @@ -487,6 +487,13 @@ if_alloc(u_char type) return (ifp); } + +void +if_free_real(struct ifnet *ifp) +{ + free(ifp, M_IFNET); +} + /* * Do the actual work of freeing a struct ifnet, and layer 2 common * structure. 
This call is made when the last reference to an @@ -499,6 +506,15 @@ if_free_internal(struct ifnet *ifp) KASSERT((ifp->if_flags & IFF_DYING), ("if_free_internal: interface not dying")); + if (rtgc_is_enabled()) { + /* + * FIXME: Sleep some time to permit packets + * using fastforwarding routine without locking + * die withour side effects. + */ + pause("if_free_gc", hz / 20); /* Sleep 50 milliseconds */ + } + if (if_com_free[ifp->if_alloctype] != NULL) if_com_free[ifp->if_alloctype](ifp->if_l2com, ifp->if_alloctype); @@ -511,7 +527,10 @@ if_free_internal(struct ifnet *ifp) IF_AFDATA_DESTROY(ifp); IF_ADDR_LOCK_DESTROY(ifp); ifq_delete(&ifp->if_snd); - free(ifp, M_IFNET); + if (rtgc_is_enabled()) + rtgc_free(RTGC_IF, ifp, 0); + else + if_free_real(ifp); } /* diff --git a/sys/net/if_var.h b/sys/net/if_var.h index 39c499f..5ef6264 100644 --- a/sys/net/if_var.h +++ b/sys/net/if_var.h @@ -857,6 +857,7 @@ void if_down(struct ifnet *); struct ifmultiaddr * if_findmulti(struct ifnet *, struct sockaddr *); void if_free(struct ifnet *); +void if_free_real(struct ifnet *); void if_free_type(struct ifnet *, u_char); void if_initname(struct ifnet *, const char *, int); void if_link_state_change(struct ifnet *, int); diff --git a/sys/net/route.c b/sys/net/route.c index 3059f5a..97965b3 100644 --- a/sys/net/route.c +++ b/sys/net/route.c @@ -142,6 +142,175 @@ VNET_DEFINE(int, rttrash); /* routes not in table but not freed */ static VNET_DEFINE(uma_zone_t, rtzone); /* Routing table UMA zone. 
*/ #define V_rtzone VNET(rtzone) +SYSCTL_NODE(_net, OID_AUTO, gc, CTLFLAG_RW, 0, "Garbage collector"); + +MALLOC_DEFINE(M_RTGC, "rtgc", "route GC"); +void rtgc_func(void *_unused); +void rtfree_real(struct rtentry *rt); + +int _rtgc_default_enabled = 1; +TUNABLE_INT("net.gc.enable", &_rtgc_default_enabled); + +#define RTGC_CALLOUT_DELAY 1 +#define RTGC_EXPIRE_DELAY 3 + +VNET_DEFINE(struct mtx, rtgc_mtx); +#define V_rtgc_mtx VNET(rtgc_mtx) +VNET_DEFINE(struct callout, rtgc_callout); +#define V_rtgc_callout VNET(rtgc_callout) +VNET_DEFINE(int, rtgc_enabled); +#define V_rtgc_enabled VNET(rtgc_enabled) +SYSCTL_VNET_INT(_net_gc, OID_AUTO, enable, CTLFLAG_RW, + &VNET_NAME(rtgc_enabled), 1, + "Enable garbage collector"); +VNET_DEFINE(int, rtgc_expire_delay) = RTGC_EXPIRE_DELAY; +#define V_rtgc_expire_delay VNET(rtgc_expire_delay) +SYSCTL_VNET_INT(_net_gc, OID_AUTO, expire, CTLFLAG_RW, + &VNET_NAME(rtgc_expire_delay), 1, + "Object expiration delay"); +VNET_DEFINE(int, rtgc_numfailures); +#define V_rtgc_numfailures VNET(rtgc_numfailures) +SYSCTL_VNET_INT(_net_gc, OID_AUTO, failures, CTLFLAG_RD, + &VNET_NAME(rtgc_numfailures), 0, + "Number of objects leaked from route garbage collector"); +VNET_DEFINE(int, rtgc_numqueued); +#define V_rtgc_numqueued VNET(rtgc_numqueued) +SYSCTL_VNET_INT(_net_gc, OID_AUTO, queued, CTLFLAG_RD, + &VNET_NAME(rtgc_numqueued), 0, + "Number of objects queued for deletion"); +VNET_DEFINE(int, rtgc_numfreed); +#define V_rtgc_numfreed VNET(rtgc_numfreed) +SYSCTL_VNET_INT(_net_gc, OID_AUTO, freed, CTLFLAG_RD, + &VNET_NAME(rtgc_numfreed), 0, + "Number of objects deleted"); +VNET_DEFINE(int, rtgc_numinvoked); +#define V_rtgc_numinvoked VNET(rtgc_numinvoked) +SYSCTL_VNET_INT(_net_gc, OID_AUTO, invoked, CTLFLAG_RD, + &VNET_NAME(rtgc_numinvoked), 0, + "Number of times GC was invoked"); + +struct rtgc_item { + time_t expire; /* Whe we can delete this entry */ + int etype; /* Entry type */ + void *data; /* data to free */ + TAILQ_ENTRY(rtgc_item) items; +}; + 
+VNET_DEFINE(TAILQ_HEAD(, rtgc_item), rtgc_queue);
+#define	V_rtgc_queue	VNET(rtgc_queue)
+
+int
+rtgc_is_enabled()
+{
+	return V_rtgc_enabled;
+}
+
+void
+rtgc_func(void *_unused)
+{
+	struct rtgc_item *item, *temp_item;
+	TAILQ_HEAD(, rtgc_item) rtgc_tq;
+	int empty, deleted;
+
+	CTR2(KTR_NET, "%s: started with %d objects", __func__, V_rtgc_numqueued);
+
+	TAILQ_INIT(&rtgc_tq);
+
+	/* Move all contents of current queue to new empty queue */
+	mtx_lock(&V_rtgc_mtx);
+	V_rtgc_numinvoked++;
+	TAILQ_SWAP(&rtgc_queue, &rtgc_tq, rtgc_item, items);
+	mtx_unlock(&V_rtgc_mtx);
+
+	deleted = 0;
+
+	/* Dispatch as much as we can */
+	TAILQ_FOREACH_SAFE(item, &rtgc_tq, items, temp_item) {
+		if (item->expire > time_uptime)
+			break;
+
+		/* We can definitely delete this item */
+		TAILQ_REMOVE(&rtgc_tq, item, items);
+
+		switch (item->etype) {
+		case RTGC_ROUTE:
+			CTR1(KTR_NET, "Freeing route structure %p", item->data);
+			rtfree_real((struct rtentry *)item->data);
+			break;
+		case RTGC_IF:
+			CTR1(KTR_NET, "Freeing iface structure %p", item->data);
+			if_free_real((struct ifnet *)item->data);
+			break;
+		default:
+			CTR2(KTR_NET, "Unknown type: %d %p", item->etype, item->data);
+			break;
+		}
+
+		/* Remove item itself */
+		free(item, M_RTGC);
+		deleted++;
+	}
+
+	/*
+	 * Add remaining data back to the main queue.
+	 * Note items are still sorted by time_uptime after merge.
+	 */
+	mtx_lock(&V_rtgc_mtx);
+	/* Add new items to the end of our temporary queue */
+	TAILQ_CONCAT(&rtgc_tq, &rtgc_queue, items);
+	/* Move items back to stable storage */
+	TAILQ_SWAP(&rtgc_queue, &rtgc_tq, rtgc_item, items);
+	/* Check if we need to run callout another time */
+	empty = TAILQ_EMPTY(&rtgc_queue);
+	/* Update counters */
+	V_rtgc_numfreed += deleted;
+	V_rtgc_numqueued -= deleted;
+	mtx_unlock(&V_rtgc_mtx);
+
+	CTR4(KTR_NET, "%s: ended with %d object(s) (%d deleted), callout: %s",
+	    __func__, V_rtgc_numqueued, deleted, empty ? "stopped" : "scheduled");
+	/* Schedule ourselves iff there are items to delete */
+	if (!empty)
+		callout_reset(&V_rtgc_callout, hz * RTGC_CALLOUT_DELAY, rtgc_func, NULL);
+}
+
+void
+rtgc_free(int etype, void *data, int can_sleep)
+{
+	struct rtgc_item *item;
+
+	item = malloc(sizeof(struct rtgc_item), M_RTGC,
+	    (can_sleep ? M_WAITOK : M_NOWAIT) | M_ZERO);
+	if (item == NULL) {
+		V_rtgc_numfailures++;	/* XXX: locking */
+		/* Skip route freeing: a memory leak is much better than a panic */
+		return;
+	}
+
+	item->expire = time_uptime + V_rtgc_expire_delay;
+	item->etype = etype;
+	item->data = data;
+
+	if ((!can_sleep) && (mtx_trylock(&V_rtgc_mtx) == 0)) {
+		/* Failed to acquire the lock; record another leak */
+		free(item, M_RTGC);
+		V_rtgc_numfailures++;	/* XXX: locking */
+		return;
+	}
+
+	if (can_sleep)
+		mtx_lock(&V_rtgc_mtx);
+
+	TAILQ_INSERT_TAIL(&rtgc_queue, item, items);
+	V_rtgc_numqueued++;
+
+	mtx_unlock(&V_rtgc_mtx);
+
+	/* Schedule callout if not running */
+	if (!callout_pending(&V_rtgc_callout))
+		callout_reset(&V_rtgc_callout, hz * RTGC_CALLOUT_DELAY, rtgc_func, NULL);
+}
+
 /*
  * handler for net.my_fibnum
  */
@@ -241,6 +410,17 @@ vnet_route_init(const void *unused __unused)
 			dom->dom_rtattach((void **)rnh, dom->dom_rtoffset);
 		}
 	}
+
+	/* Init garbage collector */
+	mtx_init(&V_rtgc_mtx, "routeGC", NULL, MTX_DEF);
+	/* Init queue */
+	TAILQ_INIT(&V_rtgc_queue);
+	/* Init garbage callout */
+	memset(&V_rtgc_callout, 0, sizeof(rtgc_callout));
+	callout_init(&V_rtgc_callout, 1);
+	/* Set default from loader tunable */
+	V_rtgc_enabled = _rtgc_default_enabled;
+	//callout_reset(&V_rtgc_callout, 3 * hz, &rtgc_func, NULL);
 }
 VNET_SYSINIT(vnet_route_init, SI_SUB_PROTO_DOMAIN, SI_ORDER_FOURTH,
     vnet_route_init, 0);
@@ -351,6 +531,74 @@ rtalloc1(struct sockaddr *dst, int report, u_long ignflags)
 }
 
 struct rtentry *
+rtalloc1_fib_nolock(struct sockaddr *dst, int report, u_long ignflags,
+    u_int fibnum)
+{
+	struct radix_node_head *rnh;
+	struct radix_node *rn;
+	struct rtentry *newrt;
+	struct rt_addrinfo info;
+	int err = 0, msgtype = RTM_MISS;
+	int needlock;
+
+	KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
+	switch (dst->sa_family) {
+	case AF_INET6:
+	case AF_INET:
+		/* We support multiple FIBs. */
+		break;
+	default:
+		fibnum = RT_DEFAULT_FIB;
+		break;
+	}
+	rnh = rt_tables_get_rnh(fibnum, dst->sa_family);
+	newrt = NULL;
+	if (rnh == NULL)
+		goto miss;
+
+	/*
+	 * Look up the address in the table for that Address Family
+	 */
+	needlock = !(ignflags & RTF_RNH_LOCKED);
+	if (needlock)
+		RADIX_NODE_HEAD_RLOCK(rnh);
+#ifdef INVARIANTS
+	else
+		RADIX_NODE_HEAD_LOCK_ASSERT(rnh);
+#endif
+	rn = rnh->rnh_matchaddr(dst, rnh);
+	if (rn && ((rn->rn_flags & RNF_ROOT) == 0)) {
+		newrt = RNTORT(rn);
+		if (needlock)
+			RADIX_NODE_HEAD_RUNLOCK(rnh);
+		goto done;
+
+	} else if (needlock)
+		RADIX_NODE_HEAD_RUNLOCK(rnh);
+
+	/*
+	 * Either we hit the root or couldn't find any match,
+	 * which basically means
+	 * "caint get there frm here"
+	 */
+miss:
+	V_rtstat.rts_unreach++;
+
+	if (report) {
+		/*
+		 * If required, report the failure to the supervising
+		 * authorities.
+		 * For a delete, this is not an error. (report == 0)
+		 */
+		bzero(&info, sizeof(info));
+		info.rti_info[RTAX_DST] = dst;
+		rt_missmsg_fib(msgtype, &info, 0, err, fibnum);
+	}
+done:
+	return (newrt);
+}
+
+struct rtentry *
 rtalloc1_fib(struct sockaddr *dst, int report, u_long ignflags,
     u_int fibnum)
 {
@@ -422,6 +670,23 @@ done:
 	return (newrt);
 }
 
+
+void
+rtfree_real(struct rtentry *rt)
+{
+	/*
+	 * The key is separately alloc'd so free it (see rt_setgate()).
+	 * This also frees the gateway, as they are always malloc'd
+	 * together.
+	 */
+	Free(rt_key(rt));
+
+	/*
+	 * and the rtentry itself of course
+	 */
+	uma_zfree(V_rtzone, rt);
+}
+
 /*
  * Remove a reference count from an rtentry.
* If the count gets low enough, take it out of the routing table @@ -484,18 +749,13 @@ rtfree(struct rtentry *rt) */ if (rt->rt_ifa) ifa_free(rt->rt_ifa); - /* - * The key is separatly alloc'd so free it (see rt_setgate()). - * This also frees the gateway, as they are always malloc'd - * together. - */ - Free(rt_key(rt)); - /* - * and the rtentry itself of course - */ RT_LOCK_DESTROY(rt); - uma_zfree(V_rtzone, rt); + + if (V_rtgc_enabled) + rtgc_free(RTGC_ROUTE, rt, 0); + else + rtfree_real(rt); return; } done: diff --git a/sys/net/route.h b/sys/net/route.h index b26ac44..3aa694d 100644 --- a/sys/net/route.h +++ b/sys/net/route.h @@ -363,9 +363,14 @@ void rt_maskedcopy(struct sockaddr *, struct sockaddr *, struct sockaddr *); * * RTFREE() uses an unlocked entry. */ +#define RTGC_ROUTE 1 +#define RTGC_IF 3 + int rtexpunge(struct rtentry *); void rtfree(struct rtentry *); +void rtgc_free(int etype, void *data, int can_sleep); +int rtgc_is_enabled(void); int rt_check(struct rtentry **, struct rtentry **, struct sockaddr *); /* XXX MRT COMPAT VERSIONS THAT SET UNIVERSE to 0 */ @@ -394,6 +399,7 @@ int rt_getifa_fib(struct rt_addrinfo *, u_int fibnum); void rtalloc_ign_fib(struct route *ro, u_long ignflags, u_int fibnum); void rtalloc_fib(struct route *ro, u_int fibnum); struct rtentry *rtalloc1_fib(struct sockaddr *, int, u_long, u_int); +struct rtentry *rtalloc1_fib_nolock(struct sockaddr *, int, u_long, u_int); int rtioctl_fib(u_long, caddr_t, u_int); void rtredirect_fib(struct sockaddr *, struct sockaddr *, struct sockaddr *, int, struct sockaddr *, u_int); diff --git a/sys/netinet/in_rmx.c b/sys/netinet/in_rmx.c index 1389873..1c9d9db 100644 --- a/sys/netinet/in_rmx.c +++ b/sys/netinet/in_rmx.c @@ -122,12 +122,12 @@ in_matroute(void *v_arg, struct radix_node_head *head) struct rtentry *rt = (struct rtentry *)rn; if (rt) { - RT_LOCK(rt); +// RT_LOCK(rt); if (rt->rt_flags & RTPRF_OURS) { rt->rt_flags &= ~RTPRF_OURS; rt->rt_rmx.rmx_expire = 0; } - RT_UNLOCK(rt); +// 
RT_UNLOCK(rt); } return rn; } @@ -365,7 +365,7 @@ in_inithead(void **head, int off) rnh = *head; rnh->rnh_addaddr = in_addroute; - rnh->rnh_matchaddr = in_matroute; + rnh->rnh_matchaddr = rn_match; rnh->rnh_close = in_clsroute; if (_in_rt_was_here == 0 ) { callout_init(&V_rtq_timer, CALLOUT_MPSAFE); diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c index d7fe411..d2b98b3 100644 --- a/sys/netinet/ip_fastfwd.c +++ b/sys/netinet/ip_fastfwd.c @@ -112,6 +112,22 @@ static VNET_DEFINE(int, ipfastforward_active); SYSCTL_VNET_INT(_net_inet_ip, OID_AUTO, fastforwarding, CTLFLAG_RW, &VNET_NAME(ipfastforward_active), 0, "Enable fast IP forwarding"); +void +rtalloc_ign_fib_nolock(struct route *ro, u_long ignore, u_int fibnum); + +void +rtalloc_ign_fib_nolock(struct route *ro, u_long ignore, u_int fibnum) +{ + struct rtentry *rt; + + if ((rt = ro->ro_rt) != NULL) { + if (rt->rt_ifp != NULL && rt->rt_flags & RTF_UP) + return; + ro->ro_rt = NULL; + } + ro->ro_rt = rtalloc1_fib_nolock(&ro->ro_dst, 1, ignore, fibnum); +} + static struct sockaddr_in * ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m) { @@ -126,7 +142,7 @@ ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m) dst->sin_family = AF_INET; dst->sin_len = sizeof(*dst); dst->sin_addr.s_addr = dest.s_addr; - in_rtalloc_ign(ro, 0, M_GETFIB(m)); + rtalloc_ign_fib_nolock(ro, 0, M_GETFIB(m)); /* * Route there and interface still up? @@ -140,8 +156,10 @@ ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m) } else { IPSTAT_INC(ips_noroute); IPSTAT_INC(ips_cantforward); +#if 0 if (rt) RTFREE(rt); +#endif icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_HOST, 0, 0); return NULL; } @@ -334,10 +352,11 @@ ip_fastforward(struct mbuf *m) if (in_localip(ip->ip_dst)) return m; - IPSTAT_INC(ips_total); + //IPSTAT_INC(ips_total); /* * Step 3: incoming packet firewall processing + in_rtalloc_ign(ro, 0, M_GETFIB(m)); */ /* @@ -476,8 +495,10 @@ forwardlocal: * "ours"-label. 
*/ m->m_flags |= M_FASTFWD_OURS; +/* if (ro.ro_rt) RTFREE(ro.ro_rt); +*/ return m; } /* @@ -490,7 +511,7 @@ forwardlocal: m_tag_delete(m, fwd_tag); } #endif /* IPFIREWALL_FORWARD */ - RTFREE(ro.ro_rt); +// RTFREE(ro.ro_rt); if ((dst = ip_findroute(&ro, dest, m)) == NULL) return NULL; /* icmp unreach already sent */ ifp = ro.ro_rt->rt_ifp; @@ -601,17 +622,21 @@ passout: if (error != 0) IPSTAT_INC(ips_odropped); else { +#if 0 ro.ro_rt->rt_rmx.rmx_pksent++; IPSTAT_INC(ips_forward); IPSTAT_INC(ips_fastforward); +#endif } consumed: - RTFREE(ro.ro_rt); +// RTFREE(ro.ro_rt); return NULL; drop: if (m) m_freem(m); +/* if (ro.ro_rt) RTFREE(ro.ro_rt); +*/ return NULL; } diff --git a/sys/netinet6/in6_rmx.c b/sys/netinet6/in6_rmx.c index b526030..9aabe63 100644 --- a/sys/netinet6/in6_rmx.c +++ b/sys/netinet6/in6_rmx.c @@ -195,12 +195,12 @@ in6_matroute(void *v_arg, struct radix_node_head *head) struct rtentry *rt = (struct rtentry *)rn; if (rt) { - RT_LOCK(rt); + //RT_LOCK(rt); if (rt->rt_flags & RTPRF_OURS) { rt->rt_flags &= ~RTPRF_OURS; rt->rt_rmx.rmx_expire = 0; } - RT_UNLOCK(rt); + //RT_UNLOCK(rt); } return rn; } @@ -440,7 +440,7 @@ in6_inithead(void **head, int off) rnh = *head; rnh->rnh_addaddr = in6_addroute; - rnh->rnh_matchaddr = in6_matroute; + rnh->rnh_matchaddr = rn_match; if (V__in6_rt_was_here == 0) { callout_init(&V_rtq_timer6, CALLOUT_MPSAFE); --------------010308000904000207080306 Content-Type: text/plain; charset=UTF-8; name="81_radix_rmlock.diff" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="81_radix_rmlock.diff" commit 0e7cebd1753c3b77bdc00d728fbd5910c2d2afec Author: Charlie Root Date: Mon Apr 8 15:35:00 2013 +0000 Make radix use rmlock. 
diff --git a/sys/contrib/ipfilter/netinet/ip_compat.h b/sys/contrib/ipfilter/netinet/ip_compat.h index 31e5b11..5e74da4 100644 --- a/sys/contrib/ipfilter/netinet/ip_compat.h +++ b/sys/contrib/ipfilter/netinet/ip_compat.h @@ -870,6 +870,7 @@ typedef u_int32_t u_32_t; # if (__FreeBSD_version >= 500043) # include # if (__FreeBSD_version > 700014) +# include # include # define KRWLOCK_T struct rwlock # ifdef _KERNEL diff --git a/sys/contrib/pf/net/pf_table.c b/sys/contrib/pf/net/pf_table.c index 40c9f67..b1dd703 100644 --- a/sys/contrib/pf/net/pf_table.c +++ b/sys/contrib/pf/net/pf_table.c @@ -44,6 +44,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #ifdef __FreeBSD__ #include diff --git a/sys/kern/subr_witness.c b/sys/kern/subr_witness.c index e565d01..f913d27 100644 --- a/sys/kern/subr_witness.c +++ b/sys/kern/subr_witness.c @@ -508,7 +508,7 @@ static struct witness_order_list_entry order_lists[] = { * Routing */ { "so_rcv", &lock_class_mtx_sleep }, - { "radix node head", &lock_class_rw }, + { "radix node head", &lock_class_rm }, { "rtentry", &lock_class_mtx_sleep }, { "ifaddr", &lock_class_mtx_sleep }, { NULL, NULL }, diff --git a/sys/kern/sys_socket.c b/sys/kern/sys_socket.c index 4cbae74..fea12d0 100644 --- a/sys/kern/sys_socket.c +++ b/sys/kern/sys_socket.c @@ -50,6 +50,8 @@ __FBSDID("$FreeBSD$"); #include #include +#include +#include #include #include diff --git a/sys/kern/vfs_export.c b/sys/kern/vfs_export.c index 4185211..848c232 100644 --- a/sys/kern/vfs_export.c +++ b/sys/kern/vfs_export.c @@ -47,7 +47,7 @@ __FBSDID("$FreeBSD$"); #include #include #include -#include +#include #include #include #include @@ -427,6 +427,7 @@ vfs_export_lookup(struct mount *mp, struct sockaddr *nam) register struct netcred *np; register struct radix_node_head *rnh; struct sockaddr *saddr; + RADIX_NODE_HEAD_READER; nep = mp->mnt_export; if (nep == NULL) diff --git a/sys/net/if.c b/sys/net/if.c index 5ecde8c..351e046 100644 --- a/sys/net/if.c +++ 
b/sys/net/if.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include #include diff --git a/sys/net/radix.c b/sys/net/radix.c index 33fcf82..d8d1e8b 100644 --- a/sys/net/radix.c +++ b/sys/net/radix.c @@ -37,7 +37,7 @@ #ifdef _KERNEL #include #include -#include +#include #include #include #include diff --git a/sys/net/radix.h b/sys/net/radix.h index 29659b5..2d130f0 100644 --- a/sys/net/radix.h +++ b/sys/net/radix.h @@ -36,7 +36,7 @@ #ifdef _KERNEL #include #include -#include +#include #endif #ifdef MALLOC_DECLARE @@ -133,7 +133,7 @@ struct radix_node_head { struct radix_node rnh_nodes[3]; /* empty tree for common case */ int rnh_multipath; /* multipath capable ? */ #ifdef _KERNEL - struct rwlock rnh_lock; /* locks entire radix tree */ + struct rmlock rnh_lock; /* locks entire radix tree */ #endif }; @@ -146,18 +146,21 @@ struct radix_node_head { #define R_Zalloc(p, t, n) (p = (t) malloc((unsigned long)(n), M_RTABLE, M_NOWAIT | M_ZERO)) #define Free(p) free((caddr_t)p, M_RTABLE); +#define RADIX_NODE_HEAD_READER struct rm_priotracker tracker #define RADIX_NODE_HEAD_LOCK_INIT(rnh) \ - rw_init_flags(&(rnh)->rnh_lock, "radix node head", 0) -#define RADIX_NODE_HEAD_LOCK(rnh) rw_wlock(&(rnh)->rnh_lock) -#define RADIX_NODE_HEAD_UNLOCK(rnh) rw_wunlock(&(rnh)->rnh_lock) -#define RADIX_NODE_HEAD_RLOCK(rnh) rw_rlock(&(rnh)->rnh_lock) -#define RADIX_NODE_HEAD_RUNLOCK(rnh) rw_runlock(&(rnh)->rnh_lock) -#define RADIX_NODE_HEAD_LOCK_TRY_UPGRADE(rnh) rw_try_upgrade(&(rnh)->rnh_lock) - - -#define RADIX_NODE_HEAD_DESTROY(rnh) rw_destroy(&(rnh)->rnh_lock) -#define RADIX_NODE_HEAD_LOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_LOCKED) -#define RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_WLOCKED) + rm_init(&(rnh)->rnh_lock, "radix node head") +#define RADIX_NODE_HEAD_LOCK(rnh) rm_wlock(&(rnh)->rnh_lock) +#define RADIX_NODE_HEAD_UNLOCK(rnh) rm_wunlock(&(rnh)->rnh_lock) +#define RADIX_NODE_HEAD_RLOCK(rnh) rm_rlock(&(rnh)->rnh_lock, &tracker) 
+#define RADIX_NODE_HEAD_RUNLOCK(rnh) rm_runlock(&(rnh)->rnh_lock, &tracker) +//#define RADIX_NODE_HEAD_LOCK_TRY_UPGRADE(rnh) rw_try_upgrade(&(rnh)->rnh_lock) + + +#define RADIX_NODE_HEAD_DESTROY(rnh) rm_destroy(&(rnh)->rnh_lock) +#define RADIX_NODE_HEAD_LOCK_ASSERT(rnh) +#define RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) +//#define RADIX_NODE_HEAD_LOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_LOCKED) +//#define RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_WLOCKED) #endif /* _KERNEL */ void rn_init(int); diff --git a/sys/net/radix_mpath.c b/sys/net/radix_mpath.c index ee7826f..c69888e 100644 --- a/sys/net/radix_mpath.c +++ b/sys/net/radix_mpath.c @@ -45,6 +45,8 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include +#include #include #include #include diff --git a/sys/net/route.c b/sys/net/route.c index 5d56688..2cf6ea5 100644 --- a/sys/net/route.c +++ b/sys/net/route.c @@ -52,6 +52,8 @@ #include #include #include +#include +#include #include #include @@ -544,6 +546,7 @@ rtalloc1_fib_nolock(struct sockaddr *dst, int report, u_long ignflags, struct rtentry *newrt; struct rt_addrinfo info; int err = 0, msgtype = RTM_MISS; + RADIX_NODE_HEAD_READER; int needlock; KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum")); @@ -612,6 +615,7 @@ rtalloc1_fib(struct sockaddr *dst, int report, u_long ignflags, struct rtentry *newrt; struct rt_addrinfo info; int err = 0, msgtype = RTM_MISS; + RADIX_NODE_HEAD_READER; int needlock; KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum")); @@ -799,6 +803,7 @@ rtredirect_fib(struct sockaddr *dst, struct rt_addrinfo info; struct ifaddr *ifa; struct radix_node_head *rnh; + RADIX_NODE_HEAD_READER; ifa = NULL; rnh = rt_tables_get_rnh(fibnum, dst->sa_family); diff --git a/sys/net/rtsock.c b/sys/net/rtsock.c index 58c46a6..18d3e06 100644 --- a/sys/net/rtsock.c +++ b/sys/net/rtsock.c @@ -45,6 +45,7 @@ #include #include #include +#include #include #include #include @@ -577,6 +578,7 @@ route_output(struct mbuf 
*m, struct socket *so) struct ifnet *ifp = NULL; union sockaddr_union saun; sa_family_t saf = AF_UNSPEC; + RADIX_NODE_HEAD_READER; #define senderr(e) { error = e; goto flush;} if (m == NULL || ((m->m_len < sizeof(long)) && @@ -1818,6 +1820,7 @@ sysctl_rtsock(SYSCTL_HANDLER_ARGS) int i, lim, error = EINVAL; u_char af; struct walkarg w; + RADIX_NODE_HEAD_READER; name ++; namelen--; diff --git a/sys/netinet/in_rmx.c b/sys/netinet/in_rmx.c index 1c9d9db..775ba5a 100644 --- a/sys/netinet/in_rmx.c +++ b/sys/netinet/in_rmx.c @@ -53,6 +53,8 @@ __FBSDID("$FreeBSD$"); #include #include +#include +#include #include #include diff --git a/sys/netinet6/in6_ifattach.c b/sys/netinet6/in6_ifattach.c index 80eb022..cbfe1d8 100644 --- a/sys/netinet6/in6_ifattach.c +++ b/sys/netinet6/in6_ifattach.c @@ -42,6 +42,8 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include +#include #include #include diff --git a/sys/netinet6/in6_rmx.c b/sys/netinet6/in6_rmx.c index 9aabe63..a291db2 100644 --- a/sys/netinet6/in6_rmx.c +++ b/sys/netinet6/in6_rmx.c @@ -84,6 +84,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include #include diff --git a/sys/netinet6/nd6_rtr.c b/sys/netinet6/nd6_rtr.c index 687d84d..7737d47 100644 --- a/sys/netinet6/nd6_rtr.c +++ b/sys/netinet6/nd6_rtr.c @@ -45,6 +45,7 @@ __FBSDID("$FreeBSD: stable/8/sys/netinet6/nd6_rtr.c 233201 2012-03-19 20:49:42Z #include #include #include +#include #include #include #include --------------010308000904000207080306 Content-Type: text/plain; charset=UTF-8; name="11_no_lle_rlock.diff" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="11_no_lle_rlock.diff" commit 963196095589c03880ddd13a5c16f9e50cf6d7ce Author: Charlie Root Date: Sun Nov 4 15:52:50 2012 +0000 Do not require locking arp lle diff --git a/sys/net/if_llatbl.h b/sys/net/if_llatbl.h index 9f6531b..c1b2af9 100644 --- a/sys/net/if_llatbl.h +++ b/sys/net/if_llatbl.h @@ -169,6 +169,7 @@ MALLOC_DECLARE(M_LLTABLE); #define 
LLE_PUB 0x0020 /* publish entry ??? */ #define LLE_DELETE 0x4000 /* delete on a lookup - match LLE_IFADDR */ #define LLE_CREATE 0x8000 /* create on a lookup miss */ +#define LLE_UNLOCKED 0x1000 /* return lle unlocked */ #define LLE_EXCLUSIVE 0x2000 /* return lle xlocked */ #define LLATBL_HASH(key, mask) \ diff --git a/sys/netinet/if_ether.c b/sys/netinet/if_ether.c index f61b803..ecb9b8e 100644 --- a/sys/netinet/if_ether.c +++ b/sys/netinet/if_ether.c @@ -283,10 +283,10 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0, struct mbuf *m, struct sockaddr *dst, u_char *desten, struct llentry **lle) { struct llentry *la = 0; - u_int flags = 0; + u_int flags = LLE_UNLOCKED; struct mbuf *curr = NULL; struct mbuf *next = NULL; - int error, renew; + int error, renew = 0; *lle = NULL; if (m != NULL) { @@ -307,7 +307,41 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0, struct mbuf *m, retry: IF_AFDATA_RLOCK(ifp); la = lla_lookup(LLTABLE(ifp), flags, dst); + + /* + * Fast path. Do not require rlock on llentry. + */ + if ((la != NULL) && (flags & LLE_UNLOCKED)) { + if ((la->la_flags & LLE_VALID) && + ((la->la_flags & LLE_STATIC) || la->la_expire > time_uptime)) { + bcopy(&la->ll_addr, desten, ifp->if_addrlen); + /* + * If entry has an expiry time and it is approaching, + * see if we need to send an ARP request within this + * arpt_down interval. 
+ */ + if (!(la->la_flags & LLE_STATIC) && + time_uptime + la->la_preempt > la->la_expire) { + renew = 1; + la->la_preempt--; + } + + IF_AFDATA_RUNLOCK(ifp); + if (renew != 0) + arprequest(ifp, NULL, &SIN(dst)->sin_addr, NULL); + + return (0); + } + + /* Revert to normal path for other cases */ + *lle = la; + LLE_RLOCK(la); + } + + flags &= ~LLE_UNLOCKED; + IF_AFDATA_RUNLOCK(ifp); + if ((la == NULL) && ((flags & LLE_EXCLUSIVE) == 0) && ((ifp->if_flags & (IFF_NOARP | IFF_STATICARP)) == 0)) { flags |= (LLE_CREATE | LLE_EXCLUSIVE); @@ -324,27 +358,6 @@ retry: return (EINVAL); } - if ((la->la_flags & LLE_VALID) && - ((la->la_flags & LLE_STATIC) || la->la_expire > time_second)) { - bcopy(&la->ll_addr, desten, ifp->if_addrlen); - /* - * If entry has an expiry time and it is approaching, - * see if we need to send an ARP request within this - * arpt_down interval. - */ - if (!(la->la_flags & LLE_STATIC) && - time_second + la->la_preempt > la->la_expire) { - arprequest(ifp, NULL, - &SIN(dst)->sin_addr, IF_LLADDR(ifp)); - - la->la_preempt--; - } - - *lle = la; - error = 0; - goto done; - } - if (la->la_flags & LLE_STATIC) { /* should not happen! 
 */
		log(LOG_DEBUG, "arpresolve: ouch, empty static llinfo for %s\n",
		    inet_ntoa(SIN(dst)->sin_addr));
diff --git a/sys/netinet/in.c b/sys/netinet/in.c
index eaba4e5..5341918 100644
--- a/sys/netinet/in.c
+++ b/sys/netinet/in.c
@@ -1561,7 +1561,7 @@ in_lltable_lookup(struct lltable *llt, u_int flags, const struct sockaddr *l3add
 	if (LLE_IS_VALID(lle)) {
 		if (flags & LLE_EXCLUSIVE)
 			LLE_WLOCK(lle);
-		else
+		else if (!(flags & LLE_UNLOCKED))
 			LLE_RLOCK(lle);
 	}
 done:
--------------010308000904000207080306--

From owner-freebsd-arch@FreeBSD.ORG Wed Aug 28 19:37:12 2013
Date: Wed, 28 Aug 2013 12:37:10 -0700
Subject: Re: Network stack changes
From: Jack Vogel
To: "Alexander V. Chernikov"
Cc: Adrian Chadd, Andre Oppermann, FreeBSD Hackers, FreeBSD Net, Luigi Rizzo,
 "Andrey V. Elsukov", Gleb Smirnoff, freebsd-arch@freebsd.org

Very interesting material Alexander, only had time to glance at it now,
will look in more depth later, thanks!

Jack


On Wed, Aug 28, 2013 at 11:30 AM, Alexander V. Chernikov <melifaro@yandex-team.ru> wrote:

> Hello list!
>
> There are a lot of constantly arising discussions related to networking
> stack performance/changes.
>
> I'll try to summarize the current problems and possible solutions from my
> point of view.
> (Generally this is one problem: the stack is slooooooooooooooooooooooooooow,
> but we need to know why and what to do about it.)
>
> Let's start with the current IPv4 packet flow on a typical router:
> http://static.ipfw.ru/images/freebsd_ipv4_flow.png
>
> (I'm sorry I can't provide this as text since Visio doesn't have any
> 'ascii-art' exporter.)
>
> Note that we are using a process-to-completion model, i.e. we process any
> packet in the ISR until it is either
> consumed by the L4+ stack, dropped, or put on an egress NIC queue.
>
> (There is also a deferred ISR model implemented inside netisr, but it does
> not change much:
> it can help to do more fine-grained hashing (for GRE or other similar
> traffic), but
> 1) it uses per-packet mutex locking, which kills all performance
> 2) it currently does not have _any_ hashing functions (see the absence of
> flags in `netstat -Q`)
> People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or a
> modified PPPoE/GRE version)
> report some benefit, but without fixing (1) it can't help much
> )
>
> So, let's start:
>
> 1) ixgbe uses a mutex to protect each RX ring, which is perfectly fine since
> there is nearly no contention
> (the only thing that can happen is driver reconfiguration, which is rare
> and, more significantly, we do it once
> for the batch of packets received in a given interrupt). However, due to
> some (im)possible deadlocks the current code
> does a per-packet ring unlock/lock (see ixgbe_rx_input()).
> There was a discussion that ended with nothing:
> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
>
> 1*) Possible BPF users. Here we have one rlock if there are any readers
> present
> (and a mutex for any matching packets, but this is more or less OK.
> Additionally, there is WIP to implement multiqueue BPF,
> and there is a chance that we can reduce lock contention there.) There is
> also an "optimize_writers" hack permitting applications
> like CDP to use BPF as writers without registering them as receivers
> (which would imply the rlock).
>
> 2/3) Virtual interfaces (laggs/vlans over lagg and other similar
> constructions).
> Currently we simply use an rlock to make s/ix0/lagg0/ and, what is much
> funnier, we use a complex vlan_hash with another rlock to
> get the vlan interface from the underlying one.
>
> This is definitely not how things should be done, and this can be changed
> more or less easily.
>
> There are some useful terms/techniques in the world of software/hardware
> routing: they have a clear 'control plane' and 'data plane' separation.
> The former deals with control traffic (IGP, MLD, IGMP snooping, lagg
> hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, with
> options, destined to hosts without an ARP/NDP record, and similar). The
> latter is done in hardware (or an efficient software implementation).
> The control plane is responsible for providing data for efficient data
> plane operation. This is the point we are missing nearly everywhere.
>
> What I want to say is: lagg is pure control-plane stuff, and vlan is nearly
> the same. We can't apply this approach to complex cases like
> lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0),
> but we definitely can do it for the most common setups like (igb* or ix* in
> a lagg, with or without vlans on top of the lagg).
>
> We already have some capabilities like VLANHWFILTER/VLANHWTAG, and we can
> add some more. We even have per-driver hooks to program HW filtering.
>
> One small step to take is to throw the packet to the vlan interface
> directly (P1); proof-of-concept (working in production):
> http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
>
> Another is to change lagg packet accounting:
> http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
> Again, this is more like what HW boxes do (aggregate all counters including
> errors) (and I can't imagine what real error we can get from _lagg_).
>
> 4) If we are a router, we can do either the slooow ip_input() ->
> ip_forward() -> ip_output() cycle, or use the optimized ip_fastfwd(), which
> falls back to the 'slow' path for multicast/options/local traffic (i.e. it
> works exactly like the 'data plane' part).
> (Btw, we can consider turning net.inet.ip.fastforwarding on by default, at
> least for non-IPSEC kernels.)
>
> Here we have to determine whether this is a local packet or not, i.e. a
> function F(dst_ip)
> returning 1 or 0.
Currently we are simply using the standard rlock + a hash of
> interface addresses.
> (And some consumers like ipfw(4) do the same, but without the lock.)
> We don't need to do this! We can build a sorted array of IPv4 addresses (or
> another efficient structure) on every address change and use it unlocked,
> with delayed garbage collection (proof-of-concept attached).
> (There is another thing to discuss: maybe we can do this once somewhere in
> ip_input and mark the mbuf as 'local/non-local'?)
>
> 5, 9) Currently we have L3 ingress/egress PFIL hooks protected by rmlocks.
> This is OK.
>
> However, 6) and 7) are not.
> The firewall can use the same pfil lock as reader protection without
> imposing its own lock. The pfil & ipfw code is already prepared to do this.
>
> 8) The radix/rt* API. This is probably the worst place in the entire stack.
> It is too generic, too slow, and buggy (do you use IPv6? you definitely
> know what I'm talking about).
> A) It really is too generic, and the assumption that it can be (effectively)
> used for every family is wrong. Two examples:
> we don't need to look up all 128 bits of an IPv6 address. Subnets with
> masks longer than /64 are not widely used (actually the only reason to use
> them is p2p links, due to potential ND problems).
> One common solution is to look up 64 bits and build another trie (or other
> structure) in case of collision.
> Another example is MPLS, where we can simply do a direct array lookup based
> on the ingress label.
>
> B) It is terribly slow (AFAIR luigi@ did some performance measurements;
> numbers are available in one of the netmap PDFs).
> C) It is not multipath-capable. Stateful (and non-working) multipath is
> definitely not the right way.
>
> 8*) rtentry
> We are doing it wrong.
> Currently _every_ lookup locks/unlocks a given rte twice.
> The first lock is related to an old, old story about trusting IP redirects
> (and auto-adding host routes for them). Fortunately, it is currently
> disabled automatically when you turn forwarding on.
> The second one is much more complicated: we are assuming that rte's with a
> non-zero refcount value can stop the egress interface from being destroyed.
> This is a wrong (but widely used) assumption.
>
> We can use delayed GC instead of locking for rte's, and this won't break
> things more than they are broken now (patch attached).
> We can't do the same for ifp structures since
> a) virtual ones can assume some state in the underlying physical NIC
> b) physical ones just _can_ be destroyed (possibly regardless of whether
> the user wants this or not, e.g. an SFP being unplugged from the NIC) or
> can simply lead to a kernel crash due to SW/HW inconsistency
>
> One possible solution is to implement stable refcounts based on PCPU
> counters and apply those counters to the ifp, but this seems non-trivial.
>
>
> Another rtalloc(9) problem is the fact that the radix trie is used as both
> the 'control plane' and the 'data plane' structure/API. Some users always
> want to put more information in the rte, while others
> want to make the rte more compact. We just need _different_ structures for
> that:
> a feature-rich, lots-of-data control plane one (to store everything we want
> to store, including, for example, the PID of the process originating the
> route) - the current radix trie can be modified to do this -
> and another, address-family-dependent structure (array, trie, or anything
> else) which contains _only_ the data necessary to put the packet on the
> wire.
>
> 11) arpresolve. Currently (this was decoupled in 8.x) we have
> a) the ifaddr rlock
> b) the lle rlock.
>
> We don't need those locks.
> We need to
> a) make the lle layer per-interface instead of global (this can also solve
> the issue of multiple FIBs having their L2 mappings done in fib 0)
> b) use the rtalloc(9)-provided lock instead of separate locking
> c) actually, we need to rewrite this layer, because
> d) lle is actually the place to do real multipath:
>
> Briefly: you have an rte pointing to some special nexthop structure pointing
> to an lle, which has the following data:
> num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to prepend
> to the header
> A separate post will follow.
>
> With this, we can achieve lagg traffic distribution without actually using
> lagg_transmit and similar stuff (at least in the most common scenarios).
> (For example, TCP output can definitely benefit from this, since we can
> compute the flowid once per TCP session and use it in every mbuf.)
>
> So. Imagine we have done all this. How can we estimate the difference?
>
> There was a thread, started a year ago, describing 'stock' performance and
> the difference for various modifications.
> It was done on 8.x; however, I've got similar results on recent 9.x:
>
> http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html
>
> Briefly:
>
> 2xE5645 @ Intel 82599 NIC.
> Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
> no firewall. Ixia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends
> 64-byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations
> in vlan11 (10.100.1.128 - 10.100.1.192). Static ARP entries are configured
> for all destination addresses. The traffic level is slightly above or
> slightly below system capacity.
>
> We start from 1.4 MPPS (if we are using several routes to minimize mutex
> contention).
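[Editor's note: the nexthop layout described above might look roughly like the sketch below. All names (`nexthop`, `nexthop_select`) are invented; the flowid hash is one arbitrary choice. The property it illustrates is the one the text relies on: one flow always maps to the same egress interface, so a flowid computed once per TCP session is enough.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical nexthop structure: an rte points at one of these, and the
 * egress interface is picked by hashing the per-flow id over the list of
 * lagg member interfaces.
 */
#define NH_MAX_IFACES 4

struct nexthop {
	int      num_egress_ifaces;
	uint16_t ifindex[NH_MAX_IFACES];
	/* L2 data to prepend to the header would follow here. */
};

/* Multiplicative (Fibonacci) hash spreads consecutive flowids well. */
static inline uint32_t
flowid_hash(uint32_t flowid)
{
	return flowid * 2654435761u;
}

uint16_t
nexthop_select(const struct nexthop *nh, uint32_t flowid)
{
	return nh->ifindex[flowid_hash(flowid) % nh->num_egress_ifaces];
}
```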
> My 'current' results for the same test, on the same HW, with the following
> modifications:
>
> * 1) ixgbe per-packet ring unlock removed
> * P1) ixgbe modified to do direct vlan input (so 2, 3 are not used)
> * 4) separate lockless in_localip() version
> * 6) using the existing pfil lock
> * 7) using the lockless version
> * 8) radix converted to use rmlock instead of rlock; delayed GC is used
> instead of mutexes
> * 10) using the existing pfil lock
> * 11) using the radix lock to do arpresolve(); not using the lle rlock
>
> (so rmlocks are the only locks used on the data path).
>
> Additionally: ipstat counters are converted to PCPU (no real performance
> implications); ixgbe does not do per-packet accounting (as in head); if_vlan
> counters are converted to PCPU; lagg is converted to rmlock, and per-packet
> accounting is removed (using stats from the underlying interfaces); the lle
> hash size is bumped to 1024 instead of 32 (not applicable here, but 32 slows
> things down for large L2 domains).
>
> The result is 5.6 MPPS for a single port (11 cores) and 6.5 MPPS for lagg
> (16 cores), nearly the same with HT on and 22 cores.
>
> ..while Intel DPDK claims 80 MPPS (and 6WINDGate talks about 160 or so) on
> same-class hardware, with _userland_ forwarding.
>
> One of the key features making all such products (DPDK, netmap,
> PacketShader, Cisco SW forwarding) possible is the use of batching instead
> of the process-to-completion model.
> Batching mitigates locking cost, batching does not wash out the CPU cache,
> and so on.
>
> So maybe we can consider passing batches from the NIC to at least the L2
> layer via netisr? Or even up to ip_input()?
>
> Another question is about making some sort of reliable GC ("passive
> serialization" or other similar hard-to-pronounce words about Linux and
> lockless objects).
>
> P.S. The attached patches are 1) for 8.x and 2) mostly 'hacks' showing
> roughly how this can be done and what benefit can be achieved.
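[Editor's note: the claim above that batching amortizes locking cost can be made concrete with a minimal sketch. The "lock" here is a stand-in that only counts acquisitions, and every name is invented; the point is purely the per-packet vs. per-batch locking pattern.]

```c
#include <assert.h>
#include <stddef.h>

/* A stand-in egress ring lock that just counts acquisitions. */
struct fake_lock { size_t acquisitions; };

static void ring_lock(struct fake_lock *l)   { l->acquisitions++; }
static void ring_unlock(struct fake_lock *l) { (void)l; }

struct pkt { int len; };

/* Process-to-completion: N packets cost N lock round-trips. */
size_t
tx_per_packet(struct fake_lock *l, const struct pkt *pkts, size_t n)
{
	size_t sent = 0;

	(void)pkts;
	for (size_t i = 0; i < n; i++) {
		ring_lock(l);
		sent++;			/* stand-in for ring enqueue */
		ring_unlock(l);
	}
	return sent;
}

/* Batched: the same N packets cost a single lock round-trip. */
size_t
tx_batched(struct fake_lock *l, const struct pkt *pkts, size_t n)
{
	size_t sent = 0;

	(void)pkts;
	ring_lock(l);
	for (size_t i = 0; i < n; i++)
		sent++;
	ring_unlock(l);
	return sent;
}
```

With real contended locks the gap widens further, since each acquisition is a cache-line bounce between CPUs, which is exactly the cost the batching products avoid.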
>
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"

From owner-freebsd-arch@FreeBSD.ORG Wed Aug 28 22:25:04 2013
Date: Thu, 29 Aug 2013 00:24:48 +0200
From: Andre Oppermann
Message-ID: <521E78B0.6080709@freebsd.org>
To: "Alexander V.
Chernikov" Subject: Re: Network stack changes
References: <521E41CB.30700@yandex-team.ru>
In-Reply-To: <521E41CB.30700@yandex-team.ru>
Cc: adrian@freebsd.org, freebsd-hackers@freebsd.org, FreeBSD Net, luigi@freebsd.org, ae@FreeBSD.org, Gleb Smirnoff, freebsd-arch@freebsd.org

On 28.08.2013 20:30, Alexander V. Chernikov wrote:
> Hello list!

Hello Alexander,

you sent quite a few things in the same email. I'll try to respond to as much as I can right now. Later you should split it up to have more in-depth discussions on the individual parts. If you could make it to the EuroBSDcon 2013 DevSummit, that would be even more awesome. Most of the active network stack people will be there too.

> There is a lot of constantly arising discussion related to networking stack
> performance/changes.
>
> I'll try to summarize the current problems and possible solutions from my
> point of view.
> (Generally this is one problem: the stack is slooooooooooooooooooooooooooow,
> but we need to know why and what to do.)

Compared to others it's not thaaaaaaat slow. ;)

> Let's start with the current IPv4 packet flow on a typical router:
> http://static.ipfw.ru/images/freebsd_ipv4_flow.png
>
> (I'm sorry I can't provide this as text, since Visio doesn't have an
> 'ascii-art' exporter.)
>
> Note that we are using a process-to-completion model, i.e. we process any
> packet in the ISR until it is either consumed by the L4+ stack, dropped, or
> put on an egress NIC queue.
> (There is also a deferred ISR model implemented inside netisr, but it does
> not change much: it can help to do more fine-grained hashing (for GRE or
> other similar traffic), but
> 1) it uses per-packet mutex locking, which kills all performance
> 2) it currently does not have _any_ hashing functions (see the absence of
> flags in `netstat -Q`)
> People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or a
> modified PPPoE/GRE version) report some profit, but without fixing (1) it
> can't help much.)
>
> So, let's start:
>
> 1) Ixgbe uses a mutex to protect each RX ring, which is perfectly fine since
> there is nearly no contention (the only thing that can happen is driver
> reconfiguration, which is rare and, more significantly, we do it once for
> the batch of packets received in a given interrupt). However, due to some
> (im)possible deadlocks the current code does a per-packet ring unlock/lock
> (see ixgbe_rx_input()).
> There was a discussion that ended with nothing:
> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
>
> 1*) Possible BPF users. Here we have one rlock if there are any readers
> present (and a mutex for any matching packets, but this is more or less OK).
> Additionally, there is WIP to implement multiqueue BPF, and there is a
> chance that we can reduce lock contention there.

Rlock to rmlock?

> There is also an "optimize_writers" hack permitting applications like CDP to
> use BPF as writers without registering them as receivers (which implies the
> rlock).

I believe longer term we should solve this with a protocol type "ethernet" so that one can send/receive ethernet frames through a normal socket.

> 2/3) Virtual interfaces (laggs/vlans over lagg and other similar
> constructions). Currently we simply use an rlock to do s/ix0/lagg0/ and,
> what is much funnier, we use a complex vlan_hash with another rlock to get
> the vlan interface from the underlying one.
>
> This is definitely not how things should be done, and it can be changed more
> or less easily.
Indeed.

> There are some useful terms/techniques in the world of software/hardware
> routing: they have a clear 'control plane' and 'data plane' separation.
> The former deals with control traffic (IGP, MLD, IGMP snooping, lagg hellos,
> ARP/NDP, etc.) and some data traffic (packets with TTL=1, with options,
> destined to hosts without an ARP/NDP record, and similar). The latter is
> done in hardware (or an efficient software implementation).
> The control plane is responsible for providing the data for efficient data
> plane operation. This is the point we are missing nearly everywhere.

ACK.

> What I want to say is: lagg is pure control-plane stuff, and vlan is nearly
> the same. We can't apply this approach to complex cases like
> lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0), but we definitely can
> do it for the most common setups, like igb* or ix* in a lagg, with or
> without vlans on top of the lagg.

ACK.

> We already have some capabilities like VLANHWFILTER/VLANHWTAG, and we can
> add more. We even have per-driver hooks to program HW filtering.

We could. Though for vlan it looks like it would be easier to remove the hardware vlan tag stripping and insertion. It only adds complexity in all drivers for no gain.

> One small step is to throw the packet to the vlan interface directly (P1);
> proof-of-concept (working in production):
> http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
>
> Another is to change lagg packet accounting:
> http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
> Again, this is more like what HW boxes do (aggregate all counters, including
> errors) (and I can't imagine what real error we could get from _lagg_).
>
> 4) If we are a router, we can do either the slooow ip_input() ->
> ip_forward() -> ip_output() cycle or use the optimized ip_fastfwd(), which
> falls back to the 'slow' path for multicast/options/local traffic (i.e. it
> works exactly like the 'data plane' part).
> (Btw, we can consider turning net.inet.ip.fastforwarding on by default, at
> least for non-IPSEC kernels.)

ACK.

> Here we have to determine whether this is a local packet or not, i.e.
> F(dst_ip) returning 1 or 0. Currently we are simply using the standard rlock
> + a hash of iface addresses. (And some consumers like ipfw(4) do the same,
> but without the lock.)
> We don't need to do this! We can build a sorted array of IPv4 addresses, or
> another efficient structure, on every address change and use it unlocked,
> with delayed garbage collection (proof-of-concept attached).

I'm a bit uneasy with unlocked access. On very weakly ordered architectures this could trip over cache coherency issues. An rmlock is essentially for free in the read case.

> (There is another thing to discuss: maybe we can do this once somewhere in
> ip_input and mark the mbuf as 'local/non-local'?)

The problem is that packet filters may change the destination address and thus can invalidate such a lookup.

> 5, 9) Currently we have L3 ingress/egress PFIL hooks protected by rmlocks.
> This is OK.
>
> However, 6) and 7) are not.
> A firewall can use the same pfil lock as reader protection without imposing
> its own lock. The pfil & ipfw code is currently ready to do this.

The problem with the global pfil rmlock is the comparatively long time it is held in a locked state. Also, packet filters may have to acquire additional locks when they have to modify state tables. Rmlocks are not made for that, because they pin the thread to the CPU it is currently on. This is what Gleb is complaining about.

My idea is to hold the pfil rmlock only for the lookup of the first/next packet filter that will run, not for the entire duration. That would solve the problem. However, packet filters then have to use their own locks again, which could be rmlocks too.

> 8) The radix/rt* API. This is probably the worst place in the entire stack.
> It is toooo generic, tooo slow and buggy (do you use IPv6? then you
> definitely know what I'm talking about).
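[Editor's note: Andre's "hold the rmlock only for the next-hook lookup" idea above could be sketched like this. The lock is a stand-in that just tracks hold state, and the hook list and names are invented; a real implementation additionally needs a lifetime scheme so a hook being executed cannot be freed underneath.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel rmlock, tracking only the hold depth. */
struct fake_rmlock { int held; };

static void rm_rlock(struct fake_rmlock *l)   { l->held++; }
static void rm_runlock(struct fake_rmlock *l) { l->held--; }

struct pfil_hook {
	int (*func)(struct fake_rmlock *, void *pkt); /* 0 = pass */
	struct pfil_hook *next;
};

static struct fake_rmlock pfil_lock;
static struct pfil_hook *pfil_head;

/* Hold the pfil lock only while reading the first/next hook pointer;
 * each filter runs with the pfil lock dropped and may lock internally. */
int
pfil_run_hooks(void *pkt)
{
	struct pfil_hook *h, *next;
	int error = 0;

	rm_rlock(&pfil_lock);
	h = pfil_head;
	rm_runlock(&pfil_lock);

	while (h != NULL && error == 0) {
		error = h->func(&pfil_lock, pkt); /* pfil lock not held */

		rm_rlock(&pfil_lock);
		next = h->next;
		rm_runlock(&pfil_lock);
		h = next;
	}
	return error;
}

/* Demo hooks: count calls and verify the pfil lock really is dropped. */
static int ncalls;
static int
hook_pass(struct fake_rmlock *l, void *pkt)
{
	(void)pkt;
	ncalls++;
	return l->held;		/* 0 only if the lock was dropped */
}

static int
hook_drop(struct fake_rmlock *l, void *pkt)
{
	(void)l; (void)pkt;
	return 1;		/* drop: stops the chain */
}
```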
> A) It really is too generic, and the assumption that it can be (effectively)
> used for every family is wrong. Two examples:
> we don't need to look up all 128 bits of an IPv6 address. Subnets with mask
> >/64 are not widely used (actually the only reason to use them is p2p links,
> due to potential ND problems).
> One common solution is to look up 64 bits and build another trie (or other
> structure) in case of collision.
> Another example is MPLS, where we can simply do a direct array lookup based
> on the ingress label.

Yes. While we shouldn't throw it out, it should be run as the RIB and allow a much more protocol-specific FIB for the hot packet path.

> B) It is terribly slow (AFAIR luigi@ did some performance measurements;
> numbers are available in one of the netmap PDFs).

Again not thaaaat slow, but inefficient enough.

> C) It is not multipath-capable. Stateful (and non-working) multipath is
> definitely not the right way.

Indeed.

> 8*) rtentry
> We are doing it wrong.
> Currently _every_ lookup locks/unlocks the given rte twice.
> The first lock is related to an old-old story of trusting IP redirects (and
> auto-adding host routes for them). Fortunately, this is now disabled
> automatically when you turn forwarding on.

They're disabled.

> The second one is much more complicated: we assume that rte's with a
> non-zero refcount can stop the egress interface from being destroyed.
> This is a wrong (but widely used) assumption.

Not really. The reason for the refcount is not the ifp reference, but other code parts that may hold direct pointers to the rtentry and do direct dereferencing to access information in it.

> We can use delayed GC instead of locking for rte's, and this won't break
> things more than they are broken now (patch attached).

Nope. Delayed GC is not the way to go here. To do away with rtentry locking and refcounting, we have to change rtalloc(9) to return the information the caller wants (e.g. ifp, ia, others) and not the rtentry address anymore.
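[Editor's note: a copy-out style lookup along the lines Andre describes might look roughly like this. The linear table and every name are invented for illustration; a real FIB would use a per-family trie or array, as discussed above.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Instead of returning a refcounted, locked rtentry pointer, the lookup
 * copies the few fields the caller actually needs into a caller-owned
 * result, so no lock or refcount survives the call.
 */
struct rt_result {		/* what the caller wants, by value */
	uint32_t gateway;
	uint16_t ifindex;
	uint16_t mtu;
};

struct rt_table_entry {
	uint32_t prefix;	/* host byte order */
	uint8_t  plen;
	struct rt_result res;
};

/* Longest-prefix match over a tiny table; returns 0 on hit, -1 on miss. */
int
rtlookup(const struct rt_table_entry *tbl, size_t n, uint32_t dst,
    struct rt_result *out)
{
	int best = -1, bestlen = -1;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = tbl[i].plen == 0 ?
		    0 : 0xffffffffu << (32 - tbl[i].plen);
		if ((dst & mask) == tbl[i].prefix && tbl[i].plen > bestlen) {
			best = (int)i;
			bestlen = tbl[i].plen;
		}
	}
	if (best < 0)
		return -1;
	*out = tbl[best].res;	/* copy out: no pointer into the table */
	return 0;
}
```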
So instead of rtalloc() we have rtlookup().

> We can't do the same for ifp structures, since
> a) virtual ones can assume some state in the underlying physical NIC
> b) physical ones just _can_ be destroyed (regardless of whether the user
> wants this or not, e.g. an SFP being unplugged from the NIC) or can simply
> lead to a kernel crash due to SW/HW inconsistency.

Here I actually believe we can do a GC or stable-storage based approach. Ifp pointers are kept in too many places and properly refcounting them is very (too) hard. So whenever an interface gets destroyed or disappears, its callable function pointers are replaced with dummies returning an error. The ifp in memory will stay for some time and may even be reused for another new interface later (Cisco does it that way in their IOS).

> One possible solution is to implement stable refcounts based on PCPU
> counters, and apply those counters to ifp, but this seems to be non-trivial.
>
> Another rtalloc(9) problem is the fact that the radix tree is used as both
> the 'control plane' and 'data plane' structure/API. Some users always want
> to put more information in an rte, while others want to make rte more
> compact. We just need _different_ structures for that.

ACK.

> A feature-rich, lots-of-data control-plane one (to store everything we want
> to store, including, for example, the PID of the process originating the
> route) - the current radix tree can be modified to do this.
> And another, address-family-dependent structure (array, trie, or anything
> else) which contains _only_ the data necessary to put a packet on the wire.

ACK.

> 11) arpresolve. Currently (this was decoupled in 8.x) we have
> a) the ifaddr rlock
> b) the lle rlock.
>
> We don't need those locks.
> We need to
> a) make the lle layer per-interface instead of global (this can also solve
> the issue of multiple FIBs having their L2 mappings done in fib 0)

Yes!

> b) use the rtalloc(9)-provided lock instead of separate locking

No. Interface rmlock.
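[Editor's note: back to the ifp lifetime problem from point 8* above - the stable-storage approach Andre describes could be sketched as below. The struct and function names are invented, and only one method pointer is shown.]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * When an interface goes away, its function pointers are swapped for
 * stubs that return an error, and the ifp memory itself stays around.
 * Stale pointers held elsewhere in the stack then call a harmless dummy
 * instead of crashing on freed memory.
 */
struct sk_ifnet {
	int  alive;
	int (*if_transmit)(struct sk_ifnet *, void *pkt);
};

static int
real_transmit(struct sk_ifnet *ifp, void *pkt)
{
	(void)ifp; (void)pkt;
	return 0;		/* pretend the packet was sent */
}

static int
dead_transmit(struct sk_ifnet *ifp, void *pkt)
{
	(void)ifp; (void)pkt;
	return ENXIO;		/* device no longer present */
}

void
ifnet_attach(struct sk_ifnet *ifp)
{
	ifp->alive = 1;
	ifp->if_transmit = real_transmit;
}

/* Detach: do NOT free the ifp; just neuter its methods. */
void
ifnet_detach(struct sk_ifnet *ifp)
{
	ifp->alive = 0;
	ifp->if_transmit = dead_transmit;
}
```

The memory can later be reclaimed by delayed GC, or reused for a new interface once nothing can plausibly still hold the old pointer.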
> c) actually, we need to rewrite this layer, because
> d) lle is actually the place to do real multipath:

No; you can do multipath through more than one interface. If lle is per-interface, that won't work, and it is not the right place.

> Briefly: you have an rte pointing to some special nexthop structure pointing
> to an lle, which has the following data:
> num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to prepend
> to the header
> A separate post will follow.

This should be part of the RIB/FIB, which selects one of the ifp+nexthops to return on lookup.

> With this, we can achieve lagg traffic distribution without actually using
> lagg_transmit and similar stuff (at least in the most common scenarios).

This seems to be a rather nasty layering violation.

> (For example, TCP output can definitely benefit from this, since we can
> compute the flowid once per TCP session and use it in every mbuf.)
>
> So. Imagine we have done all this. How can we estimate the difference?
>
> There was a thread, started a year ago, describing 'stock' performance and
> the difference for various modifications.
> It was done on 8.x; however, I've got similar results on recent 9.x:
>
> http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html
>
> Briefly:
>
> 2xE5645 @ Intel 82599 NIC.
> Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, no
> firewall. Ixia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends 64-byte
> IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations in
> vlan11 (10.100.1.128 - 10.100.1.192). Static ARP entries are configured for
> all destination addresses. The traffic level is slightly above or slightly
> below system capacity.
>
> We start from 1.4 MPPS (if we are using several routes to minimize mutex
> contention).
> My 'current' results for the same test, on the same HW, with the following
> modifications:
>
> * 1) ixgbe per-packet ring unlock removed
> * P1) ixgbe modified to do direct vlan input (so 2, 3 are not used)
> * 4) separate lockless in_localip() version
> * 6) using the existing pfil lock
> * 7) using the lockless version
> * 8) radix converted to use rmlock instead of rlock; delayed GC is used
> instead of mutexes
> * 10) using the existing pfil lock
> * 11) using the radix lock to do arpresolve(); not using the lle rlock
>
> (so rmlocks are the only locks used on the data path).
>
> Additionally: ipstat counters are converted to PCPU (no real performance
> implications); ixgbe does not do per-packet accounting (as in head); if_vlan
> counters are converted to PCPU; lagg is converted to rmlock, and per-packet
> accounting is removed (using stats from the underlying interfaces); the lle
> hash size is bumped to 1024 instead of 32 (not applicable here, but 32 slows
> things down for large L2 domains).
>
> The result is 5.6 MPPS for a single port (11 cores) and 6.5 MPPS for lagg
> (16 cores), nearly the same with HT on and 22 cores.

That's quite good, but we want more. ;)

> ..while Intel DPDK claims 80 MPPS (and 6WINDGate talks about 160 or so) on
> same-class hardware, with _userland_ forwarding.

Those numbers sound a bit far out. Maybe if the packet isn't touched or looked at at all, in a pure netmap interface-to-interface bridging scenario. I don't believe these numbers.

> One of the key features making all such products (DPDK, netmap,
> PacketShader, Cisco SW forwarding) possible is the use of batching instead
> of the process-to-completion model.
> Batching mitigates locking cost, batching does not wash out the CPU cache,
> and so on.

The work has to be done eventually; batching doesn't relieve us of it. IMHO batch moving is only the last step we should look at. It makes the stack rather complicated and introduces other issues, like packet latency.
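[Editor's note: the PCPU counter conversions listed above (ipstat, if_vlan) follow a simple pattern worth showing. This is a hedged userspace sketch with invented names; the CPU id is passed explicitly where the kernel would use curcpu.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Per-CPU statistics counter: each CPU increments its own cache-line-
 * padded slot, so the hot path has no locking and no shared cache-line
 * bouncing.  Readers sum all slots only when stats are actually fetched.
 */
#define MAXCPU		32
#define CACHE_LINE	64

struct pcpu_counter {
	struct {
		uint64_t v;
		char pad[CACHE_LINE - sizeof(uint64_t)];
	} percpu[MAXCPU];
};

/* Hot path: touches only this CPU's slot. */
static inline void
pcpu_counter_add(struct pcpu_counter *c, int cpu, uint64_t n)
{
	c->percpu[cpu].v += n;
}

/* Slow path, used only when the stats are read (e.g. netstat). */
uint64_t
pcpu_counter_fetch(const struct pcpu_counter *c)
{
	uint64_t sum = 0;

	for (int i = 0; i < MAXCPU; i++)
		sum += c->percpu[i].v;
	return sum;
}
```

The trade-off matches the text: increments become nearly free, while reads get linearly more expensive in the number of CPUs, which is fine for statistics.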
> So maybe we can consider passing batches from the NIC to at least the L2
> layer via netisr? Or even up to ip_input()?

And then? You probably won't win much in the end (if the lock path is optimized).

> Another question is about making some sort of reliable GC ("passive
> serialization" or other similar hard-to-pronounce words about Linux and
> lockless objects).

Rmlocks are our secret weapon and just as good.

> P.S. The attached patches are 1) for 8.x and 2) mostly 'hacks' showing
> roughly how this can be done and what benefit can be achieved.

--
Andre

From owner-freebsd-arch@FreeBSD.ORG Thu Aug 29 01:30:37 2013
Date: Thu, 29 Aug 2013 05:32:41 +0400
From: Slawa Olhovchenkov
To: Andre Oppermann
Subject: Re: Network stack changes
Message-ID: <20130829013241.GB70584@zxy.spb.ru>
References: <521E41CB.30700@yandex-team.ru> <521E78B0.6080709@freebsd.org>
In-Reply-To: <521E78B0.6080709@freebsd.org>
Cc: "Alexander V.
Chernikov" , adrian@freebsd.org, freebsd-hackers@freebsd.org, freebsd-arch@freebsd.org, luigi@freebsd.org, ae@FreeBSD.org, Gleb Smirnoff, FreeBSD Net

On Thu, Aug 29, 2013 at 12:24:48AM +0200, Andre Oppermann wrote:
> > ..
> > while Intel DPDK claims 80 MPPS (and 6WINDGate talks about 160 or so) on
> > same-class hardware, with _userland_ forwarding.
>
> Those numbers sound a bit far out. Maybe if the packet isn't touched
> or looked at at all in a pure netmap interface to interface bridging
> scenario. I don't believe these numbers.

80 Mpps * 64 bytes * 8 bits = 40.96 Gb/s. Maybe DCA? And a CPU with 40 PCIe lanes and 4 memory channels.

From owner-freebsd-arch@FreeBSD.ORG Thu Aug 29 06:46:54 2013
X-Virus-Scanned: amavisd-new at
daemoninthecloset.org
Date: Thu, 29 Aug 2013 01:46:32 -0500 (CDT)
From: Bryan Venteicher
To: Andre Oppermann
Message-ID: <2112475076.435.1377758792082.JavaMail.root@daemoninthecloset.org>
In-Reply-To: <521E78B0.6080709@freebsd.org>
References: <521E41CB.30700@yandex-team.ru> <521E78B0.6080709@freebsd.org>
Subject: Re: Network stack changes

----- Original Message -----
> On 28.08.2013 20:30, Alexander V. Chernikov wrote:
> > Hello list!
>
> Hello Alexander,
>
> you sent quite a few things in the same email. I'll try to respond to as
> much as I can right now. Later you should split it up to have more in-depth
> discussions on the individual parts.
>
> > We already have some capabilities like VLANHWFILTER/VLANHWTAG, and we can
> > add some more. We even have per-driver hooks to program HW filtering.
>
> We could. Though for vlan it looks like it would be easier to remove the
> hardware vlan tag stripping and insertion. It only adds complexity in all
> drivers for no gain.

In the shorter term, can we remove the requirement for the parent interface to support IFCAP_VLAN_HWTAGGING in order to do checksum offloading on the VLAN interface (see vlan_capabilities())?
From owner-freebsd-arch@FreeBSD.ORG Thu Aug 29 11:49:34 2013
Date: Thu, 29 Aug 2013 04:49:31 -0700
From: Adrian Chadd
In-Reply-To: <521E41CB.30700@yandex-team.ru>
References: <521E41CB.30700@yandex-team.ru>
Subject: Re: Network stack changes
To: "Alexander V.
Chernikov"
Cc: Luigi Rizzo, Andre Oppermann, "freebsd-hackers@freebsd.org", FreeBSD Net, "Andrey V. Elsukov", Gleb Smirnoff, "freebsd-arch@freebsd.org"

Hi,

There's a lot of good stuff to review here, thanks!

Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to keep locking things like that on a per-packet basis. We should be able to do this in a cleaner way: we can defer RX into a CPU-pinned taskqueue and convert the interrupt handler to a fast handler that just schedules that taskqueue. We can ignore the ithread entirely here.

What do you think? Totally pie-in-the-sky handwaving at this point:

* create an array of mbuf pointers for completed mbufs;
* populate the mbuf array;
* pass the array up to ether_demux().

For vlan handling, it may end up populating its own list of mbufs to push up to ether_demux(). So maybe we should extend the API to take a bitmap of packets to actually handle from the array: we can pass up a larger array of mbufs, note which ones are for the destination, and then the upcall can mark which frames it has consumed.

I specifically wonder how much work/benefit we may see by doing:

* batching packets into lists, so various steps can batch-process things rather than run to completion;
* batching the processing of a list of frames under a single lock instance - e.g., if the forwarding code could do the forwarding lookup for 'n' packets under a single lock, then pass that list of frames up to inet_pfil_hook() to do the work under one lock, etc, etc.
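[Editor's note: the array-plus-bitmap upcall floated above could be sketched as below. The types and names are invented, and an arbitrary "consume packets with an even first byte" rule stands in for real demux logic; what matters is the shape of the API: caller marks what needs processing, upcall marks what it consumed, caller sweeps the rest.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * A batch of up to 64 packets plus two bitmaps: 'todo' says which entries
 * the upper layer should look at, 'consumed' is filled in by the upcall.
 */
struct pktbatch {
	void    *pkts[64];
	uint64_t todo;		/* bit i set: pkts[i] needs processing */
	uint64_t consumed;	/* bit i set: upcall took pkts[i] */
};

/* Example upcall: "consume" every packet whose first byte is even. */
void
demux_batch(struct pktbatch *b)
{
	for (int i = 0; i < 64; i++) {
		if (!(b->todo & (1ULL << i)))
			continue;
		uint8_t first = *(uint8_t *)b->pkts[i];
		if ((first & 1) == 0)
			b->consumed |= 1ULL << i;
	}
}

/* Caller-side sweep: count the frames that came back unconsumed. */
int
sweep_leftovers(const struct pktbatch *b)
{
	uint64_t left = b->todo & ~b->consumed;
	int n = 0;

	while (left) {
		n += left & 1;
		left >>= 1;
	}
	return n;
}
```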
Here, the processing would look less like "grab lock and process to completion" and more like "mark and sweep" - i.e., we have a list of frames that we mark as needing processing and mark as having been processed at each layer, so we know where to dispatch them next.

I still have some tool coding to do with PMC before I even think about tinkering with this, as I'd like to measure stuff like per-packet latency as well as top-level processing overhead (i.e., CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC interrupts on that core, etc.)

Thanks,

-adrian