From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 28 18:30:46 2013
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 624D2B52;
 Wed, 28 Aug 2013 18:30:46 +0000 (UTC)
 (envelope-from melifaro@yandex-team.ru)
Received: from forward-corp1e.mail.yandex.net (forward-corp1e.mail.yandex.net
 [IPv6:2a02:6b8:0:202::10])
 by mx1.freebsd.org (Postfix) with ESMTP id 3E5232D9C;
 Wed, 28 Aug 2013 18:30:45 +0000 (UTC)
Received: from smtpcorp4.mail.yandex.net (smtpcorp4.mail.yandex.net
 [95.108.252.2])
 by forward-corp1e.mail.yandex.net (Yandex) with ESMTP id 1CAD064006D;
 Wed, 28 Aug 2013 22:30:42 +0400 (MSK)
Received: from smtpcorp4.mail.yandex.net (localhost [127.0.0.1])
 by smtpcorp4.mail.yandex.net (Yandex) with ESMTP id 018FC2C0173;
 Wed, 28 Aug 2013 22:30:41 +0400 (MSK)
Received: from dhcp170-36-red.yandex.net (dhcp170-36-red.yandex.net
 [95.108.170.36])
 by smtpcorp4.mail.yandex.net (nwsmtp/Yandex) with ESMTP id fyMijftbk1-UfD0Ynrx;
 Wed, 28 Aug 2013 22:30:41 +0400
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yandex-team.ru;
 s=default; 
 t=1377714641; bh=AW6ufh6kIvAxiM+MsIJACctEm8ZbDDYBUX9jB61JP/4=;
 h=Message-ID:Date:From:User-Agent:MIME-Version:To:CC:Subject:
 Content-Type;
 b=u8yrzEOZBJx9qwddK0RtoZ49tkgibPrkr2JZ+Jf85+Jaj0CrmBmy77f0747mqE+AS
 w7TDf3gcCt+QbncXKKcIFCxas4ITtcPy4R4fhXsELRhDjBGz7WyBYNLhnt1KibM3je
 uhHqth/NVDpzXMr5toFTgvyh0qUB0BRcXtFa/P2I=
Authentication-Results: smtpcorp4.mail.yandex.net;
 dkim=pass header.i=@yandex-team.ru
Message-ID: <521E41CB.30700@yandex-team.ru>
Date: Wed, 28 Aug 2013 22:30:35 +0400
From: "Alexander V. Chernikov" <melifaro@yandex-team.ru>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:17.0) Gecko/20130418 Thunderbird/17.0.5
MIME-Version: 1.0
To: FreeBSD Net <net@freebsd.org>, freebsd-hackers@freebsd.org, 
 freebsd-arch@freebsd.org
Subject: Network stack changes
Content-Type: multipart/mixed; boundary="------------010308000904000207080306"
Cc: ae@FreeBSD.org, adrian@freebsd.org, Gleb Smirnoff <glebius@FreeBSD.org>,
 andre@freebsd.org, luigi@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 28 Aug 2013 18:30:46 -0000

This is a multi-part message in MIME format.
--------------010308000904000207080306
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hello list!

There is a lot constantly raising  discussions related to networking 
stack performance/changes.

I'll try to summarize current problems and possible solutions from my 
point of view.
(Generally this is one problem: stack is slooooooooooooooooooooooooooow, 
but we need to know why and what to do).

Let's start with current IPv4 packet flow on a typical router:
http://static.ipfw.ru/images/freebsd_ipv4_flow.png

(I'm sorry I can't provide this as text since Visio don't have any 
'ascii-art' exporter).

Note that we are using process-to-completion model, e.g. process any 
packet in ISR until it is either
consumed by L4+ stack or dropped or put to egress NIC queue.

(There is also deferred ISR model implemented inside netisr but it does 
not change much:
it can help to do more fine-grained hashing (for GRE or other similar 
traffic), but
1) it uses per-packet mutex locking which kills all performance
2) it currently does not have _any_ hashing functions (see absence of 
flags in `netstat -Q`)
People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or 
modified PPPoe/GRE version)
report some profit, but without fixing (1) it can't help much
)

So, let's start:

1) Ixgbe uses mutex to protect each RX ring which is perfectly fine 
since there is nearly no contention
(the only thing that can happen is driver reconfiguration which is rare 
and, more signifficant, we do this once
for the batch of packets received in given interrupt). However, due to 
some (im)possible deadlocks current code
does per-packet ring unlock/lock (see ixgbe_rx_input()).
There was a discussion ended with nothing: 
http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html

1*) Possible BPF users. Here we have one rlock if there are any readers 
present
(and mutex for any matching packets, but this is more or less OK. 
Additionally, there is WIP to implement multiqueue BPF
and there is chance that we can reduce lock contention there). There is 
also an "optimize_writers" hack permitting applications
like CDP to use BPF as writers but not registering them as receivers 
(which implies rlock)

2/3) Virtual interfaces (laggs/vlans over lagg and other simular 
constructions).
Currently we simply use rlock to make s/ix0/lagg0/ and, what is much 
more funny - we use complex vlan_hash with another rlock to
get vlan interface from underlying one.

This is definitely not like things should be done and this can be 
changed more or less easily.

There are some useful terms/techniques in world of software/hardware 
routing: they have clear 'control plane' and 'data plane' separation.
Former one is for dealing control traffic (IGP, MLD, IGMP snooping, lagg 
hellos, ARP/NDP, etc..) and some data traffic (packets with TTL=1, with 
options, destined to hosts without ARP/NDP record, and similar). Latter 
one is done in hardware (or effective software implementation).
Control plane is responsible to provide data for efficient data plane 
operations. This is the point we are missing nearly everywhere.

What I want to say is: lagg is pure control-plane stuff and vlan is 
nearly the same. We can't apply this approach to complex cases like 
lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
but we definitely can do this for most common setups like (igb* or ix* 
in lagg with or without vlans on top of lagg).

We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can 
add some more. We even have per-driver hooks to program HW filtering.

One small step to do is to throw packet to vlan interface directly (P1), 
proof-of-concept(working in production):
http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html

Another is to change lagg packet accounting: 
http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
Again, this is more like HW boxes do (aggregate all counters including 
errors) (and I can't imagine what real error we can get from _lagg_).

4) If we are router, we can do either slooow ip_input() -> ip_forward() 
-> ip_output() cycle or use optimized ip_fastfwd() which falls back to 
'slow' path for multicast/options/local traffic (e.g. works exactly like 
'data plane' part).
(Btw, we can consider net.inet.ip.fastforwarding to be turned on by 
default at least for non-IPSEC kernels)

Here we have to determine if this is local packet or not, e.g. F(dst_ip) 
returning 1 or 0. Currently we are simply using standard rlock + hash of 
iface addresses.
(And some consumers like ipfw(4) do the same, but without lock).
We don't need to do this! We can build sorted array of IPv4 addresses or 
other efficient structure on every address change and use it unlocked 
with delayed garbage collection (proof-of-concept attached)
(There is another thing to discuss: maybe we can do this once somewhere 
in ip_input and mark mbuf as 'local/non-local' ? )

5, 9) Currently we have L3 ingress/egress PFIL hooks protected by 
rmlocks. This is OK.

However, 6) and 7) are not.
Firewall can use the same pfil lock as reader protection without 
imposing its own lock. currently pfil&ipfw code is ready to do this.

8) Radix/rt* api. This is probably the worst place in entire stack. It 
is toooo generic, tooo slow and buggy (do you use IPv6? you definitely 
know what I'm talking about).
A) It really is too generic and assumption that it can be (effectively) 
used for every family is wrong. Two examples:
we don't need to lookup all 128 bits of IPv6 address. Subnets with mask 
 >/64 are not used widely (actually the only reason to use them are p2p 
links due to ND potential problems).
One of common solutions is to lookup 64bits, and build another trie (or 
other structure) in case of collision.
Another example is MPLS where we can simply do direct array lookup based 
on ingress label.

B) It is terribly slow (AFAIR luigi@ did some performance management, 
numbers available in one of netmap pdfs)
C) It is not multipath-capable. Stateful (and non-working) multipath is 
definitely not the right way.

8*) rtentry
We are doing it wrong.
Currently _every_ lookup locks/unlocks given rte twice.
First lock is related to and old-old story for trusting IP redirects 
(and auto-adding host routes for them). Hopefully currently it is 
disabled automatically when you turn forwarding on.
The second one is much more complicated: we are assuming that rte's with 
non-zero refcount value can stop egress interface from being destroyed.
This is wrong (but widely used) assumption.

We can use delayed GC instead of locking for rte's and this won't break 
things more than they are broken now (patch attached).
We can't do the same for ifp structures since
a) virtual ones can assume some state in underlying physical NIC
b) physical ones just _can_ be destroyed (maybe regardless of user wants 
this or not, like: SFP being unplugged from NIC) or simply lead to 
kernel crash due to SW/HW inconsistency

One of possible solution is to implement stable refcounts based on PCPU 
counters, and apply thos counters to ifp, but seem to be non-trivial.


Another rtalloc(9) problem is the fact that radix is used as both 
'control plane' and 'data plane' structure/api. Some users always want 
to put more information in rte, while others
want to make rte more compact. We just need _different_ structures for that.
Feature-rich, lot-of-data control plane one (to store everything we want 
to store, including, for example, PID of process originating the route) 
- current radix can be modified to do this.
And address-family-depended another structure (array, trie, or anything) 
which contains _only_ data necessary to put packet on the wire.

11) arpresolve. Currently (this was decoupled in 8.x) we have
a) ifaddr rlock
b) lle rlock.

We don't need those locks.
We need to
a) make lle layer per-interface instead of global (and this can also 
solve multiple fibs and L2 mappings done in fib.0 issue)
b) use rtalloc(9)-provided lock instead of separate locking
c) actually, we need to do rewrite this layer because
d) lle actually is the place to do real multipath:

briefly,
you have rte pointing to some special nexthop structure pointing to lle, 
which has the following data:
num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to 
prepend to header
Separate post will follow.

With the following, we can achieve lagg traffic distribution without 
actually using lagg_transmit and similar stuff (at least in most common 
scenarious)
(for example, TCP output definitely can benefit from this, since we can 
account flowid once for TCP session and use in in every mbuf)


So. Imagine we have done all this. How we can estimate the difference?

There was a thread, started a year ago, describing 'stock' performance 
and difference for various modifications.
It is done on 8.x, however I've got similar results on recent 9.x

http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html

Briefly:

2xE5645 @ Intel 82599 NIC.
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, 
no firewallIxia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends 
64byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to 
destinations in vlan11 (10.100.1.128 - 10.100.1.192). Static arps are 
configured for all destination addresses. Traffic level is slightly 
above or slightly below system performance.

we start from 1.4MPPS (if we are using several routes to minimize mutex 
contention).

My 'current' result for the same test, on same HW, with the following 
modifications:

* 1) ixgbe per-packet ring unlock removed
* P1) ixgbe is modified to do direct vlan input (so 2,3 are not used)
* 4) separate lockless in_localip() version
* 6) - using existing pfil lock
* 7) using lockless version
* 8) radix converted to use rmlock instead of rlock. Delayed GC is used 
instead of mutexes
* 10) - using existing pfil lock
* 11) using radix lock to do arpresolve(). Not using lle rlock

(so the rmlocks are the only locks used on data path).

Additionally, ipstat counters are converted to PCPU (no real performance 
implications).
ixgbe does not do per-packet accounting (as in head).
if_vlan counters are converted to PCPU
lagg is converted to rmlock, per-packet accounting is removed (using 
stat from underlying interfaces)
lle hash size is bumped to 1024 instead of 32 (not applicable here, but 
slows things down for large L2 domains)

The result is 5.6 MPPS for single port (11 cores) and 6.5MPPS for lagg 
(16 cores), nearly the same for HT on and 22 cores.

..
while Intel DPDK claims 80MPPS (and 6windgate talks about 160 or so) on 
the same-class hardware and _userland_ forwarding.

One of key features making all such products possible (DPDK, netmap, 
packetshader, Cisco SW forwarding) - is use of batching instead of 
process-to-completion model.
Batching mitigates locking cost, batching does not wash out CPU cache, 
and so on.

So maybe we can consider passing batches from NIC to at least L2 layer 
with netisr? or even up to ip_input() ?

Another question is about making some sort of reliable GC like ("passive 
serialization" or other similar not-to-pronounce-words about Linux and 
lockless objects).


P.S. Attached patches are 1) for 8.x 2) mostly 'hacks' showing roughly 
how can this be done and what benefit can be achieved.


--------------010308000904000207080306
Content-Type: text/plain; charset=UTF-8;
 name="1_ixgbe_unlock.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="1_ixgbe_unlock.diff"

commit 20a52503455c80cd149d2232bdc0d37e14381178
Author: Charlie Root <root@test15.yandex.net>
Date:   Tue Oct 23 21:20:13 2012 +0000

    Remove RX ring unlock/lock before calling if_input() from ixgbe drivers.

diff --git a/sys/dev/ixgbe/ixgbe.c b/sys/dev/ixgbe/ixgbe.c
index 5d8752b..fc1491e 100644
--- a/sys/dev/ixgbe/ixgbe.c
+++ b/sys/dev/ixgbe/ixgbe.c
@@ -4171,9 +4171,7 @@ ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *ifp, struct mbuf *m, u32 ptype
                         if (tcp_lro_rx(&rxr->lro, m, 0) == 0)
                                 return;
         }
-	IXGBE_RX_UNLOCK(rxr);
         (*ifp->if_input)(ifp, m);
-	IXGBE_RX_LOCK(rxr);
 }
 
 static __inline void

--------------010308000904000207080306
Content-Type: text/plain; charset=UTF-8;
 name="2_ixgbe_vlans2.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="2_ixgbe_vlans2.diff"

Index: sys/dev/ixgbe/ixgbe.c
===================================================================
--- sys/dev/ixgbe/ixgbe.c	(revision 248704)
+++ sys/dev/ixgbe/ixgbe.c	(working copy)
@@ -2880,6 +2880,14 @@ ixgbe_allocate_queues(struct adapter *adapter)
 			error = ENOMEM;
 			goto err_rx_desc;
 		}
+
+		if ((rxr->vlans = malloc(sizeof(struct ifvlans), M_DEVBUF,
+		    M_NOWAIT | M_ZERO)) == NULL) {
+			device_printf(dev,
+			    "Critical Failure setting up vlan index\n");
+			error = ENOMEM;
+			goto err_rx_desc;
+		}
 	}
 
 	/*
@@ -4271,6 +4279,11 @@ ixgbe_free_receive_buffers(struct rx_ring *rxr)
 		rxr->ptag = NULL;
 	}
 
+	if (rxr->vlans != NULL) {
+		free(rxr->vlans, M_DEVBUF);
+		rxr->vlans = NULL;
+	}
+
 	return;
 }
 
@@ -4303,7 +4316,7 @@ ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *
                                 return;
         }
 	IXGBE_RX_UNLOCK(rxr);
-        (*ifp->if_input)(ifp, m);
+        (*ifp->if_input)(m->m_pkthdr.rcvif, m);
 	IXGBE_RX_LOCK(rxr);
 }
 
@@ -4360,6 +4373,7 @@ ixgbe_rxeof(struct ix_queue *que)
 	u16			count = rxr->process_limit;
 	union ixgbe_adv_rx_desc	*cur;
 	struct ixgbe_rx_buf	*rbuf, *nbuf;
+	struct ifnet		*ifp_dst;
 
 	IXGBE_RX_LOCK(rxr);
 
@@ -4522,9 +4536,19 @@ ixgbe_rxeof(struct ix_queue *que)
 			    (staterr & IXGBE_RXD_STAT_VP))
 				vtag = le16toh(cur->wb.upper.vlan);
 			if (vtag) {
-				sendmp->m_pkthdr.ether_vtag = vtag;
-				sendmp->m_flags |= M_VLANTAG;
-			}
+				ifp_dst = rxr->vlans->idx[EVL_VLANOFTAG(vtag)];
+
+				if (ifp_dst != NULL) {
+					ifp_dst->if_ipackets++;
+					sendmp->m_pkthdr.rcvif = ifp_dst;
+				} else {
+					sendmp->m_pkthdr.ether_vtag = vtag;
+					sendmp->m_flags |= M_VLANTAG;
+					sendmp->m_pkthdr.rcvif = ifp;
+				}
+			} else
+				sendmp->m_pkthdr.rcvif = ifp;
+
 			if ((ifp->if_capenable & IFCAP_RXCSUM) != 0)
 				ixgbe_rx_checksum(staterr, sendmp, ptype);
 #if __FreeBSD_version >= 800000
@@ -4625,7 +4649,32 @@ ixgbe_rx_checksum(u32 staterr, struct mbuf * mp, u
 	return;
 }
 
+/*
+ * This routine gets real vlan ifp based on
+ * underlying ifp and vlan tag.
+ */
+static struct ifnet *
+ixgbe_get_vlan(struct ifnet *ifp, uint16_t vtag)
+{
 
+	/* XXX: IFF_MONITOR */
+#if 0
+	struct lagg_port *lp = ifp->if_lagg;
+	struct lagg_softc *sc = lp->lp_softc;
+
+	/* Skip lagg nesting */
+	while (ifp->if_type == IFT_IEEE8023ADLAG) {
+		lp = ifp->if_lagg;
+		sc = lp->lp_softc;
+		ifp = sc->sc_ifp;
+	}
+#endif
+	/* Get vlan interface based on tag */
+	ifp = VLAN_DEVAT(ifp, vtag);
+
+	return (ifp);
+}
+
 /*
 ** This routine is run via an vlan config EVENT,
 ** it enables us to use the HW Filter table since
@@ -4637,7 +4686,9 @@ static void
 ixgbe_register_vlan(void *arg, struct ifnet *ifp, u16 vtag)
 {
 	struct adapter	*adapter = ifp->if_softc;
-	u16		index, bit;
+	u16		index, bit, j;
+	struct rx_ring	*rxr;
+	struct ifnet	*ifv;
 
 	if (ifp->if_softc !=  arg)   /* Not our event */
 		return;
@@ -4645,7 +4696,20 @@ ixgbe_register_vlan(void *arg, struct ifnet *ifp,
 	if ((vtag == 0) || (vtag > 4095))	/* Invalid */
 		return;
 
+	ifv = ixgbe_get_vlan(ifp, vtag);
+
 	IXGBE_CORE_LOCK(adapter);
+
+	if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) {
+		rxr = adapter->rx_rings;
+
+		for (j = 0; j < adapter->num_queues; j++, rxr++) {
+			IXGBE_RX_LOCK(rxr);
+			rxr->vlans->idx[vtag] = ifv;
+			IXGBE_RX_UNLOCK(rxr);
+		}
+	}
+
 	index = (vtag >> 5) & 0x7F;
 	bit = vtag & 0x1F;
 	adapter->shadow_vfta[index] |= (1 << bit);
@@ -4663,7 +4727,8 @@ static void
 ixgbe_unregister_vlan(void *arg, struct ifnet *ifp, u16 vtag)
 {
 	struct adapter	*adapter = ifp->if_softc;
-	u16		index, bit;
+	u16		index, bit, j;
+	struct rx_ring	*rxr;
 
 	if (ifp->if_softc !=  arg)
 		return;
@@ -4672,6 +4737,15 @@ ixgbe_unregister_vlan(void *arg, struct ifnet *ifp
 		return;
 
 	IXGBE_CORE_LOCK(adapter);
+
+	rxr = adapter->rx_rings;
+
+	for (j = 0; j < adapter->num_queues; j++, rxr++) {
+		IXGBE_RX_LOCK(rxr);
+		rxr->vlans->idx[vtag] = NULL;
+		IXGBE_RX_UNLOCK(rxr);
+	}
+
 	index = (vtag >> 5) & 0x7F;
 	bit = vtag & 0x1F;
 	adapter->shadow_vfta[index] &= ~(1 << bit);
@@ -4686,8 +4760,8 @@ ixgbe_setup_vlan_hw_support(struct adapter *adapte
 {
 	struct ifnet 	*ifp = adapter->ifp;
 	struct ixgbe_hw *hw = &adapter->hw;
+	u32		ctrl, j;
 	struct rx_ring	*rxr;
-	u32		ctrl;
 
 
 	/*
@@ -4713,6 +4787,15 @@ ixgbe_setup_vlan_hw_support(struct adapter *adapte
 	if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) {
 		ctrl &= ~IXGBE_VLNCTRL_CFIEN;
 		ctrl |= IXGBE_VLNCTRL_VFE;
+	} else {
+		/* Zero vlan table */
+		rxr = adapter->rx_rings;
+
+		for (j = 0; j < adapter->num_queues; j++, rxr++) {
+			IXGBE_RX_LOCK(rxr);
+			memset(rxr->vlans->idx, 0, sizeof(struct ifvlans));
+			IXGBE_RX_UNLOCK(rxr);
+		}
 	}
 	if (hw->mac.type == ixgbe_mac_82598EB)
 		ctrl |= IXGBE_VLNCTRL_VME;
Index: sys/dev/ixgbe/ixgbe.h
===================================================================
--- sys/dev/ixgbe/ixgbe.h	(revision 248704)
+++ sys/dev/ixgbe/ixgbe.h	(working copy)
@@ -284,6 +284,11 @@ struct ix_queue {
 	u64			irqs;
 };
 
+struct ifvlans {
+	struct ifnet 		*idx[4096];
+};
+
+
 /*
  * The transmit ring, one per queue
  */
@@ -307,7 +312,6 @@ struct tx_ring {
 	}			queue_status;
 	u32			txd_cmd;
 	bus_dma_tag_t		txtag;
-	char			mtx_name[16];
 #ifndef IXGBE_LEGACY_TX
 	struct buf_ring		*br;
 	struct task		txq_task;
@@ -324,6 +328,7 @@ struct tx_ring {
 	unsigned long   	no_tx_dma_setup;
 	u64			no_desc_avail;
 	u64			total_packets;
+	char			mtx_name[16];
 };
 
 
@@ -346,8 +351,8 @@ struct rx_ring {
 	u16			num_desc;
 	u16			mbuf_sz;
 	u16			process_limit;
-	char			mtx_name[16];
 	struct ixgbe_rx_buf	*rx_buffers;
+	struct ifvlans		*vlans;
 	bus_dma_tag_t		ptag;
 
 	u32			bytes; /* Used for AIM calc */
@@ -363,6 +368,7 @@ struct rx_ring {
 #ifdef IXGBE_FDIR
 	u64			flm;
 #endif
+	char			mtx_name[16];
 };
 
 /* Our adapter structure */

--------------010308000904000207080306
Content-Type: text/plain; charset=UTF-8;
 name="3_in_localip_fast.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="3_in_localip_fast.diff"

commit 7f1103ac622881182642b2d3ae17b6ff484c1293
Author: Charlie Root <root@test15.yandex.net>
Date:   Sun Apr 7 23:50:26 2013 +0000

    Use lockles in_localip_fast() function.

diff --git a/sys/net/route.h b/sys/net/route.h
index 4d9371b..f588f03 100644
--- a/sys/net/route.h
+++ b/sys/net/route.h
@@ -365,6 +365,7 @@ void 	 rt_maskedcopy(struct sockaddr *, struct sockaddr *, struct sockaddr *);
  */
 #define RTGC_ROUTE	1
 #define RTGC_IF		3
+#define	RTGC_IFADDR	4
 
 
 int	 rtexpunge(struct rtentry *);
diff --git a/sys/netinet/in.c b/sys/netinet/in.c
index 5341918..a83b8a9 100644
--- a/sys/netinet/in.c
+++ b/sys/netinet/in.c
@@ -93,6 +93,20 @@ VNET_DECLARE(struct inpcbinfo, ripcbinfo);
 VNET_DECLARE(struct arpstat, arpstat);  /* ARP statistics, see if_arp.h */
 #define	V_arpstat		VNET(arpstat)
 
+struct in_ifaddrf {
+	struct in_ifaddrf *next;
+	struct in_addr addr;
+};
+
+struct in_ifaddrhashf {
+	uint32_t hmask;
+	uint32_t count;
+	struct in_ifaddrf **hash;
+};
+
+VNET_DEFINE(struct in_ifaddrhashf *, in_ifaddrhashtblf) = NULL; /* inet addr fast hash table */
+#define	V_in_ifaddrhashtblf	VNET(in_ifaddrhashtblf)
+
 /*
  * Return 1 if an internet address is for a ``local'' host
  * (one to which we have a connection).  If subnetsarelocal
@@ -145,6 +159,120 @@ in_localip(struct in_addr in)
 	return (0);
 }
 
+int
+in_localip_fast(struct in_addr in)
+{
+	struct in_ifaddrf *rec;
+	struct in_ifaddrhashf *f;
+
+	if ((f = V_in_ifaddrhashtblf) == NULL)
+		return (0);
+
+	rec = f->hash[INADDR_HASHVAL(in) & f->hmask];
+
+	while (rec != NULL && rec->addr.s_addr != in.s_addr)
+		rec = rec->next;
+
+	if (rec != NULL)
+		return (1);
+
+	return (0);
+}
+
+struct in_ifaddrhashf *
+in_hash_alloc(int additional)
+{
+	int count, hsize, i;
+	struct in_ifaddr *ia;
+	struct in_ifaddrhashf *new;
+
+	count = additional + 1;
+
+	IN_IFADDR_RLOCK();
+	for (i = 0; i < INADDR_NHASH; i++) {
+		LIST_FOREACH(ia, &V_in_ifaddrhashtbl[i], ia_hash)
+			count++;
+	}
+	IN_IFADDR_RUNLOCK();
+
+	/* roundup to the next power of 2 */
+	hsize = (1UL << flsl(count - 1));
+
+	new = malloc(sizeof(struct in_ifaddrhashf) +
+	    sizeof(void *) * hsize +
+	    sizeof(struct in_ifaddrf) * count, M_IFADDR,
+	    M_NOWAIT | M_ZERO);
+
+	if (new == NULL)
+		return (NULL);
+
+	new->count = count;
+	new->hmask = hsize - 1;
+	new->hash = (struct in_ifaddrf **)(new + 1);
+
+	return (new);
+}
+
+int
+in_hash_build(struct in_ifaddrhashf *new)
+{
+	struct in_ifaddr *ia;
+	int i, j, count, hsize, r;
+	struct in_ifaddrhashf *old;
+	struct in_ifaddrf *rec, *tmp;
+
+	count = new->count - 1;
+	hsize = new->hmask + 1;
+	rec = (struct in_ifaddrf *)&new->hash[hsize];
+
+	IN_IFADDR_RLOCK();
+	for (i = 0; i < INADDR_NHASH; i++) {
+		LIST_FOREACH(ia, &V_in_ifaddrhashtbl[i], ia_hash) {
+			rec->addr.s_addr = IA_SIN(ia)->sin_addr.s_addr;
+
+			j = INADDR_HASHVAL(rec->addr) & new->hmask;
+			if ((tmp = new->hash[j]) == NULL)
+				new->hash[j] = rec;
+			else {
+				while (tmp->next)
+					tmp = tmp->next;
+				tmp->next = rec;
+			}
+
+			rec++;
+			count--;
+
+			/* End of memory */
+			if (count < 0)
+				break;
+		}
+
+		/* End of memory */
+		if (count < 0)
+			break;
+	}
+	IN_IFADDR_RUNLOCK();
+
+	/* If count >0 then we succeeded in building hash. Stop cycle */
+
+	if (count >= 0) {
+		old = V_in_ifaddrhashtblf;
+		V_in_ifaddrhashtblf = new;
+
+		rtgc_free(RTGC_IFADDR, old, 0);
+
+		return (1);
+	}
+
+	/* Fail. */
+	if (new)
+		free(new, M_IFADDR);
+
+	return (0);
+}
+
+
+
 /*
  * Determine whether an IP address is in a reserved set of addresses
  * that may not be forwarded, or whether datagrams to that destination
@@ -239,6 +367,7 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp,
 	struct sockaddr_in oldaddr;
 	int error, hostIsNew, iaIsNew, maskIsNew;
 	int iaIsFirst;
+	struct in_ifaddrhashf *new_hash;
 
 	ia = NULL;
 	iaIsFirst = 0;
@@ -405,6 +534,11 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp,
 				goto out;
 			}
 
+			if ((new_hash = in_hash_alloc(1)) == NULL) {
+				error = ENOBUFS;
+				goto out;
+			}
+
 			ifa = &ia->ia_ifa;
 			ifa_init(ifa);
 			ifa->ifa_addr = (struct sockaddr *)&ia->ia_addr;
@@ -427,6 +561,8 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp,
 			IN_IFADDR_WLOCK();
 			TAILQ_INSERT_TAIL(&V_in_ifaddrhead, ia, ia_link);
 			IN_IFADDR_WUNLOCK();
+
+			in_hash_build(new_hash);
 			iaIsNew = 1;
 		}
 		break;
@@ -649,6 +785,8 @@ in_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp,
 			ifa_free(&if_ia->ia_ifa);
 	} else
 		IN_IFADDR_WUNLOCK();
+	if ((new_hash = in_hash_alloc(0)) != NULL)
+		in_hash_build(new_hash);
 	ifa_free(&ia->ia_ifa);				/* in_ifaddrhead */
 out:
 	if (ia != NULL)
@@ -852,6 +990,7 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin,
 	register u_long i = ntohl(sin->sin_addr.s_addr);
 	struct sockaddr_in oldaddr;
 	int s = splimp(), flags = RTF_UP, error = 0;
+	struct in_ifaddrhashf *new_hash;
 
 	oldaddr = ia->ia_addr;
 	if (oldaddr.sin_family == AF_INET)
@@ -862,6 +1001,9 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin,
 		LIST_INSERT_HEAD(INADDR_HASH(ia->ia_addr.sin_addr.s_addr),
 		    ia, ia_hash);
 		IN_IFADDR_WUNLOCK();
+
+		if ((new_hash = in_hash_alloc(1)) != NULL)
+			in_hash_build(new_hash);
 	}
 	/*
 	 * Give the interface a chance to initialize
@@ -887,6 +1029,8 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct sockaddr_in *sin,
 				 */
 				LIST_REMOVE(ia, ia_hash);
 			IN_IFADDR_WUNLOCK();
+			if ((new_hash = in_hash_alloc(1)) != NULL)
+				in_hash_build(new_hash);
 			return (error);
 		}
 	}
diff --git a/sys/netinet/in.h b/sys/netinet/in.h
index b03e74c..948938a 100644
--- a/sys/netinet/in.h
+++ b/sys/netinet/in.h
@@ -741,6 +741,7 @@ int	 in_broadcast(struct in_addr, struct ifnet *);
 int	 in_canforward(struct in_addr);
 int	 in_localaddr(struct in_addr);
 int	 in_localip(struct in_addr);
+int	 in_localip_fast(struct in_addr);
 int	 inet_aton(const char *, struct in_addr *); /* in libkern */
 char	*inet_ntoa(struct in_addr); /* in libkern */
 char	*inet_ntoa_r(struct in_addr ina, char *buf); /* in libkern */
diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c
index 692e3e5..f7734a9 100644
--- a/sys/netinet/ip_fastfwd.c
+++ b/sys/netinet/ip_fastfwd.c
@@ -347,7 +347,7 @@ ip_fastforward(struct mbuf *m)
 	/*
 	 * Is it for a local address on this host?
 	 */
-	if (in_localip(ip->ip_dst))
+	if (in_localip_fast(ip->ip_dst))
 		return m;
 
 	//IPSTAT_INC(ips_total);
@@ -390,7 +390,7 @@ ip_fastforward(struct mbuf *m)
 		/*
 		 * Is it now for a local address on this host?
 		 */
-		if (in_localip(dest))
+		if (in_localip_fast(dest))
 			goto forwardlocal;
 		/*
 		 * Go on with new destination address
@@ -479,7 +479,7 @@ passin:
 		/*
 		 * Is it now for a local address on this host?
 		 */
-		if (m->m_flags & M_FASTFWD_OURS || in_localip(dest)) {
+		if (m->m_flags & M_FASTFWD_OURS || in_localip_fast(dest)) {
 forwardlocal:
 			/*
 			 * Return packet for processing by ip_input().
diff --git a/sys/netinet/ipfw/ip_fw2.c b/sys/netinet/ipfw/ip_fw2.c
index b76a638..53f6e97 100644
--- a/sys/netinet/ipfw/ip_fw2.c
+++ b/sys/netinet/ipfw/ip_fw2.c
@@ -1450,10 +1450,7 @@ do {								\
 
 			case O_IP_SRC_ME:
 				if (is_ipv4) {
-					struct ifnet *tif;
-
-					INADDR_TO_IFP(src_ip, tif);
-					match = (tif != NULL);
+					match = in_localip_fast(src_ip);
 					break;
 				}
 #ifdef INET6
@@ -1490,10 +1487,7 @@ do {								\
 
 			case O_IP_DST_ME:
 				if (is_ipv4) {
-					struct ifnet *tif;
-
-					INADDR_TO_IFP(dst_ip, tif);
-					match = (tif != NULL);
+					match = in_localip_fast(dst_ip);
 					break;
 				}
 #ifdef INET6
diff --git a/sys/netinet/ipfw/ip_fw_pfil.c b/sys/netinet/ipfw/ip_fw_pfil.c
index a21f501..bdf8beb 100644
--- a/sys/netinet/ipfw/ip_fw_pfil.c
+++ b/sys/netinet/ipfw/ip_fw_pfil.c
@@ -184,7 +184,7 @@ again:
 		bcopy(args.next_hop, (fwd_tag+1), sizeof(struct sockaddr_in));
 		m_tag_prepend(*m0, fwd_tag);
 
-		if (in_localip(args.next_hop->sin_addr))
+		if (in_localip_fast(args.next_hop->sin_addr))
 			(*m0)->m_flags |= M_FASTFWD_OURS;
 	    }
 #endif /* INET || INET6 */

--------------010308000904000207080306
Content-Type: text/plain; charset=UTF-8;
 name="80_use_rtgc.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="80_use_rtgc.diff"

commit 67a74d91a7b4a47a83fcfa5e79a6c6f0b4b1122d
Author: Charlie Root <root@test15.yandex.net>
Date:   Fri Oct 26 17:10:52 2012 +0000

    Remove rte locking for IPv4. Remove one of 2 locks from IPv6 rtes

diff --git a/sys/net/if.c b/sys/net/if.c
index a875326..eb6a723 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -487,6 +487,13 @@ if_alloc(u_char type)
 	return (ifp);
 }
 
+
+void
+if_free_real(struct ifnet *ifp)
+{
+	free(ifp, M_IFNET);
+}
+
 /*
  * Do the actual work of freeing a struct ifnet, and layer 2 common
  * structure.  This call is made when the last reference to an
@@ -499,6 +506,15 @@ if_free_internal(struct ifnet *ifp)
 	KASSERT((ifp->if_flags & IFF_DYING),
 	    ("if_free_internal: interface not dying"));
 
+	if (rtgc_is_enabled()) {
+		/* 
+		 * FIXME: Sleep some time to permit packets
+		 * using fastforwarding routine without locking
+		 * die withour side effects.
+		 */
+		pause("if_free_gc", hz / 20); /* Sleep 50 milliseconds */
+	}
+
 	if (if_com_free[ifp->if_alloctype] != NULL)
 		if_com_free[ifp->if_alloctype](ifp->if_l2com,
 		    ifp->if_alloctype);
@@ -511,7 +527,10 @@ if_free_internal(struct ifnet *ifp)
 	IF_AFDATA_DESTROY(ifp);
 	IF_ADDR_LOCK_DESTROY(ifp);
 	ifq_delete(&ifp->if_snd);
-	free(ifp, M_IFNET);
+	if (rtgc_is_enabled())
+		rtgc_free(RTGC_IF, ifp, 0);
+	else
+		if_free_real(ifp);
 }
 
 /*
diff --git a/sys/net/if_var.h b/sys/net/if_var.h
index 39c499f..5ef6264 100644
--- a/sys/net/if_var.h
+++ b/sys/net/if_var.h
@@ -857,6 +857,7 @@ void	if_down(struct ifnet *);
 struct ifmultiaddr *
 	if_findmulti(struct ifnet *, struct sockaddr *);
 void	if_free(struct ifnet *);
+void	if_free_real(struct ifnet *);
 void	if_free_type(struct ifnet *, u_char);
 void	if_initname(struct ifnet *, const char *, int);
 void	if_link_state_change(struct ifnet *, int);
diff --git a/sys/net/route.c b/sys/net/route.c
index 3059f5a..97965b3 100644
--- a/sys/net/route.c
+++ b/sys/net/route.c
@@ -142,6 +142,175 @@ VNET_DEFINE(int, rttrash);		/* routes not in table but not freed */
 static VNET_DEFINE(uma_zone_t, rtzone);		/* Routing table UMA zone. */
 #define	V_rtzone	VNET(rtzone)
 
+SYSCTL_NODE(_net, OID_AUTO, gc, CTLFLAG_RW, 0, "Garbage collector");
+
+MALLOC_DEFINE(M_RTGC, "rtgc", "route GC");
+void rtgc_func(void *_unused);
+void rtfree_real(struct rtentry *rt);
+
+int _rtgc_default_enabled = 1;
+TUNABLE_INT("net.gc.enable", &_rtgc_default_enabled);
+
+#define	RTGC_CALLOUT_DELAY	1
+#define	RTGC_EXPIRE_DELAY	3
+
+VNET_DEFINE(struct mtx, rtgc_mtx);
+#define	V_rtgc_mtx	VNET(rtgc_mtx)
+VNET_DEFINE(struct callout, rtgc_callout);
+#define	V_rtgc_callout	VNET(rtgc_callout)
+VNET_DEFINE(int, rtgc_enabled);
+#define	V_rtgc_enabled	VNET(rtgc_enabled)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, enable, CTLFLAG_RW,
+	&VNET_NAME(rtgc_enabled), 1,
+	"Enable garbage collector");
+VNET_DEFINE(int, rtgc_expire_delay) = RTGC_EXPIRE_DELAY;
+#define	V_rtgc_expire_delay	VNET(rtgc_expire_delay)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, expire, CTLFLAG_RW,
+	&VNET_NAME(rtgc_expire_delay), 1,
+	"Object expiration delay");
+VNET_DEFINE(int, rtgc_numfailures);
+#define	V_rtgc_numfailures	VNET(rtgc_numfailures)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, failures, CTLFLAG_RD,
+	&VNET_NAME(rtgc_numfailures), 0,
+	"Number of objects leaked from route garbage collector");
+VNET_DEFINE(int, rtgc_numqueued);
+#define	V_rtgc_numqueued	VNET(rtgc_numqueued)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, queued, CTLFLAG_RD,
+	&VNET_NAME(rtgc_numqueued), 0,
+	"Number of objects queued for deletion");
+VNET_DEFINE(int, rtgc_numfreed);
+#define	V_rtgc_numfreed	VNET(rtgc_numfreed)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, freed, CTLFLAG_RD,
+	&VNET_NAME(rtgc_numfreed), 0,
+	"Number of objects deleted");
+VNET_DEFINE(int, rtgc_numinvoked);
+#define	V_rtgc_numinvoked	VNET(rtgc_numinvoked)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, invoked, CTLFLAG_RD,
+	&VNET_NAME(rtgc_numinvoked), 0,
+	"Number of times GC was invoked");
+
+struct rtgc_item {
+	time_t	expire;	/* Whe we can delete this entry */
+	int 	etype;	/* Entry type */
+	void	*data;	/* data to free */
+	TAILQ_ENTRY(rtgc_item)	items;
+};
+
+VNET_DEFINE(TAILQ_HEAD(, rtgc_item), rtgc_queue);
+#define	V_rtgc_queue	VNET(rtgc_queue)
+
+int
+rtgc_is_enabled()
+{
+	return V_rtgc_enabled;
+}
+
+void
+rtgc_func(void *_unused)
+{
+	struct rtgc_item *item, *temp_item;
+	TAILQ_HEAD(, rtgc_item) rtgc_tq;
+	int empty, deleted;
+
+	CTR2(KTR_NET, "%s: started with %d objects", __func__, V_rtgc_numqueued);
+
+	TAILQ_INIT(&rtgc_tq);
+
+	/* Move all contents of current queue to new empty queue */
+	mtx_lock(&V_rtgc_mtx);
+	V_rtgc_numinvoked++;
+	TAILQ_SWAP(&rtgc_queue, &rtgc_tq, rtgc_item, items);
+	mtx_unlock(&V_rtgc_mtx);
+
+	deleted = 0;
+
+	/* Dispatch as much as we can */
+	TAILQ_FOREACH_SAFE(item, &rtgc_tq, items, temp_item) {
+		if (item->expire > time_uptime)
+			break;
+
+		/* We can definitely delete this item */
+		TAILQ_REMOVE(&rtgc_tq, item, items);
+
+		switch (item->etype) {
+		case RTGC_ROUTE:
+			CTR1(KTR_NET, "Freeing route structure %p", item->data);
+			rtfree_real((struct rtentry *)item->data);
+			break;
+		case RTGC_IF:
+			CTR1(KTR_NET, "Freeing iface structure %p", item->data);
+			if_free_real((struct ifnet *)item->data);
+			break;
+		default:
+			CTR2(KTR_NET, "Unknown type: %d %p", item->etype, item->data);
+			break;
+		}
+
+		/* Remove item itself */
+		free(item, M_RTGC);
+		deleted++;
+	}
+
+	/*
+	 * Add remaining data back to mail queue.
+	 * Note items are still sorted by time_uptime after merge.
+	 */
+
+	mtx_lock(&V_rtgc_mtx);
+	/* Add new items to the end of our temporary queue */
+	TAILQ_CONCAT(&rtgc_tq, &rtgc_queue, items);
+	/* Move items back to stable storage */
+	TAILQ_SWAP(&rtgc_queue, &rtgc_tq, rtgc_item, items);
+	/* Check if we need to run callout another time */
+	empty = TAILQ_EMPTY(&rtgc_queue);
+	/* Update counters */
+	V_rtgc_numfreed += deleted;
+	V_rtgc_numqueued -= deleted;
+	mtx_unlock(&V_rtgc_mtx);
+
+	CTR4(KTR_NET, "%s: ended with %d object(s) (%d deleted), callout: %s",
+		__func__, V_rtgc_numqueued, deleted, empty ? "stopped" : "sheduled");
+	/* Schedule ourself iff there are items to delete */
+	if (!empty)
+		callout_reset(&V_rtgc_callout, hz * RTGC_CALLOUT_DELAY, rtgc_func, NULL);
+}
+
+void
+rtgc_free(int etype, void *data, int can_sleep)
+{
+	struct rtgc_item *item;
+
+	item = malloc(sizeof(struct rtgc_item), M_RTGC, (can_sleep ? M_WAITOK : M_NOWAIT) | M_ZERO);
+	if (item == NULL) {
+		V_rtgc_numfailures++; /* XXX: locking */
+		return; /* Skip route freeing. Memory leak is much better than panic */
+	}
+
+	item->expire = time_uptime + V_rtgc_expire_delay;
+	item->etype = etype;
+	item->data = data;
+
+	if ((!can_sleep) && (mtx_trylock(&V_rtgc_mtx) == 0)) {
+		/* Fail to acquire lock. Add another leak */
+		free(item, M_RTGC);
+		V_rtgc_numfailures++; /* XXX: locking */
+		return;
+	}
+
+	if (can_sleep)
+		mtx_lock(&V_rtgc_mtx);
+
+	TAILQ_INSERT_TAIL(&rtgc_queue, item, items);
+	V_rtgc_numqueued++;
+
+	mtx_unlock(&V_rtgc_mtx);
+
+	/* Schedule callout if not running */
+	if (!callout_pending(&V_rtgc_callout))
+		callout_reset(&V_rtgc_callout, hz * RTGC_CALLOUT_DELAY, rtgc_func, NULL);
+}
+
+
 /*
  * handler for net.my_fibnum
  */
@@ -241,6 +410,17 @@ vnet_route_init(const void *unused __unused)
 			dom->dom_rtattach((void **)rnh, dom->dom_rtoffset);
 		}
 	}
+
+	/* Init garbage collector */
+	mtx_init(&V_rtgc_mtx, "routeGC", NULL, MTX_DEF);
+	/* Init queue */
+	TAILQ_INIT(&V_rtgc_queue);
+	/* Init garbage callout */
+	memset(&V_rtgc_callout, 0, sizeof(rtgc_callout));
+	callout_init(&V_rtgc_callout, 1);
+	/* Set default from loader tunable */
+	V_rtgc_enabled = _rtgc_default_enabled;
+	//callout_reset(&V_rtgc_callout, 3 * hz, &rtgc_func, NULL);
 }
 VNET_SYSINIT(vnet_route_init, SI_SUB_PROTO_DOMAIN, SI_ORDER_FOURTH,
     vnet_route_init, 0);
@@ -351,6 +531,74 @@ rtalloc1(struct sockaddr *dst, int report, u_long ignflags)
 }
 
 struct rtentry *
+rtalloc1_fib_nolock(struct sockaddr *dst, int report, u_long ignflags,
+		    u_int fibnum)
+{
+	struct radix_node_head *rnh;
+	struct radix_node *rn;
+	struct rtentry *newrt;
+	struct rt_addrinfo info;
+	int err = 0, msgtype = RTM_MISS;
+	int needlock;
+
+	KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
+	switch (dst->sa_family) {
+	case AF_INET6:
+	case AF_INET:
+		/* We support multiple FIBs. */
+		break;
+	default:
+		fibnum = RT_DEFAULT_FIB;
+		break;
+	}
+	rnh = rt_tables_get_rnh(fibnum, dst->sa_family);
+	newrt = NULL;
+	if (rnh == NULL)
+		goto miss;
+
+	/*
+	 * Look up the address in the table for that Address Family
+	 */
+	needlock = !(ignflags & RTF_RNH_LOCKED);
+	if (needlock)
+		RADIX_NODE_HEAD_RLOCK(rnh);
+#ifdef INVARIANTS	
+	else
+		RADIX_NODE_HEAD_LOCK_ASSERT(rnh);
+#endif
+	rn = rnh->rnh_matchaddr(dst, rnh);
+	if (rn && ((rn->rn_flags & RNF_ROOT) == 0)) {
+		newrt = RNTORT(rn);
+		if (needlock)
+			RADIX_NODE_HEAD_RUNLOCK(rnh);
+		goto done;
+
+	} else if (needlock)
+		RADIX_NODE_HEAD_RUNLOCK(rnh);
+	
+	/*
+	 * Either we hit the root or couldn't find any match,
+	 * Which basically means
+	 * "caint get there frm here"
+	 */
+miss:
+	V_rtstat.rts_unreach++;
+
+	if (report) {
+		/*
+		 * If required, report the failure to the supervising
+		 * Authorities.
+		 * For a delete, this is not an error. (report == 0)
+		 */
+		bzero(&info, sizeof(info));
+		info.rti_info[RTAX_DST] = dst;
+		rt_missmsg_fib(msgtype, &info, 0, err, fibnum);
+	}	
+done:
+	return (newrt);
+}
+
+struct rtentry *
 rtalloc1_fib(struct sockaddr *dst, int report, u_long ignflags,
 		    u_int fibnum)
 {
@@ -422,6 +670,23 @@ done:
 	return (newrt);
 }
 
+
+void
+rtfree_real(struct rtentry *rt)
+{
+	/*
+	 * The key is separatly alloc'd so free it (see rt_setgate()).
+	 * This also frees the gateway, as they are always malloc'd
+	 * together.
+	 */
+	Free(rt_key(rt));
+	
+	/*
+	 * and the rtentry itself of course
+	 */
+	uma_zfree(V_rtzone, rt);
+}
+
 /*
  * Remove a reference count from an rtentry.
  * If the count gets low enough, take it out of the routing table
@@ -484,18 +749,13 @@ rtfree(struct rtentry *rt)
 		 */
 		if (rt->rt_ifa)
 			ifa_free(rt->rt_ifa);
-		/*
-		 * The key is separatly alloc'd so free it (see rt_setgate()).
-		 * This also frees the gateway, as they are always malloc'd
-		 * together.
-		 */
-		Free(rt_key(rt));
 
-		/*
-		 * and the rtentry itself of course
-		 */
 		RT_LOCK_DESTROY(rt);
-		uma_zfree(V_rtzone, rt);
+
+		if (V_rtgc_enabled)
+			rtgc_free(RTGC_ROUTE, rt, 0);
+		else
+			rtfree_real(rt);
 		return;
 	}
 done:
diff --git a/sys/net/route.h b/sys/net/route.h
index b26ac44..3aa694d 100644
--- a/sys/net/route.h
+++ b/sys/net/route.h
@@ -363,9 +363,14 @@ void 	 rt_maskedcopy(struct sockaddr *, struct sockaddr *, struct sockaddr *);
  *
  *    RTFREE() uses an unlocked entry.
  */
+#define RTGC_ROUTE	1
+#define RTGC_IF		3
+
 
 int	 rtexpunge(struct rtentry *);
 void	 rtfree(struct rtentry *);
+void	 rtgc_free(int etype, void *data, int can_sleep);
+int	rtgc_is_enabled(void);
 int	 rt_check(struct rtentry **, struct rtentry **, struct sockaddr *);
 
 /* XXX MRT COMPAT VERSIONS THAT SET UNIVERSE to 0 */
@@ -394,6 +399,7 @@ int	 rt_getifa_fib(struct rt_addrinfo *, u_int fibnum);
 void	 rtalloc_ign_fib(struct route *ro, u_long ignflags, u_int fibnum);
 void	 rtalloc_fib(struct route *ro, u_int fibnum);
 struct rtentry *rtalloc1_fib(struct sockaddr *, int, u_long, u_int);
+struct rtentry *rtalloc1_fib_nolock(struct sockaddr *, int, u_long, u_int);
 int	 rtioctl_fib(u_long, caddr_t, u_int);
 void	 rtredirect_fib(struct sockaddr *, struct sockaddr *,
 	    struct sockaddr *, int, struct sockaddr *, u_int);
diff --git a/sys/netinet/in_rmx.c b/sys/netinet/in_rmx.c
index 1389873..1c9d9db 100644
--- a/sys/netinet/in_rmx.c
+++ b/sys/netinet/in_rmx.c
@@ -122,12 +122,12 @@ in_matroute(void *v_arg, struct radix_node_head *head)
 	struct rtentry *rt = (struct rtentry *)rn;
 
 	if (rt) {
-		RT_LOCK(rt);
+//		RT_LOCK(rt);
 		if (rt->rt_flags & RTPRF_OURS) {
 			rt->rt_flags &= ~RTPRF_OURS;
 			rt->rt_rmx.rmx_expire = 0;
 		}
-		RT_UNLOCK(rt);
+//		RT_UNLOCK(rt);
 	}
 	return rn;
 }
@@ -365,7 +365,7 @@ in_inithead(void **head, int off)
 
 	rnh = *head;
 	rnh->rnh_addaddr = in_addroute;
-	rnh->rnh_matchaddr = in_matroute;
+	rnh->rnh_matchaddr = rn_match;
 	rnh->rnh_close = in_clsroute;
 	if (_in_rt_was_here == 0 ) {
 		callout_init(&V_rtq_timer, CALLOUT_MPSAFE);
diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c
index d7fe411..d2b98b3 100644
--- a/sys/netinet/ip_fastfwd.c
+++ b/sys/netinet/ip_fastfwd.c
@@ -112,6 +112,22 @@ static VNET_DEFINE(int, ipfastforward_active);
 SYSCTL_VNET_INT(_net_inet_ip, OID_AUTO, fastforwarding, CTLFLAG_RW,
     &VNET_NAME(ipfastforward_active), 0, "Enable fast IP forwarding");
 
+void
+rtalloc_ign_fib_nolock(struct route *ro, u_long ignore, u_int fibnum);
+
+void
+rtalloc_ign_fib_nolock(struct route *ro, u_long ignore, u_int fibnum)
+{
+	struct rtentry *rt;
+
+	if ((rt = ro->ro_rt) != NULL) {
+		if (rt->rt_ifp != NULL && rt->rt_flags & RTF_UP)
+			return;
+		ro->ro_rt = NULL;
+	}
+	ro->ro_rt = rtalloc1_fib_nolock(&ro->ro_dst, 1, ignore, fibnum);
+}
+
 static struct sockaddr_in *
 ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m)
 {
@@ -126,7 +142,7 @@ ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m)
 	dst->sin_family = AF_INET;
 	dst->sin_len = sizeof(*dst);
 	dst->sin_addr.s_addr = dest.s_addr;
-	in_rtalloc_ign(ro, 0, M_GETFIB(m));
+	rtalloc_ign_fib_nolock(ro, 0, M_GETFIB(m));
 
 	/*
 	 * Route there and interface still up?
@@ -140,8 +156,10 @@ ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m)
 	} else {
 		IPSTAT_INC(ips_noroute);
 		IPSTAT_INC(ips_cantforward);
+#if 0
 		if (rt)
 			RTFREE(rt);
+#endif
 		icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_HOST, 0, 0);
 		return NULL;
 	}
@@ -334,10 +352,11 @@ ip_fastforward(struct mbuf *m)
 	if (in_localip(ip->ip_dst))
 		return m;
 
-	IPSTAT_INC(ips_total);
+	//IPSTAT_INC(ips_total);
 
 	/*
 	 * Step 3: incoming packet firewall processing
+	in_rtalloc_ign(ro, 0, M_GETFIB(m));
 	 */
 
 	/*
@@ -476,8 +495,10 @@ forwardlocal:
 			 * "ours"-label.
 			 */
 			m->m_flags |= M_FASTFWD_OURS;
+/*
 			if (ro.ro_rt)
 				RTFREE(ro.ro_rt);
+*/				
 			return m;
 		}
 		/*
@@ -490,7 +511,7 @@ forwardlocal:
 			m_tag_delete(m, fwd_tag);
 		}
 #endif /* IPFIREWALL_FORWARD */
-		RTFREE(ro.ro_rt);
+//		RTFREE(ro.ro_rt);
 		if ((dst = ip_findroute(&ro, dest, m)) == NULL)
 			return NULL;	/* icmp unreach already sent */
 		ifp = ro.ro_rt->rt_ifp;
@@ -601,17 +622,21 @@ passout:
 	if (error != 0)
 		IPSTAT_INC(ips_odropped);
 	else {
+#if 0
 		ro.ro_rt->rt_rmx.rmx_pksent++;
 		IPSTAT_INC(ips_forward);
 		IPSTAT_INC(ips_fastforward);
+#endif
 	}
 consumed:
-	RTFREE(ro.ro_rt);
+//	RTFREE(ro.ro_rt);
 	return NULL;
 drop:
 	if (m)
 		m_freem(m);
+/*
 	if (ro.ro_rt)
 		RTFREE(ro.ro_rt);
+*/		
 	return NULL;
 }
diff --git a/sys/netinet6/in6_rmx.c b/sys/netinet6/in6_rmx.c
index b526030..9aabe63 100644
--- a/sys/netinet6/in6_rmx.c
+++ b/sys/netinet6/in6_rmx.c
@@ -195,12 +195,12 @@ in6_matroute(void *v_arg, struct radix_node_head *head)
 	struct rtentry *rt = (struct rtentry *)rn;
 
 	if (rt) {
-		RT_LOCK(rt);
+		//RT_LOCK(rt);
 		if (rt->rt_flags & RTPRF_OURS) {
 			rt->rt_flags &= ~RTPRF_OURS;
 			rt->rt_rmx.rmx_expire = 0;
 		}
-		RT_UNLOCK(rt);
+		//RT_UNLOCK(rt);
 	}
 	return rn;
 }
@@ -440,7 +440,7 @@ in6_inithead(void **head, int off)
 
 	rnh = *head;
 	rnh->rnh_addaddr = in6_addroute;
-	rnh->rnh_matchaddr = in6_matroute;
+	rnh->rnh_matchaddr = rn_match;
 
 	if (V__in6_rt_was_here == 0) {
 		callout_init(&V_rtq_timer6, CALLOUT_MPSAFE);

--------------010308000904000207080306
Content-Type: text/plain; charset=UTF-8;
 name="81_radix_rmlock.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="81_radix_rmlock.diff"

commit 0e7cebd1753c3b77bdc00d728fbd5910c2d2afec
Author: Charlie Root <root@test15.yandex.net>
Date:   Mon Apr 8 15:35:00 2013 +0000

    Make radix use rmlock.

diff --git a/sys/contrib/ipfilter/netinet/ip_compat.h b/sys/contrib/ipfilter/netinet/ip_compat.h
index 31e5b11..5e74da4 100644
--- a/sys/contrib/ipfilter/netinet/ip_compat.h
+++ b/sys/contrib/ipfilter/netinet/ip_compat.h
@@ -870,6 +870,7 @@ typedef	u_int32_t	u_32_t;
 # if (__FreeBSD_version >= 500043)
 #  include <sys/mutex.h>
 #  if (__FreeBSD_version > 700014)
+#   include <sys/rmlock.h>
 #   include <sys/rwlock.h>
 #    define	KRWLOCK_T		struct rwlock
 #    ifdef _KERNEL
diff --git a/sys/contrib/pf/net/pf_table.c b/sys/contrib/pf/net/pf_table.c
index 40c9f67..b1dd703 100644
--- a/sys/contrib/pf/net/pf_table.c
+++ b/sys/contrib/pf/net/pf_table.c
@@ -44,6 +44,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/mbuf.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #ifdef __FreeBSD__
 #include <sys/malloc.h>
diff --git a/sys/kern/subr_witness.c b/sys/kern/subr_witness.c
index e565d01..f913d27 100644
--- a/sys/kern/subr_witness.c
+++ b/sys/kern/subr_witness.c
@@ -508,7 +508,7 @@ static struct witness_order_list_entry order_lists[] = {
 	 * Routing
 	 */
 	{ "so_rcv", &lock_class_mtx_sleep },
-	{ "radix node head", &lock_class_rw },
+	{ "radix node head", &lock_class_rm },
 	{ "rtentry", &lock_class_mtx_sleep },
 	{ "ifaddr", &lock_class_mtx_sleep },
 	{ NULL, NULL },
diff --git a/sys/kern/sys_socket.c b/sys/kern/sys_socket.c
index 4cbae74..fea12d0 100644
--- a/sys/kern/sys_socket.c
+++ b/sys/kern/sys_socket.c
@@ -50,6 +50,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/ucred.h>
 
 #include <net/if.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <net/route.h>
 #include <net/vnet.h>
 
diff --git a/sys/kern/vfs_export.c b/sys/kern/vfs_export.c
index 4185211..848c232 100644
--- a/sys/kern/vfs_export.c
+++ b/sys/kern/vfs_export.c
@@ -47,7 +47,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/mbuf.h>
 #include <sys/mount.h>
 #include <sys/mutex.h>
-#include <sys/rwlock.h>
+#include <sys/rmlock.h>
 #include <sys/refcount.h>
 #include <sys/socket.h>
 #include <sys/systm.h>
@@ -427,6 +427,7 @@ vfs_export_lookup(struct mount *mp, struct sockaddr *nam)
 	register struct netcred *np;
 	register struct radix_node_head *rnh;
 	struct sockaddr *saddr;
+	RADIX_NODE_HEAD_READER;
 
 	nep = mp->mnt_export;
 	if (nep == NULL)
diff --git a/sys/net/if.c b/sys/net/if.c
index 5ecde8c..351e046 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -51,6 +51,7 @@
 #include <sys/lock.h>
 #include <sys/refcount.h>
 #include <sys/module.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/sockio.h>
 #include <sys/syslog.h>
diff --git a/sys/net/radix.c b/sys/net/radix.c
index 33fcf82..d8d1e8b 100644
--- a/sys/net/radix.c
+++ b/sys/net/radix.c
@@ -37,7 +37,7 @@
 #ifdef	_KERNEL
 #include <sys/lock.h>
 #include <sys/mutex.h>
-#include <sys/rwlock.h>
+#include <sys/rmlock.h>
 #include <sys/systm.h>
 #include <sys/malloc.h>
 #include <sys/syslog.h>
diff --git a/sys/net/radix.h b/sys/net/radix.h
index 29659b5..2d130f0 100644
--- a/sys/net/radix.h
+++ b/sys/net/radix.h
@@ -36,7 +36,7 @@
 #ifdef _KERNEL
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
-#include <sys/_rwlock.h>
+#include <sys/_rmlock.h>
 #endif
 
 #ifdef MALLOC_DECLARE
@@ -133,7 +133,7 @@ struct radix_node_head {
 	struct	radix_node rnh_nodes[3];	/* empty tree for common case */
 	int	rnh_multipath;			/* multipath capable ? */
 #ifdef _KERNEL
-	struct	rwlock rnh_lock;		/* locks entire radix tree */
+	struct	rmlock rnh_lock;		/* locks entire radix tree */
 #endif
 };
 
@@ -146,18 +146,21 @@ struct radix_node_head {
 #define R_Zalloc(p, t, n) (p = (t) malloc((unsigned long)(n), M_RTABLE, M_NOWAIT | M_ZERO))
 #define Free(p) free((caddr_t)p, M_RTABLE);
 
+#define	RADIX_NODE_HEAD_READER		struct rm_priotracker tracker
 #define	RADIX_NODE_HEAD_LOCK_INIT(rnh)	\
-    rw_init_flags(&(rnh)->rnh_lock, "radix node head", 0)
-#define	RADIX_NODE_HEAD_LOCK(rnh)	rw_wlock(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_UNLOCK(rnh)	rw_wunlock(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_RLOCK(rnh)	rw_rlock(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_RUNLOCK(rnh)	rw_runlock(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_LOCK_TRY_UPGRADE(rnh)	rw_try_upgrade(&(rnh)->rnh_lock)
-
-
-#define	RADIX_NODE_HEAD_DESTROY(rnh)	rw_destroy(&(rnh)->rnh_lock)
-#define	RADIX_NODE_HEAD_LOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_LOCKED)
-#define	RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_WLOCKED)
+    rm_init(&(rnh)->rnh_lock, "radix node head")
+#define	RADIX_NODE_HEAD_LOCK(rnh)	rm_wlock(&(rnh)->rnh_lock)
+#define	RADIX_NODE_HEAD_UNLOCK(rnh)	rm_wunlock(&(rnh)->rnh_lock)
+#define	RADIX_NODE_HEAD_RLOCK(rnh)	rm_rlock(&(rnh)->rnh_lock, &tracker)
+#define	RADIX_NODE_HEAD_RUNLOCK(rnh)	rm_runlock(&(rnh)->rnh_lock, &tracker)
+//#define	RADIX_NODE_HEAD_LOCK_TRY_UPGRADE(rnh)	rw_try_upgrade(&(rnh)->rnh_lock)
+
+
+#define	RADIX_NODE_HEAD_DESTROY(rnh)	rm_destroy(&(rnh)->rnh_lock)
+#define	RADIX_NODE_HEAD_LOCK_ASSERT(rnh)
+#define	RADIX_NODE_HEAD_WLOCK_ASSERT(rnh)
+//#define	RADIX_NODE_HEAD_LOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_LOCKED)
+//#define	RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_WLOCKED)
 #endif /* _KERNEL */
 
 void	 rn_init(int);
diff --git a/sys/net/radix_mpath.c b/sys/net/radix_mpath.c
index ee7826f..c69888e 100644
--- a/sys/net/radix_mpath.c
+++ b/sys/net/radix_mpath.c
@@ -45,6 +45,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/socket.h>
 #include <sys/domain.h>
 #include <sys/syslog.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <net/radix.h>
 #include <net/radix_mpath.h>
 #include <net/route.h>
diff --git a/sys/net/route.c b/sys/net/route.c
index 5d56688..2cf6ea5 100644
--- a/sys/net/route.c
+++ b/sys/net/route.c
@@ -52,6 +52,8 @@
 #include <sys/proc.h>
 #include <sys/domain.h>
 #include <sys/kernel.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 
 #include <net/if.h>
 #include <net/if_dl.h>
@@ -544,6 +546,7 @@ rtalloc1_fib_nolock(struct sockaddr *dst, int report, u_long ignflags,
 	struct rtentry *newrt;
 	struct rt_addrinfo info;
 	int err = 0, msgtype = RTM_MISS;
+	RADIX_NODE_HEAD_READER;
 	int needlock;
 
 	KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
@@ -612,6 +615,7 @@ rtalloc1_fib(struct sockaddr *dst, int report, u_long ignflags,
 	struct rtentry *newrt;
 	struct rt_addrinfo info;
 	int err = 0, msgtype = RTM_MISS;
+	RADIX_NODE_HEAD_READER;
 	int needlock;
 
 	KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
@@ -799,6 +803,7 @@ rtredirect_fib(struct sockaddr *dst,
 	struct rt_addrinfo info;
 	struct ifaddr *ifa;
 	struct radix_node_head *rnh;
+	RADIX_NODE_HEAD_READER;
 
 	ifa = NULL;
 	rnh = rt_tables_get_rnh(fibnum, dst->sa_family);
diff --git a/sys/net/rtsock.c b/sys/net/rtsock.c
index 58c46a6..18d3e06 100644
--- a/sys/net/rtsock.c
+++ b/sys/net/rtsock.c
@@ -45,6 +45,7 @@
 #include <sys/priv.h>
 #include <sys/proc.h>
 #include <sys/protosw.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/signalvar.h>
 #include <sys/socket.h>
@@ -577,6 +578,7 @@ route_output(struct mbuf *m, struct socket *so)
 	struct ifnet *ifp = NULL;
 	union sockaddr_union saun;
 	sa_family_t saf = AF_UNSPEC;
+	RADIX_NODE_HEAD_READER;
 
 #define senderr(e) { error = e; goto flush;}
 	if (m == NULL || ((m->m_len < sizeof(long)) &&
@@ -1818,6 +1820,7 @@ sysctl_rtsock(SYSCTL_HANDLER_ARGS)
 	int	i, lim, error = EINVAL;
 	u_char	af;
 	struct	walkarg w;
+	RADIX_NODE_HEAD_READER;
 
 	name ++;
 	namelen--;
diff --git a/sys/netinet/in_rmx.c b/sys/netinet/in_rmx.c
index 1c9d9db..775ba5a 100644
--- a/sys/netinet/in_rmx.c
+++ b/sys/netinet/in_rmx.c
@@ -53,6 +53,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/callout.h>
 
 #include <net/if.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <net/route.h>
 #include <net/vnet.h>
 
diff --git a/sys/netinet6/in6_ifattach.c b/sys/netinet6/in6_ifattach.c
index 80eb022..cbfe1d8 100644
--- a/sys/netinet6/in6_ifattach.c
+++ b/sys/netinet6/in6_ifattach.c
@@ -42,6 +42,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/proc.h>
 #include <sys/syslog.h>
 #include <sys/md5.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 
 #include <net/if.h>
 #include <net/if_dl.h>
diff --git a/sys/netinet6/in6_rmx.c b/sys/netinet6/in6_rmx.c
index 9aabe63..a291db2 100644
--- a/sys/netinet6/in6_rmx.c
+++ b/sys/netinet6/in6_rmx.c
@@ -84,6 +84,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/socket.h>
 #include <sys/socketvar.h>
 #include <sys/mbuf.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/syslog.h>
 #include <sys/callout.h>
diff --git a/sys/netinet6/nd6_rtr.c b/sys/netinet6/nd6_rtr.c
index 687d84d..7737d47 100644
--- a/sys/netinet6/nd6_rtr.c
+++ b/sys/netinet6/nd6_rtr.c
@@ -45,6 +45,7 @@ __FBSDID("$FreeBSD: stable/8/sys/netinet6/nd6_rtr.c 233201 2012-03-19 20:49:42Z
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/errno.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/syslog.h>
 #include <sys/queue.h>

--------------010308000904000207080306
Content-Type: text/plain; charset=UTF-8;
 name="11_no_lle_rlock.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="11_no_lle_rlock.diff"

commit 963196095589c03880ddd13a5c16f9e50cf6d7ce
Author: Charlie Root <root@test15.yandex.net>
Date:   Sun Nov 4 15:52:50 2012 +0000

    Do not require locking arp lle

diff --git a/sys/net/if_llatbl.h b/sys/net/if_llatbl.h
index 9f6531b..c1b2af9 100644
--- a/sys/net/if_llatbl.h
+++ b/sys/net/if_llatbl.h
@@ -169,6 +169,7 @@ MALLOC_DECLARE(M_LLTABLE);
 #define	LLE_PUB		0x0020	/* publish entry ??? */
 #define	LLE_DELETE	0x4000	/* delete on a lookup - match LLE_IFADDR */
 #define	LLE_CREATE	0x8000	/* create on a lookup miss */
+#define	LLE_UNLOCKED	0x1000	/* return lle unlocked */
 #define	LLE_EXCLUSIVE	0x2000	/* return lle xlocked  */
 
 #define LLATBL_HASH(key, mask) \
diff --git a/sys/netinet/if_ether.c b/sys/netinet/if_ether.c
index f61b803..ecb9b8e 100644
--- a/sys/netinet/if_ether.c
+++ b/sys/netinet/if_ether.c
@@ -283,10 +283,10 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0, struct mbuf *m,
 	struct sockaddr *dst, u_char *desten, struct llentry **lle)
 {
 	struct llentry *la = 0;
-	u_int flags = 0;
+	u_int flags = LLE_UNLOCKED;
 	struct mbuf *curr = NULL;
 	struct mbuf *next = NULL;
-	int error, renew;
+	int error, renew = 0;
 
 	*lle = NULL;
 	if (m != NULL) {
@@ -307,7 +307,41 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0, struct mbuf *m,
 retry:
 	IF_AFDATA_RLOCK(ifp);	
 	la = lla_lookup(LLTABLE(ifp), flags, dst);
+
+	/*
+	 * Fast path. Do not require rlock on llentry.
+	 */
+	if ((la != NULL) && (flags & LLE_UNLOCKED)) {
+		if ((la->la_flags & LLE_VALID) &&
+		    ((la->la_flags & LLE_STATIC) || la->la_expire > time_uptime)) {
+			bcopy(&la->ll_addr, desten, ifp->if_addrlen);
+			/*
+			 * If entry has an expiry time and it is approaching,
+			 * see if we need to send an ARP request within this
+			 * arpt_down interval.
+			 */
+			if (!(la->la_flags & LLE_STATIC) &&
+			    time_uptime + la->la_preempt > la->la_expire) {
+				renew = 1;
+				la->la_preempt--;
+			}
+
+			IF_AFDATA_RUNLOCK(ifp);
+			if (renew != 0)
+				arprequest(ifp, NULL, &SIN(dst)->sin_addr, NULL);
+
+			return (0);
+		}
+
+		/* Revert to normal path for other cases */
+		*lle = la;
+		LLE_RLOCK(la);
+	}
+
+	flags &= ~LLE_UNLOCKED;
+
 	IF_AFDATA_RUNLOCK(ifp);	
+
 	if ((la == NULL) && ((flags & LLE_EXCLUSIVE) == 0)
 	    && ((ifp->if_flags & (IFF_NOARP | IFF_STATICARP)) == 0)) {		
 		flags |= (LLE_CREATE | LLE_EXCLUSIVE);
@@ -324,27 +358,6 @@ retry:
 		return (EINVAL);
 	} 
 
-	if ((la->la_flags & LLE_VALID) &&
-	    ((la->la_flags & LLE_STATIC) || la->la_expire > time_second)) {
-		bcopy(&la->ll_addr, desten, ifp->if_addrlen);
-		/*
-		 * If entry has an expiry time and it is approaching,
-		 * see if we need to send an ARP request within this
-		 * arpt_down interval.
-		 */
-		if (!(la->la_flags & LLE_STATIC) &&
-		    time_second + la->la_preempt > la->la_expire) {
-			arprequest(ifp, NULL,
-			    &SIN(dst)->sin_addr, IF_LLADDR(ifp));
-
-			la->la_preempt--;
-		}
-		
-		*lle = la;
-		error = 0;
-		goto done;
-	} 
-			    
 	if (la->la_flags & LLE_STATIC) {   /* should not happen! */
 		log(LOG_DEBUG, "arpresolve: ouch, empty static llinfo for %s\n",
 		    inet_ntoa(SIN(dst)->sin_addr));
diff --git a/sys/netinet/in.c b/sys/netinet/in.c
index eaba4e5..5341918 100644
--- a/sys/netinet/in.c
+++ b/sys/netinet/in.c
@@ -1561,7 +1561,7 @@ in_lltable_lookup(struct lltable *llt, u_int flags, const struct sockaddr *l3add
 	if (LLE_IS_VALID(lle)) {
 		if (flags & LLE_EXCLUSIVE)
 			LLE_WLOCK(lle);
-		else
+		else if (!(flags & LLE_UNLOCKED))
 			LLE_RLOCK(lle);
 	}
 done:

--------------010308000904000207080306--

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 28 19:37:12 2013
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id BDB86FF7;
 Wed, 28 Aug 2013 19:37:12 +0000 (UTC)
 (envelope-from jfvogel@gmail.com)
Received: from mail-ve0-x235.google.com (mail-ve0-x235.google.com
 [IPv6:2607:f8b0:400c:c01::235])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id DEA2221E4;
 Wed, 28 Aug 2013 19:37:11 +0000 (UTC)
Received: by mail-ve0-f181.google.com with SMTP id jz10so4684066veb.12
 for <multiple recipients>; Wed, 28 Aug 2013 12:37:11 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:in-reply-to:references:date:message-id:subject:from:to
 :cc:content-type;
 bh=dxnKa8hM+AefrXStvsj3F1WeY+ZAph3Wgx8EEc4Juck=;
 b=nDn2BH6n6WYTCIyZd96zHN8qhwNcQ0pJVWb4lKM7AK7rcBXSJRUP5XpTA3vnpkd5Qz
 5uPda4ecEq7EqxRDWBYRA4Schfm5gAeoGby4K41DMofd8RpKrdlnWj+7ZT8JND8Hedh5
 tAPciWe84X9MKbEc4HINMV7Yku+OAZn2/zgpeaye7vPXzAsGnHwUEpFOcWWPTR05qs43
 8usgrgTfZx3ua2xF9o1tCACdJTt7edXUX0o9mGvYSsaCSTDTJOTVjltYPOmw716R9Fwl
 6k+XnaZw8cUQmXH6qpoWM4ijNVJEv5LgiNCZ6A6JWdDuv0U76pIIt5Z0f7hkNytI40AL
 Phew==
MIME-Version: 1.0
X-Received: by 10.58.235.69 with SMTP id uk5mr27194246vec.17.1377718630983;
 Wed, 28 Aug 2013 12:37:10 -0700 (PDT)
Received: by 10.220.159.141 with HTTP; Wed, 28 Aug 2013 12:37:10 -0700 (PDT)
In-Reply-To: <521E41CB.30700@yandex-team.ru>
References: <521E41CB.30700@yandex-team.ru>
Date: Wed, 28 Aug 2013 12:37:10 -0700
Message-ID: <CAFOYbcnbcp4z60SeDXTQ+acPGC55DCYfhZZuRvHvu7HhyWTang@mail.gmail.com>
Subject: Re: Network stack changes
From: Jack Vogel <jfvogel@gmail.com>
To: "Alexander V. Chernikov" <melifaro@yandex-team.ru>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: Adrian Chadd <adrian@freebsd.org>, Andre Oppermann <andre@freebsd.org>,
 FreeBSD Hackers <freebsd-hackers@freebsd.org>, FreeBSD Net <net@freebsd.org>,
 Luigi Rizzo <luigi@freebsd.org>, "Andrey V. Elsukov" <ae@freebsd.org>,
 Gleb Smirnoff <glebius@freebsd.org>, freebsd-arch@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 28 Aug 2013 19:37:12 -0000

Very interesting material Alexander, only had time to glance at it now,
will look in more
depth later, thanks!

Jack


On Wed, Aug 28, 2013 at 11:30 AM, Alexander V. Chernikov <
melifaro@yandex-team.ru> wrote:

> Hello list!
>
> There is a lot constantly raising  discussions related to networking stack
> performance/changes.
>
> I'll try to summarize current problems and possible solutions from my
> point of view.
> (Generally this is one problem: stack is slooooooooooooooooooooooooooow**,
> but we need to know why and what to do).
>
> Let's start with current IPv4 packet flow on a typical router:
> http://static.ipfw.ru/images/**freebsd_ipv4_flow.png<http://static.ipfw.ru/images/freebsd_ipv4_flow.png>
>
> (I'm sorry I can't provide this as text since Visio don't have any
> 'ascii-art' exporter).
>
> Note that we are using process-to-completion model, e.g. process any
> packet in ISR until it is either
> consumed by L4+ stack or dropped or put to egress NIC queue.
>
> (There is also deferred ISR model implemented inside netisr but it does
> not change much:
> it can help to do more fine-grained hashing (for GRE or other similar
> traffic), but
> 1) it uses per-packet mutex locking which kills all performance
> 2) it currently does not have _any_ hashing functions (see absence of
> flags in `netstat -Q`)
> People using http://static.ipfw.ru/patches/**netisr_ip_flowid.diff<http://static.ipfw.ru/patches/netisr_ip_flowid.diff>(or modified PPPoe/GRE version)
> report some profit, but without fixing (1) it can't help much
> )
>
> So, let's start:
>
> 1) Ixgbe uses mutex to protect each RX ring which is perfectly fine since
> there is nearly no contention
> (the only thing that can happen is driver reconfiguration which is rare
> and, more signifficant, we do this once
> for the batch of packets received in given interrupt). However, due to
> some (im)possible deadlocks current code
> does per-packet ring unlock/lock (see ixgbe_rx_input()).
> There was a discussion ended with nothing: http://lists.freebsd.org/**
> pipermail/freebsd-net/2012-**October/033520.html<http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html>
>
> 1*) Possible BPF users. Here we have one rlock if there are any readers
> present
> (and mutex for any matching packets, but this is more or less OK.
> Additionally, there is WIP to implement multiqueue BPF
> and there is chance that we can reduce lock contention there). There is
> also an "optimize_writers" hack permitting applications
> like CDP to use BPF as writers but not registering them as receivers
> (which implies rlock)
>
> 2/3) Virtual interfaces (laggs/vlans over lagg and other simular
> constructions).
> Currently we simply use rlock to make s/ix0/lagg0/ and, what is much more
> funny - we use complex vlan_hash with another rlock to
> get vlan interface from underlying one.
>
> This is definitely not like things should be done and this can be changed
> more or less easily.
>
> There are some useful terms/techniques in world of software/hardware
> routing: they have clear 'control plane' and 'data plane' separation.
> Former one is for dealing control traffic (IGP, MLD, IGMP snooping, lagg
> hellos, ARP/NDP, etc..) and some data traffic (packets with TTL=1, with
> options, destined to hosts without ARP/NDP record, and similar). Latter one
> is done in hardware (or effective software implementation).
> Control plane is responsible to provide data for efficient data plane
> operations. This is the point we are missing nearly everywhere.
>
> What I want to say is: lagg is pure control-plane stuff and vlan is nearly
> the same. We can't apply this approach to complex cases like
> lagg-over-vlans-over-vlans-**over-(pppoe_ng0-and_wifi0)
> but we definitely can do this for most common setups like (igb* or ix* in
> lagg with or without vlans on top of lagg).
>
> We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can add
> some more. We even have per-driver hooks to program HW filtering.
>
> One small step to do is to throw packet to vlan interface directly (P1),
> proof-of-concept(working in production):
> http://lists.freebsd.org/**pipermail/freebsd-net/2013-**April/035270.html<http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html>
>
> Another is to change lagg packet accounting: http://lists.freebsd.org/**
> pipermail/svn-src-all/2013-**April/067570.html<http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html>
> Again, this is more like HW boxes do (aggregate all counters including
> errors) (and I can't imagine what real error we can get from _lagg_).
>
> 4) If we are router, we can do either slooow ip_input() -> ip_forward() ->
> ip_output() cycle or use optimized ip_fastfwd() which falls back to 'slow'
> path for multicast/options/local traffic (e.g. works exactly like 'data
> plane' part).
> (Btw, we can consider net.inet.ip.fastforwarding to be turned on by
> default at least for non-IPSEC kernels)
>
> Here we have to determine if this is local packet or not, e.g. F(dst_ip)
> returning 1 or 0. Currently we are simply using standard rlock + hash of
> iface addresses.
> (And some consumers like ipfw(4) do the same, but without lock).
> We don't need to do this! We can build sorted array of IPv4 addresses or
> other efficient structure on every address change and use it unlocked with
> delayed garbage collection (proof-of-concept attached)
> (There is another thing to discuss: maybe we can do this once somewhere in
> ip_input and mark mbuf as 'local/non-local' ? )
>
> 5, 9) Currently we have L3 ingress/egress PFIL hooks protected by rmlocks.
> This is OK.
>
> However, 6) and 7) are not.
> Firewall can use the same pfil lock as reader protection without imposing
> its own lock. currently pfil&ipfw code is ready to do this.
>
> 8) Radix/rt* api. This is probably the worst place in entire stack. It is
> toooo generic, tooo slow and buggy (do you use IPv6? you definitely know
> what I'm talking about).
> A) It really is too generic and assumption that it can be (effectively)
> used for every family is wrong. Two examples:
> we don't need to lookup all 128 bits of IPv6 address. Subnets with mask
> >/64 are not used widely (actually the only reason to use them are p2p
> links due to ND potential problems).
> One of common solutions is to lookup 64bits, and build another trie (or
> other structure) in case of collision.
> Another example is MPLS where we can simply do direct array lookup based
> on ingress label.
>
> B) It is terribly slow (AFAIR luigi@ did some performance management,
> numbers available in one of netmap pdfs)
> C) It is not multipath-capable. Stateful (and non-working) multipath is
> definitely not the right way.
>
> 8*) rtentry
> We are doing it wrong.
> Currently _every_ lookup locks/unlocks given rte twice.
> First lock is related to and old-old story for trusting IP redirects (and
> auto-adding host routes for them). Hopefully currently it is disabled
> automatically when you turn forwarding on.
> The second one is much more complicated: we are assuming that rte's with
> non-zero refcount value can stop egress interface from being destroyed.
> This is wrong (but widely used) assumption.
>
> We can use delayed GC instead of locking for rte's and this won't break
> things more than they are broken now (patch attached).
> We can't do the same for ifp structures since
> a) virtual ones can assume some state in underlying physical NIC
> b) physical ones just _can_ be destroyed (maybe regardless of user wants
> this or not, like: SFP being unplugged from NIC) or simply lead to kernel
> crash due to SW/HW inconsistency
>
> One of possible solution is to implement stable refcounts based on PCPU
> counters, and apply thos counters to ifp, but seem to be non-trivial.
>
>
> Another rtalloc(9) problem is the fact that radix is used as both 'control
> plane' and 'data plane' structure/api. Some users always want to put more
> information in rte, while others
> want to make rte more compact. We just need _different_ structures for
> that.
> Feature-rich, lot-of-data control plane one (to store everything we want
> to store, including, for example, PID of process originating the route) -
> current radix can be modified to do this.
> And address-family-depended another structure (array, trie, or anything)
> which contains _only_ data necessary to put packet on the wire.
>
> 11) arpresolve. Currently (this was decoupled in 8.x) we have
> a) ifaddr rlock
> b) lle rlock.
>
> We don't need those locks.
> We need to
> a) make lle layer per-interface instead of global (and this can also solve
> multiple fibs and L2 mappings done in fib.0 issue)
> b) use rtalloc(9)-provided lock instead of separate locking
> c) actually, we need to do rewrite this layer because
> d) lle actually is the place to do real multipath:
>
> briefly,
> you have rte pointing to some special nexthop structure pointing to lle,
> which has the following data:
> num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to prepend
> to header
> Separate post will follow.
>
> With the following, we can achieve lagg traffic distribution without
> actually using lagg_transmit and similar stuff (at least in most common
> scenarious)
> (for example, TCP output definitely can benefit from this, since we can
> account flowid once for TCP session and use in in every mbuf)
>
>
> So. Imagine we have done all this. How we can estimate the difference?
>
> There was a thread, started a year ago, describing 'stock' performance and
> difference for various modifications.
> It is done on 8.x, however I've got similar results on recent 9.x
>
> http://lists.freebsd.org/**pipermail/freebsd-net/2012-**July/032680.html<http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html>
>
> Briefly:
>
> 2xE5645 @ Intel 82599 NIC.
> Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
> no firewallIxia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends 64byte
> IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations in
> vlan11 (10.100.1.128 - 10.100.1.192). Static arps are configured for all
> destination addresses. Traffic level is slightly above or slightly below
> system performance.
>
> we start from 1.4MPPS (if we are using several routes to minimize mutex
> contention).
>
> My 'current' result for the same test, on same HW, with the following
> modifications:
>
> * 1) ixgbe per-packet ring unlock removed
> * P1) ixgbe is modified to do direct vlan input (so 2,3 are not used)
> * 4) separate lockless in_localip() version
> * 6) - using existing pfil lock
> * 7) using lockless version
> * 8) radix converted to use rmlock instead of rlock. Delayed GC is used
> instead of mutexes
> * 10) - using existing pfil lock
> * 11) using radix lock to do arpresolve(). Not using lle rlock
>
> (so the rmlocks are the only locks used on data path).
>
> Additionally, ipstat counters are converted to PCPU (no real performance
> implications).
> ixgbe does not do per-packet accounting (as in head).
> if_vlan counters are converted to PCPU
> lagg is converted to rmlock, per-packet accounting is removed (using stat
> from underlying interfaces)
> lle hash size is bumped to 1024 instead of 32 (not applicable here, but
> slows things down for large L2 domains)
>
> The result is 5.6 MPPS for single port (11 cores) and 6.5MPPS for lagg (16
> cores), nearly the same for HT on and 22 cores.
>
> ..
> while Intel DPDK claims 80MPPS (and 6windgate talks about 160 or so) on
> the same-class hardware and _userland_ forwarding.
>
> One of key features making all such products possible (DPDK, netmap,
> packetshader, Cisco SW forwarding) - is use of batching instead of
> process-to-completion model.
> Batching mitigates locking cost, batching does not wash out CPU cache, and
> so on.
>
> So maybe we can consider passing batches from NIC to at least L2 layer
> with netisr? or even up to ip_input() ?
>
> Another question is about making some sort of reliable GC like ("passive
> serialization" or other similar not-to-pronounce-words about Linux and
> lockless objects).
>
>
> P.S. Attached patches are 1) for 8.x 2) mostly 'hacks' showing roughly how
> can this be done and what benefit can be achieved.
>
>
>
>
>
>
>
>
>
>
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"
>

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 28 22:25:04 2013
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 84C03D3D
 for <freebsd-arch@freebsd.org>; Wed, 28 Aug 2013 22:25:04 +0000 (UTC)
 (envelope-from andre@freebsd.org)
Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 7E4412CCF
 for <freebsd-arch@freebsd.org>; Wed, 28 Aug 2013 22:25:03 +0000 (UTC)
Received: (qmail 22174 invoked from network); 28 Aug 2013 23:06:41 -0000
Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2])
 (envelope-sender <andre@freebsd.org>)
 by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP
 for <melifaro@yandex-team.ru>; 28 Aug 2013 23:06:41 -0000
Message-ID: <521E78B0.6080709@freebsd.org>
Date: Thu, 29 Aug 2013 00:24:48 +0200
From: Andre Oppermann <andre@freebsd.org>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:17.0) Gecko/20130801 Thunderbird/17.0.8
MIME-Version: 1.0
To: "Alexander V. Chernikov" <melifaro@yandex-team.ru>
Subject: Re: Network stack changes
References: <521E41CB.30700@yandex-team.ru>
In-Reply-To: <521E41CB.30700@yandex-team.ru>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: adrian@freebsd.org, freebsd-hackers@freebsd.org,
 FreeBSD Net <net@freebsd.org>, luigi@freebsd.org, ae@FreeBSD.org,
 Gleb Smirnoff <glebius@FreeBSD.org>, freebsd-arch@freebsd.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 28 Aug 2013 22:25:04 -0000

On 28.08.2013 20:30, Alexander V. Chernikov wrote:
> Hello list!

Hello Alexander,

you sent quite a few things in the same email.  I'll try to respond
as much as I can right now.  Later you should split it up to have
more in-depth discussions on the individual parts.

If you could make it to the EuroBSDcon 2013 DevSummit that would be
even more awesome.  Most of the active network stack people will be
there too.

> There is a lot constantly raising  discussions related to networking stack performance/changes.
>
> I'll try to summarize current problems and possible solutions from my point of view.
> (Generally this is one problem: stack is slooooooooooooooooooooooooooow, but we need to know why and
> what to do).

Compared to others its not thaaaaaaat slow. ;)

> Let's start with current IPv4 packet flow on a typical router:
> http://static.ipfw.ru/images/freebsd_ipv4_flow.png
>
> (I'm sorry I can't provide this as text since Visio don't have any 'ascii-art' exporter).
>
> Note that we are using process-to-completion model, e.g. process any packet in ISR until it is either
> consumed by L4+ stack or dropped or put to egress NIC queue.
>
> (There is also deferred ISR model implemented inside netisr but it does not change much:
> it can help to do more fine-grained hashing (for GRE or other similar traffic), but
> 1) it uses per-packet mutex locking which kills all performance
> 2) it currently does not have _any_ hashing functions (see absence of flags in `netstat -Q`)
> People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or modified PPPoe/GRE version)
> report some profit, but without fixing (1) it can't help much
> )
>
> So, let's start:
>
> 1) Ixgbe uses mutex to protect each RX ring which is perfectly fine since there is nearly no contention
> (the only thing that can happen is driver reconfiguration which is rare and, more signifficant, we
> do this once
> for the batch of packets received in given interrupt). However, due to some (im)possible deadlocks
> current code
> does per-packet ring unlock/lock (see ixgbe_rx_input()).
> There was a discussion ended with nothing:
> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
>
> 1*) Possible BPF users. Here we have one rlock if there are any readers present
> (and mutex for any matching packets, but this is more or less OK. Additionally, there is WIP to
> implement multiqueue BPF
> and there is chance that we can reduce lock contention there).

Rlock to rmlock?

> There is also an "optimize_writers" hack permitting applications
> like CDP to use BPF as writers but not registering them as receivers (which implies rlock)

I believe longer term we should solve this with a protocol type "ethernet"
so that one can send/receive ethernet frames through a normal socket.

> 2/3) Virtual interfaces (laggs/vlans over lagg and other simular constructions).
> Currently we simply use rlock to make s/ix0/lagg0/ and, what is much more funny - we use complex
> vlan_hash with another rlock to
> get vlan interface from underlying one.
>
> This is definitely not like things should be done and this can be changed more or less easily.

Indeed.

> There are some useful terms/techniques in world of software/hardware routing: they have clear
> 'control plane' and 'data plane' separation.
> Former one is for dealing control traffic (IGP, MLD, IGMP snooping, lagg hellos, ARP/NDP, etc..) and
> some data traffic (packets with TTL=1, with options, destined to hosts without ARP/NDP record, and
> similar). Latter one is done in hardware (or effective software implementation).
> Control plane is responsible to provide data for efficient data plane operations. This is the point
> we are missing nearly everywhere.

ACK.

> What I want to say is: lagg is pure control-plane stuff and vlan is nearly the same. We can't apply
> this approach to complex cases like lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
> but we definitely can do this for most common setups like (igb* or ix* in lagg with or without vlans
> on top of lagg).

ACK.

> We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can add some more. We even have
> per-driver hooks to program HW filtering.

We could.  Though for vlan it looks like it would be easier to remove the
hardware vlan tag stripping and insertion.  It only adds complexity in all
drivers for no gain.

> One small step to do is to throw packet to vlan interface directly (P1), proof-of-concept(working in
> production):
> http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
>
> Another is to change lagg packet accounting:
> http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
> Again, this is more like HW boxes do (aggregate all counters including errors) (and I can't imagine
> what real error we can get from _lagg_).
 >
> 4) If we are router, we can do either slooow ip_input() -> ip_forward() -> ip_output() cycle or use
> optimized ip_fastfwd() which falls back to 'slow' path for multicast/options/local traffic (e.g.
> works exactly like 'data plane' part).
> (Btw, we can consider net.inet.ip.fastforwarding to be turned on by default at least for non-IPSEC
> kernels)

ACK.

> Here we have to determine if this is local packet or not, e.g. F(dst_ip) returning 1 or 0. Currently
> we are simply using standard rlock + hash of iface addresses.
> (And some consumers like ipfw(4) do the same, but without lock).
> We don't need to do this! We can build sorted array of IPv4 addresses or other efficient structure
> on every address change and use it unlocked with delayed garbage collection (proof-of-concept attached)

I'm a bit uneasy with unlocked access.  On very weakly ordered architectures
this could trip over cache coherency issues.  A rmlock is essentially for free
in the read case.

> (There is another thing to discuss: maybe we can do this once somewhere in ip_input and mark mbuf as
> 'local/non-local' ? )

The problem is packet filters may change the destination address and thus
can invalidate such a lookup.

> 5, 9) Currently we have L3 ingress/egress PFIL hooks protected by rmlocks. This is OK.
>
> However, 6) and 7) are not.
> Firewall can use the same pfil lock as reader protection without imposing its own lock. currently
> pfil&ipfw code is ready to do this.

The problem with the global pfil rmlock is the comparatively long time it
is held in a locked state.  Also packet filters may have to acquire additional
locks when they have to modify state tables.  Rmlocks are not made for that
because they pin the thread to the cpu they're currently on.  This is what
Gleb is complaining about.

My idea is to hold the pfil rmlock only for the lookup of the first/next
packet filter that will run, not for the entire duration.  That would solve
the problem.  However packets filter then have to use their own locks again,
which could be rmlock too.

> 8) Radix/rt* api. This is probably the worst place in entire stack. It is toooo generic, tooo slow
> and buggy (do you use IPv6? you definitely know what I'm talking about).
> A) It really is too generic and assumption that it can be (effectively) used for every family is
> wrong. Two examples:
> we don't need to lookup all 128 bits of IPv6 address. Subnets with mask >/64 are not used widely
> (actually the only reason to use them are p2p links due to ND potential problems).
> One of common solutions is to lookup 64bits, and build another trie (or other structure) in case of
> collision.
> Another example is MPLS where we can simply do direct array lookup based on ingress label.

Yes.  While we shouldn't throw it out, it should be run as RIB and
allow a much more protocol specific FIB for the hot packet path.

> B) It is terribly slow (AFAIR luigi@ did some performance management, numbers available in one of
> netmap pdfs)

Again not thaaaat slow but inefficient enough.

> C) It is not multipath-capable. Stateful (and non-working) multipath is definitely not the right way.

Indeed.

> 8*) rtentry
> We are doing it wrong.
> Currently _every_ lookup locks/unlocks given rte twice.
> First lock is related to and old-old story for trusting IP redirects (and auto-adding host routes
> for them). Hopefully currently it is disabled automatically when you turn forwarding on.

They're disabled.

> The second one is much more complicated: we are assuming that rte's with non-zero refcount value can
> stop egress interface from being destroyed.
> This is wrong (but widely used) assumption.

Not really.  The reason for the refcount is not the ifp reference but
other code parts that may hold direct pointers to the rtentry and do
direct dereferencing to access information in it.

> We can use delayed GC instead of locking for rte's and this won't break things more than they are
> broken now (patch attached).

Nope.  Delayed GC is not the way to go here.  To do away with rtentry
locking and refcounting we have change rtalloc(9) to return the information
the caller wants (e.g. ifp, ia, others) and not the rtentry address anymore.
So instead of rtalloc() we have rtlookup().

> We can't do the same for ifp structures since
> a) virtual ones can assume some state in underlying physical NIC
> b) physical ones just _can_ be destroyed (maybe regardless of user wants this or not, like: SFP
> being unplugged from NIC) or simply lead to kernel crash due to SW/HW inconsistency

Here I actually believe we can do a GC or stable storage based approach.
Ifp pointers are kept in too many places and properly refcounting it is
very (too) hard.  So whenever an interface gets destroyed or disappears
it's callable function pointers are replaced with dummies returning an
error.  The ifp in memory will stay for some time and even may be reused
for another new interface later again (Cisco does it that way in their IOS).

> One of possible solution is to implement stable refcounts based on PCPU counters, and apply thos
> counters to ifp, but seem to be non-trivial.
>
>
> Another rtalloc(9) problem is the fact that radix is used as both 'control plane' and 'data plane'
> structure/api. Some users always want to put more information in rte, while others
> want to make rte more compact. We just need _different_ structures for that.

ACK.

> Feature-rich, lot-of-data control plane one (to store everything we want to store, including, for
> example, PID of process originating the route) - current radix can be modified to do this.
> And address-family-depended another structure (array, trie, or anything) which contains _only_ data
> necessary to put packet on the wire.

ACK.

> 11) arpresolve. Currently (this was decoupled in 8.x) we have
> a) ifaddr rlock
> b) lle rlock.
>
> We don't need those locks.
> We need to
> a) make lle layer per-interface instead of global (and this can also solve multiple fibs and L2
> mappings done in fib.0 issue)

Yes!

> b) use rtalloc(9)-provided lock instead of separate locking

No.  Interface rmlock.

> c) actually, we need to do rewrite this layer because
> d) lle actually is the place to do real multipath:

No, you can do multipath through more than one interface.  If lle is
per interface that wont work and is not the right place.

> briefly,
> you have rte pointing to some special nexthop structure pointing to lle, which has the following data:
> num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to prepend to header
> Separate post will follow.

This should be part of the RIB/FIB and select on of the ifp+nexthops
to return on lookup.

> With the following, we can achieve lagg traffic distribution without actually using lagg_transmit
> and similar stuff (at least in most common scenarious)

This seems to be a rather nasty layering violation.

> (for example, TCP output definitely can benefit from this, since we can account flowid once for TCP
> session and use in in every mbuf)
 >
> So. Imagine we have done all this. How we can estimate the difference?
>
> There was a thread, started a year ago, describing 'stock' performance and difference for various
> modifications.
> It is done on 8.x, however I've got similar results on recent 9.x
>
> http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html
>
> Briefly:
>
> 2xE5645 @ Intel 82599 NIC.
> Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, no firewallIxia XM2
> (traffic generator) <> ix0 (FreeBSD). Ixia sends 64byte IP packets from vlan10 (10.100.0.64 -
> 10.100.0.156) to destinations in vlan11 (10.100.1.128 - 10.100.1.192). Static arps are configured
> for all destination addresses. Traffic level is slightly above or slightly below system performance.
>
> we start from 1.4MPPS (if we are using several routes to minimize mutex contention).
>
> My 'current' result for the same test, on same HW, with the following modifications:
>
> * 1) ixgbe per-packet ring unlock removed
> * P1) ixgbe is modified to do direct vlan input (so 2,3 are not used)
> * 4) separate lockless in_localip() version
> * 6) - using existing pfil lock
> * 7) using lockless version
> * 8) radix converted to use rmlock instead of rlock. Delayed GC is used instead of mutexes
> * 10) - using existing pfil lock
> * 11) using radix lock to do arpresolve(). Not using lle rlock
>
> (so the rmlocks are the only locks used on data path).
>
> Additionally, ipstat counters are converted to PCPU (no real performance implications).
> ixgbe does not do per-packet accounting (as in head).
> if_vlan counters are converted to PCPU
> lagg is converted to rmlock, per-packet accounting is removed (using stat from underlying interfaces)
> lle hash size is bumped to 1024 instead of 32 (not applicable here, but slows things down for large
> L2 domains)
>
> The result is 5.6 MPPS for single port (11 cores) and 6.5MPPS for lagg (16 cores), nearly the same
> for HT on and 22 cores.

That's quite good, but we want more. ;)

> ..
> while Intel DPDK claims 80MPPS (and 6windgate talks about 160 or so) on the same-class hardware and
> _userland_ forwarding.

Those numbers sound a bit far out.  Maybe if the packet isn't touched
or looked at at all in a pure netmap interface to interface bridging
scenario.  I don't believe these numbers.

> One of key features making all such products possible (DPDK, netmap, packetshader, Cisco SW
> forwarding) - is use of batching instead of process-to-completion model.
> Batching mitigates locking cost, batching does not wash out CPU cache, and so on.

The work has to be done eventually.  Batching doesn't relieve from it.
IMHO batch moving is only the last step would should look at.  It makes
the stack rather complicated and introduces other issues like packet
latency.

> So maybe we can consider passing batches from NIC to at least L2 layer with netisr? or even up to
> ip_input() ?

And then?  You probably won't win much in the end (if the lock path
is optimized).

> Another question is about making some sort of reliable GC like ("passive serialization" or other
> similar not-to-pronounce-words about Linux and lockless objects).

Rmlocks are our secret weapon and just as good.

> P.S. Attached patches are 1) for 8.x 2) mostly 'hacks' showing roughly how can this be done and what
> benefit can be achieved.

-- 
Andre


From owner-freebsd-arch@FreeBSD.ORG  Thu Aug 29 01:30:37 2013
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id A19FC88D;
 Thu, 29 Aug 2013 01:30:37 +0000 (UTC) (envelope-from slw@zxy.spb.ru)
Received: from zxy.spb.ru (zxy.spb.ru [195.70.199.98])
 by mx1.freebsd.org (Postfix) with ESMTP id 5B62426E5;
 Thu, 29 Aug 2013 01:30:37 +0000 (UTC)
Received: from slw by zxy.spb.ru with local (Exim 4.69 (FreeBSD))
 (envelope-from <slw@zxy.spb.ru>)
 id 1VEr6H-000In5-Q7; Thu, 29 Aug 2013 05:32:41 +0400
Date: Thu, 29 Aug 2013 05:32:41 +0400
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Andre Oppermann <andre@freebsd.org>
Subject: Re: Network stack changes
Message-ID: <20130829013241.GB70584@zxy.spb.ru>
References: <521E41CB.30700@yandex-team.ru>
 <521E78B0.6080709@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <521E78B0.6080709@freebsd.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-SA-Exim-Connect-IP: <locally generated>
X-SA-Exim-Mail-From: slw@zxy.spb.ru
X-SA-Exim-Scanned: No (on zxy.spb.ru); SAEximRunCond expanded to false
Cc: "Alexander V. Chernikov" <melifaro@yandex-team.ru>, adrian@freebsd.org,
 freebsd-hackers@freebsd.org, freebsd-arch@freebsd.org, luigi@freebsd.org,
 ae@FreeBSD.org, Gleb Smirnoff <glebius@FreeBSD.org>,
 FreeBSD Net <net@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 29 Aug 2013 01:30:37 -0000

On Thu, Aug 29, 2013 at 12:24:48AM +0200, Andre Oppermann wrote:

> > ..
> > while Intel DPDK claims 80MPPS (and 6windgate talks about 160 or so) on the same-class hardware and
> > _userland_ forwarding.
> 
> Those numbers sound a bit far out.  Maybe if the packet isn't touched
> or looked at at all in a pure netmap interface to interface bridging
> scenario.  I don't believe these numbers.

80*64*8 = 40.960 Gb/s
May be DCA? And use CPU with 40 PCIe lane and 4 memory chanell.

From owner-freebsd-arch@FreeBSD.ORG  Thu Aug 29 06:46:54 2013
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 0F26FE7D;
 Thu, 29 Aug 2013 06:46:54 +0000 (UTC)
 (envelope-from bryanv@daemoninthecloset.org)
Received: from torment.daemoninthecloset.org (torment.daemoninthecloset.org
 [94.242.209.234])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id C098F2912;
 Thu, 29 Aug 2013 06:46:53 +0000 (UTC)
Received: from sage.daemoninthecloset.org (unknown [70.114.209.60])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client CN "sage.daemoninthecloset.org",
 Issuer "daemoninthecloset.org" (verified OK))
 by torment.daemoninthecloset.org (Postfix) with ESMTPS id DFBE342C08C6;
 Thu, 29 Aug 2013 08:52:03 +0200 (CEST)
X-Virus-Scanned: amavisd-new at daemoninthecloset.org
X-Virus-Scanned: amavisd-new at daemoninthecloset.org
Date: Thu, 29 Aug 2013 01:46:32 -0500 (CDT)
From: Bryan Venteicher <bryanv@daemoninthecloset.org>
To: Andre Oppermann <andre@freebsd.org>
Message-ID: <2112475076.435.1377758792082.JavaMail.root@daemoninthecloset.org>
In-Reply-To: <521E78B0.6080709@freebsd.org>
References: <521E41CB.30700@yandex-team.ru> <521E78B0.6080709@freebsd.org>
Subject: Re: Network stack changes
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [192.168.10.20]
X-Mailer: Zimbra 8.0.2_GA_5569 (ZimbraWebClient - GC20
 ([unknown])/8.0.2_GA_5569)
Thread-Topic: Network stack changes
Thread-Index: anDUShTn7iVw7wFEqZDuK6ld/6VXsQ==
Cc: "Alexander V. Chernikov" <melifaro@yandex-team.ru>, adrian@freebsd.org,
 freebsd-hackers@freebsd.org, freebsd-arch@freebsd.org, luigi@freebsd.org,
 ae@FreeBSD.org, Gleb Smirnoff <glebius@FreeBSD.org>,
 FreeBSD Net <net@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 29 Aug 2013 06:46:54 -0000


----- Original Message -----
> On 28.08.2013 20:30, Alexander V. Chernikov wrote:
> > Hello list!
> 
> Hello Alexander,
> 
> you sent quite a few things in the same email.  I'll try to respond
> as much as I can right now.  Later you should split it up to have
> more in-depth discussions on the individual parts.
> 
<trim>
> 
> > We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can add
> > some more. We even have
> > per-driver hooks to program HW filtering.
> 
> We could.  Though for vlan it looks like it would be easier to remove the
> hardware vlan tag stripping and insertion.  It only adds complexity in all
> drivers for no gain.
> 

In the shorter term, can we remove the requirement for the parent
interface to support IFCAP_VLAN_HWTAGGING in order to do checksum
offloading on the VLAN interface (see vlan_capabilities())?


From owner-freebsd-arch@FreeBSD.ORG  Thu Aug 29 11:49:34 2013
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id B9C08243;
 Thu, 29 Aug 2013 11:49:34 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: from mail-wi0-x234.google.com (mail-wi0-x234.google.com
 [IPv6:2a00:1450:400c:c05::234])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 566572FB7;
 Thu, 29 Aug 2013 11:49:33 +0000 (UTC)
Received: by mail-wi0-f180.google.com with SMTP id l12so352069wiv.13
 for <multiple recipients>; Thu, 29 Aug 2013 04:49:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=t5rRGR7p2lccT5dIVAFMMz5iEhB8EN3uCWH29jaGBug=;
 b=jG6EG9SHnUTL+LWD0mt8ZdlyhVyFrBZee9eO0wArZSQi7Kxa4sipEEiBbicH27NlRE
 WCFqBBALUkxOLkfAinjqMBlaV/iJhly1bozkC2JSX40PczqetRoSxgspp1/Uf8S+/7Y/
 SAPOMG5R/RfYBn/5LaIxPpziJpJ8uJvmxiuc1U90ViJZGA7R/XjoJgRyDWubRr53+sIM
 prBz7ivSPp48uUqSxvRc6u09Edy/XM3+hSFHKyMWPMoP/isaPhtr5W6IrGK0lz1Cm4Oc
 SNbYlKGLpk1MD4mzIVxrREDDjMFyu8VVJjqpBIM9jv8oxMF2TcPAFeMlBXnnKrl9XP9A
 PNgw==
MIME-Version: 1.0
X-Received: by 10.194.79.33 with SMTP id g1mr2141120wjx.79.1377776971643; Thu,
 29 Aug 2013 04:49:31 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.216.146.2 with HTTP; Thu, 29 Aug 2013 04:49:31 -0700 (PDT)
In-Reply-To: <521E41CB.30700@yandex-team.ru>
References: <521E41CB.30700@yandex-team.ru>
Date: Thu, 29 Aug 2013 04:49:31 -0700
X-Google-Sender-Auth: fjTZLF4GZ_Hxxlda_cdxncMN6aA
Message-ID: <CAJ-Vmo=N=HnZVCD41ZmDg2GwNnoa-tD0J0QLH80x=f7KA5d+Ug@mail.gmail.com>
Subject: Re: Network stack changes
From: Adrian Chadd <adrian@freebsd.org>
To: "Alexander V. Chernikov" <melifaro@yandex-team.ru>
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: Luigi Rizzo <luigi@freebsd.org>, Andre Oppermann <andre@freebsd.org>,
 "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>,
 FreeBSD Net <net@freebsd.org>, "Andrey V. Elsukov" <ae@freebsd.org>,
 Gleb Smirnoff <glebius@freebsd.org>,
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 29 Aug 2013 11:49:34 -0000

Hi,

There's a lot of good stuff to review here, thanks!

Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to keep
locking things like that on a per-packet basis. We should be able to do
this in a cleaner way - we can defer RX into a CPU pinned taskqueue and
convert the interrupt handler to a fast handler that just schedules that
taskqueue. We can ignore the ithread entirely here.

What do you think?

Totally pie in the sky handwaving at this point:

* create an array of mbuf pointers for completed mbufs;
* populate the mbuf array;
* pass the array up to ether_demux().

For vlan handling, it may end up populating its own list of mbufs to push
up to ether_demux(). So maybe we should extend the API to have a bitmap of
packets to actually handle from the array, so we can pass up a larger array
of mbufs, note which ones are for the destination and then the upcall can
mark which frames its consumed.

I specifically wonder how much work/benefit we may see by doing:

* batching packets into lists so various steps can batch process things
rather than run to completion;
* batching the processing of a list of frames under a single lock instance
- eg, if the forwarding code could do the forwarding lookup for 'n' packets
under a single lock, then pass that list of frames up to inet_pfil_hook()
to do the work under one lock, etc, etc.

Here, the processing would look less like "grab lock and process to
completion" and more like "mark and sweep" - ie, we have a list of frames
that we mark as needing processing and mark as having been processed at
each layer, so we know where to next dispatch them.

I still have some tool coding to do with PMC before I even think about
tinkering with this as I'd like to measure stuff like per-packet latency as
well as top-level processing overhead (ie, CPU_CLK_UNHALTED.THREAD_P /
lagg0 TX bytes/pkts, RX bytes/pkts, NIC interrupts on that core, etc.)

Thanks,


-adrian

From owner-freebsd-arch@FreeBSD.ORG  Fri Aug 30 03:53:25 2013
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTP id 73E5BBC4
 for <freebsd-arch@freebsd.org>; Fri, 30 Aug 2013 03:53:25 +0000 (UTC)
 (envelope-from mailreturn@smtp4.ymlpsrv.net)
Received: from smtp4.ymlpsrv.net (smtp4.ymlpsrv.net [62.213.196.184])
 by mx1.freebsd.org (Postfix) with SMTP id AF3FA2080
 for <freebsd-arch@freebsd.org>; Fri, 30 Aug 2013 03:53:24 +0000 (UTC)
Received: (qmail 12197 invoked by uid 0); 30 Aug 2013 03:53:23 -0000
Date: Fri, 30 Aug 2013 05:53:23 +0200
To: freebsd-arch@freebsd.org
From: Masters Pratumnak <mark@masterspratumnak.com>
Subject: Fully furnished one bedroom condos from under 2M THB
Message-ID: <ed6bfb8085c3412a38a3cd38f4cb925f@smtp4.ymlpsrv.net>
X-YMLPcode: xcv4+2+3642
MIME-Version: 1.0
Content-Type: text/plain; charset = "utf-8"
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
Reply-To: mark@masterspratumnak.com
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 30 Aug 2013 03:53:25 -0000

--------------------------------------------------------------------------=
------
This email newsletter was sent to you in graphical HTML format.
If you're seeing this version, your email program prefers plain text =
emails.
You can read the original version online:
http://ymlp297.net/zcAEV7
--------------------------------------------------------------------------=
------


FULLY FURNISHED CONDOS
NOW WITH 5% DEVELOPER DISCOUNT

* Over 50% Sold =E2=80=93 5 units sold in the last 3 weeks

* Price increase scheduled in just  2 weeks

* Reserve off-plan price now for only 50,000 THB
*  Flexible Payment Options available on request
* Located on Pratumnak Hill, Pattaya, Thailand
* Prices start at 1.8M THB / 58,000 USD / 38,000 GBP

In the last three weeks there has been a huge amount of interest in
the Masters Residence (
http://homesdirect.asia/mastersv3/?utm_source=3DAugust+2&utm_medium=3DEmai=
l&utm_campaign=3DMasters3+YMLP1
). With over 50% sold at pre-launch, the project is proving very
popular, with reservations and sales coming locally and from overseas.

We would like you to have the opportunity to reserve a unit for only
50,000 Thai Baht, this will lock-in the current pre-launch prices.

Simply click on the more information button and enter your details in
the contact form. You can then=C2=A0receive a 5% discount=C2=A0off the =
current
pricelist when your pay the reservation fee. This=C2=A0exclusively online
offer=C2=A0is only available direct through us, the developer.

Mark
Partner, Masters Residence

About Masters Residence
Masters Residence (
http://homesdirect.asia/mastersv3/?utm_source=3DAugust+2&utm_medium=3DEmai=
l&utm_campaign=3DMasters3+YMLP1
) offers an exclusive, pristine and tranqual location like no other on
Soi Regent 2, between Soi 5 & 6 Phratumnak. The roof-top infinity
swimming pool and sun deck will have breathtaking ocean views, while
the atrium and lobby offer a combination of green area in a relaxing
environment. With the wide selection and positioning of one bedroom
units, we are unlike other condos in the area.

* All Condos have Exterior Views.
* Underground Parking
* Spectacular Sea Views
* Atrium with Water Features
* Fully Furnished
* Rooftop Infinity Pool, Sun Deck & Saunas

The level of refinement and luxury that Masters Residence (
http://homesdirect.asia/mastersv3/?utm_source=3DAugust+2&utm_medium=3DEmai=
l&utm_campaign=3DMasters3+YMLP1
) offers its owners and residents, the temptation to stay at home
could be difficult to resist. Those staying within Masters Pattaya can
enjoy the wide range of facilities that the project offers, including
an infinity pool on the roof set among tropical gardens, offering
spectacular views down the hill to the sea and Koh Lan island beyond.

_____________________________
Unsubscribe / Change Profile: http://ymlp297.net/ugbjwwjygsgmhjegegguqquwm
Powered by YourMailingListProvider