Date:        Thu, 03 Jan 2008 01:01:38 +0100
From:        Andre Oppermann <andre@freebsd.org>
To:          "Bruce M. Simpson" <bms@FreeBSD.org>
Cc:          freebsd-net@freebsd.org, Tiffany Snyder <tiffany.snyder@gmail.com>
Subject:     Re: Routing SMP benefit
Message-ID:  <477C25E2.4080303@freebsd.org>
In-Reply-To: <477C1776.2080002@FreeBSD.org>
References:  <43B45EEF.6060800@x-trader.de> <43B47CB5.3C0F1632@freebsd.org> <b63e753b0712281551u52894ed9mb0dd55a988bc9c7a@mail.gmail.com> <477C1434.80106@freebsd.org> <477C1776.2080002@FreeBSD.org>
Bruce M. Simpson wrote:
> Andre Oppermann wrote:
>> So far the PPS rate limit has primarily been the cache miss penalties
>> on the packet access.  Multiple CPUs can help here of course for bi-
>> directional traffic.  Hardware based packet header cache prefetching,
>> as done by some embedded MIPS based network processors, at least
>> doubles the performance.  Intel has something like this for a couple
>> of chipset and network chip combinations.  We don't support that
>> feature yet though.
>
> What sort of work is needed in order to support header prefetch?

Extracting the documentation out of Intel is the first step.  The
feature is called Direct Cache Access (DCA).  At least in the Linux
implementation it has been intermingled with I/OAT, which is an
asynchronous memory controller based DMA copy mechanism.  I don't know
if the two really have to go together.  The idea of DCA is to have the
memory controller, upon DMA'ing a packet into main memory, also load it
into the CPU cache(s) right away.  For packet forwarding the first 128
bytes are sufficient.  For server applications and TCP it may be
beneficial to prefetch the whole packet.  That may cause considerable
cache pollution though, depending on usage.

Some pointers:

http://www.stanford.edu/group/comparch/papers/huggahalli05.pdf
http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=tree;f=drivers/dca;hb=HEAD
http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=tree;f=drivers/dma;hb=HEAD
http://download.intel.com/technology/comms/perfnet/download/ServerNetworkIOAccel.pdf
http://download.intel.com/design/network/prodbrf/317796.pdf

>> Many of the things you mention here are planned for FreeBSD 8.0 in the
>> same or a different form.  Work in progress is the separation of the
>> ARP table from the kernel routing table.  If we can prevent references
>> to radix nodes generally, almost all locking can be done away with.
>> Instead a single global rmlock (read-mostly lock) could govern the
>> entire routing table.  Obtaining the rmlock for reading is essentially
>> free.
>
> This is exactly what I'm thinking, this feels like the right way forward.
>
> A single rwlock should be fine, route table updates should generally
> only be happening from one process, and thus a single thread, at any
> given time.

rmlocks are even faster, and the routing table's change-to-use ratio is
just right for them.

>> Table changes are very infrequent compared to lookups (like 700,000
>> lookups to 300-400 changes) in default-free Internet routing.  The
>> radix trie nodes are rather big and could use some more trimming to
>> make them fit a single cache line.  I've already removed some stuff a
>> couple of years ago and more can be done.
>>
>> It's very important to keep this in mind: "profile, don't speculate".
>
> Beware though that functionality isn't sacrificed at the expense of this.
>
> For example it would be very, very useful to be able to merge the
> multicast routing implementation with the unicast one -- with the
> proviso of course that mBGP requires that RPF can be performed with a
> separate set of FIB entries from the unicast FIB.
>
> Of course if next-hops themselves are held in a container separately
> referenced from the radix node, such as a simple linked list as per the
> OpenBSD code.

I haven't looked at the multicast code so I can't comment.  The other
stuff is just talk so far.  No work in progress, at least from my side.
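To put the rmlock idea above into more concrete terms, here is a minimal
sketch of what a single table-wide read-mostly lock could look like.
Only the rmlock(9) calls are the real KPI; the rtable_*, nhop_result and
rt_change names are made-up placeholders for the radix trie code:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>
#include <sys/socket.h>

struct ifnet;                           /* only used as a pointer here */
struct rt_change;                       /* placeholder: a table update request */

struct nhop_result {                    /* placeholder: copied-out next hop */
        struct ifnet    *nr_ifp;
        struct sockaddr  nr_gateway;
};

/* Placeholders for the actual radix trie walk and modification. */
int  rtable_lookup(const struct sockaddr *, struct nhop_result *);
void rtable_modify(const struct rt_change *);

static struct rmlock rtable_rmlock;

void
rtable_lock_init(void)
{

        rm_init(&rtable_rmlock, "rtable");
}

/*
 * Read side, taken for every lookup.  The result is copied out so no
 * reference to a radix node escapes the read section.
 */
int
rtable_lookup_locked(const struct sockaddr *dst, struct nhop_result *res)
{
        struct rm_priotracker tracker;
        int error;

        rm_rlock(&rtable_rmlock, &tracker);
        error = rtable_lookup(dst, res);
        rm_runlock(&rtable_rmlock, &tracker);
        return (error);
}

/*
 * Write side, taken only for the rare table change (route add/delete).
 */
void
rtable_change_locked(const struct rt_change *req)
{

        rm_wlock(&rtable_rmlock);
        rtable_modify(req);
        rm_wunlock(&rtable_rmlock);
}

The important part is that the read side copies the result out instead
of handing back a pointer into the trie, which is exactly the "no
references to radix nodes" condition that makes a single global lock
workable.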
> If we ensure the parent radix trie node object fits in a cache line,
> then that's fine.
>
> [I am looking at some stuff in the dynamic/ad-hoc/mesh space which is
> really going to need support for multipath similar to this.]

I was looking at a parallel forwarding table for fastforward that is
highly optimized for IPv4 and cache efficiency.  It was supposed to be
8-bit stride based (256-ary) with SSE based multi-segment longest prefix
match updates.  I never managed to get it past the design stage though,
and it's not one of the pressing issues.  The radix trie is pretty
efficient for being architecture independent.  Even though the depth of
the trie and the variety of destination addresses matter, it never
really turned out to be a bottleneck in my profiling at the time.  It
does have its limitations though, which become more apparent at very
high PPS and with very large routing tables as in the DFZ.

--
Andre
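To illustrate the 8-bit stride idea mentioned above, here is a minimal
userland sketch of a 256-ary trie doing IPv4 longest prefix match.  It
leaves out the SSE based updates and the prefix expansion needed for
prefix lengths that are not a multiple of 8; all names are made up and
addresses are assumed to be in host byte order:

#include <stdint.h>
#include <stdlib.h>

struct nhop;                            /* opaque next hop, placeholder */

struct stride_node {
        struct stride_node *child[256]; /* next 8-bit stride, or NULL */
        struct nhop        *nhop[256];  /* best match at this depth */
};

/*
 * Insert a prefix.  Only /8, /16, /24 and /32 are handled in this
 * sketch; other lengths would need expansion into 2^(8 - plen % 8)
 * slots at the last level.
 */
void
stride_insert(struct stride_node *root, uint32_t prefix, int plen,
    struct nhop *nh)
{
        struct stride_node *n = root;
        uint8_t idx;
        int level;

        for (level = 0; (level + 1) * 8 < plen; level++) {
                idx = (prefix >> (24 - level * 8)) & 0xff;
                if (n->child[idx] == NULL) {
                        n->child[idx] = calloc(1, sizeof(*n));
                        if (n->child[idx] == NULL)
                                return;         /* allocation failed */
                }
                n = n->child[idx];
        }
        idx = (prefix >> (24 - level * 8)) & 0xff;
        n->nhop[idx] = nh;
}

/* Lookup: at most four dependent loads, one per 8-bit stride. */
struct nhop *
stride_lookup(const struct stride_node *root, uint32_t dst)
{
        const struct stride_node *n = root;
        struct nhop *best = NULL;
        uint8_t idx;
        int level;

        for (level = 0; n != NULL && level < 4; level++) {
                idx = (dst >> (24 - level * 8)) & 0xff;
                if (n->nhop[idx] != NULL)
                        best = n->nhop[idx];    /* deeper match wins */
                n = n->child[idx];
        }
        return (best);
}

The attraction is that a full /32 lookup takes at most four dependent
memory loads, each indexed directly by one byte of the destination
address, at the cost of fairly large nodes (about 4 KB each on 64-bit).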