Date: Tue, 08 Apr 2008 04:35:11 -0700
From: Darren Reed <darrenr@freebsd.org>
To: Robert Watson <rwatson@FreeBSD.org>
Cc: arch@freebsd.org, freebsd-current@freebsd.org, "Christian S.J. Peron" <csjp@FreeBSD.org>
Subject: Re: HEADS UP: zerocopy bpf commits impending
Message-ID: <47FB586F.90606@freebsd.org>
In-Reply-To: <20080317134335.A3253@fledge.watson.org>
References: <20080317133029.GA19369@sub.vaned.net> <20080317134335.A3253@fledge.watson.org>
Robert Watson wrote:
> On Mon, 17 Mar 2008, Christian S.J. Peron wrote:
>
>> Just wanted to give a heads up that I plan to start merging the work
>> located in the zerocopy bpf perforce branch. We have been working on
>> this project for about a year now and feel that it is ready to come
>> into the tree.
>>
>> I will begin to merge hopefully today [assuming nobody has any
>> concerns] or tomorrow. Zerocopy bpf will be disabled by default, and
>> can be enabled globally through the use of a sysctl variable. Once
>> the kernel bits are in and we sort out a couple of minor nits in
>> libpcap+tcpdump, we will be looking at getting our libpcap patches
>> committed upstream. I will post a patch for people to experiment
>> with in the meantime after the kernel commits are complete.
>>
>> We do not anticipate this will have any effect on existing bpf
>> consumers like libpcap, tcpdump, etc., so if something breaks, it
>> shouldn't have, and we need to know about it :) We were pretty
>> careful about preserving the ABI. The only exception to this is that
>> netstat will need a recompile because the size of its bpf stats
>> structure changed.
>>
>> So if there are any objections or concerns, now is the time to raise
>> them.
>
> Per previous posts, interested parties can find the slides on the
> design from the BSDCan 2007 developer summit here:
>
> http://www.watson.org/~robert/freebsd/2007bsdcan/20070517-devsummit-zerocopybpf.pdf

Is there a performance analysis of copy vs. zerocopy available? (I
don't see one in the slides, just a "to do" item.) The numbers I'm
interested in seeing are how many Mb/s you can capture before you
start suffering packet loss. This needs to be done with sequenced
packets so that you can observe gaps in the captured sequence.

I experimented with something like this back in 2004:

http://mail-index.netbsd.org/tech-net/2004/05/02/0001.html
http://mail-index.netbsd.org/tech-net/2004/05/21/0001.html

Rather than mapping user space memory into the kernel, I used mmap(2)
to access the kernel's buffer from user space and then did the ioctl
thing to move the pointers. I also played with making the primary
buffer smaller but having more alternate buffers, so that while one
buffer was mapped out to user space, one (or more) buffers were still
available in the kernel.

Speed improvement? Slight (less than 2%) in the testing I did. Why
only slight? Because there's another factor here: how long it takes
to process the data that is in the buffer and free it up for the
kernel. The time you gain from having more buffer space available in
the kernel you lose (in part) to the management overhead.

In the end I decided that the change, while interesting, didn't really
solve the problem, which was that the speed at which capture could
effectively be done was bounded by the time spent analysing the
captured data. If packets that you want to analyse arrive faster than
you can do the analysis, then you will drop packets - end of story.

So why isn't there a huge performance increase? My $0.02...

When using read(2) to get bpf data, you transfer the data from the
kernel to the user space buffer straight away, and that immediately
frees up that buffer in the kernel for more capture.
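To make the comparison concrete, here is roughly the consumer side of
that copying path - a minimal sketch against the standard bpf(4)
interface, with the device node and interface name as placeholders and
error handling omitted:

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/bpf.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct bpf_hdr *bh;
        struct ifreq ifr;
        u_int blen;
        char *buf, *p;
        ssize_t n;
        int fd;

        fd = open("/dev/bpf0", O_RDONLY);       /* device node assumed */
        ioctl(fd, BIOCGBLEN, &blen);            /* kernel buffer size */
        memset(&ifr, 0, sizeof(ifr));
        strlcpy(ifr.ifr_name, "em0", sizeof(ifr.ifr_name)); /* NIC assumed */
        ioctl(fd, BIOCSETIF, &ifr);
        buf = malloc(blen);

        /*
         * Each read(2) copies a whole kernel hold buffer out to us,
         * and it is that copy which hands the buffer straight back to
         * the kernel for further capture.
         */
        while ((n = read(fd, buf, blen)) > 0) {
            p = buf;
            while (p < buf + n) {
                bh = (struct bpf_hdr *)p;
                /* packet data begins at p + bh->bh_hdrlen */
                p += BPF_WORDALIGN(bh->bh_hdrlen + bh->bh_caplen);
            }
        }
        free(buf);
        close(fd);
        return (0);
    }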
When you share the buffer between the kernel and user space, you
either (1) delay kernel access to that buffer while you process all of
its contents, copying out any bits that you want to keep, or (2) do
another copy from the shared buffer to a private buffer, releasing
contention for the shared buffer but again paying for a copy - so the
end result is not much different.

The problem with (1) is that you always have less buffer space
available at the kernel level for storing packet data than you do
without that segment "held" for user space activity. So even if you
write(2) the buffer used in (1) straight away, there is a delay in the
turnaround time for that buffer of however long your disk I/O takes to
complete.

And someone asked about packet capture direct to disk - too slow if
you do it through a vnode with an eye on 10G. Heck, at 10G speeds you
need to be handling 2.5GB/sec - can any affordable disk write that
fast? Why 2.5GB/sec? To successfully sniff a 10G stream you need two
10G NICs, for a combined total of 20Gb/s incoming (remember, full
duplex: 10G going in each direction... and you thought plugging your
single NIC into a full-duplex monitor port on a switch was always
enough... ha!), and 20Gb/s divided by 8 bits per byte is 2.5GB/sec.

State of the art packet capture has moved to hardware-assisted cards,
such as those from Endace:

http://www.endace.com/our-products/dag-network-monitoring-cards/ethernet

If you want 10G capture on FreeBSD, get drivers for those cards made
for FreeBSD. Those cards are absolutely necessary on Linux to get
performance anywhere near FreeBSD's ;)

Darren
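P.S. For the mechanically minded, here is a rough sketch of what
option (1) looks like with the proposed shared-buffer interface. The
ioctl and structure names below (BIOCSETBUFMODE, BIOCSETZBUF, the
generation counters) are my reading of the zerocopy branch and may not
match what finally gets committed, so treat this as illustrative only:

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/bpf.h>

    /* buflen is assumed to be a multiple of the page size and within
     * the limit reported by BIOCGETZMAX. */
    static int
    setup_zbuf(int fd, size_t buflen)
    {
        struct bpf_zbuf zb;
        u_int mode = BPF_BUFMODE_ZBUF;

        if (ioctl(fd, BIOCSETBUFMODE, &mode) == -1)
            return (-1);
        /* The consumer allocates both halves of the double buffer;
         * the kernel wires and remaps the pages so captured packets
         * never cross the kernel/user boundary by copy. */
        zb.bz_bufa = mmap(NULL, buflen, PROT_READ | PROT_WRITE,
            MAP_ANON, -1, 0);
        zb.bz_bufb = mmap(NULL, buflen, PROT_READ | PROT_WRITE,
            MAP_ANON, -1, 0);
        zb.bz_buflen = buflen;
        return (ioctl(fd, BIOCSETZBUF, &zb));
    }

    static void
    consume(struct bpf_zbuf_header *bzh)
    {
        /* Packets sit in place after this header.  While we walk
         * them the kernel cannot refill the buffer - this is exactly
         * the "hold" in option (1): every cycle spent on analysis or
         * disk I/O here costs capture buffer space. */

        bzh->bzh_user_gen = bzh->bzh_kernel_gen;    /* hand it back */
    }

And the acknowledgement in consume() is the whole point: until
bzh_user_gen catches up with bzh_kernel_gen, that buffer is dead
weight as far as the kernel is concerned.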