From: Luigi Rizzo
To: net@freebsd.org, current@freebsd.org
Date: Thu, 19 Apr 2012 15:30:18 +0200
Subject: Some performance measurements on the FreeBSD network stack

I have been running some performance tests on UDP sockets, using the
netsend program in tools/tools/netrate/netsend and instrumenting the
source code and the kernel to return at various points along the path
(a sketch of the early-return instrumentation is in the P.S. at the
end of this message). Here are some results which I hope you find
interesting.

Test conditions:
- Intel i7-870 CPU running at 2.93 GHz + TurboBoost, all 4 cores
  enabled, no hyperthreading
- FreeBSD HEAD as of 15 April 2012, no ipfw, no other pfil clients,
  no IPv6 or IPsec
- userspace running 'netsend 10.0.0.2 5555 18 0 5' (output to a
  physical interface, UDP port 5555, small frames, no rate limit,
  5-second experiments)
- the 'ns' column reports the total time divided by the number of
  successful transmissions; we report the min and max over 5 tests
  (a simplified sketch of this measurement loop is shown right after
  the caveats)
- 1 to 4 parallel tasks, variable packet sizes
- there are variations in the numbers, which become larger as we
  reach the bottom of the stack

Caveats:
- in the table below, clock and pktlen are constant. I am including
  the info here so it is easier to compare the results with future
  experiments.
- I have a small number of samples, so I am only reporting the min
  and the max over a handful of experiments.
- I am only measuring average values over millions of cycles. I have
  no information on the variance between the various executions.
- from what I have seen, numbers vary significantly on different
  systems, depending on memory speed, caches and other things. The
  big jumps are significant and present on all systems, but the small
  deltas (say < 5%) are not even statistically significant.
- if someone is interested in replicating the experiments, email me
  and I will post a link to a suitable PicoBSD image.
- I have not yet instrumented the bottom layers (if_output and below).
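To make the 'ns' column concrete, here is a minimal sketch of the kind
of measurement loop netsend runs (a simplified stand-in, not the
actual tool in tools/tools/netrate/netsend; the 18-byte payload, port
5555 and 5-second duration match the command line above):

/*
 * Minimal sketch of the userspace measurement loop: send small UDP
 * frames for ~5 seconds, then report the total elapsed time divided
 * by the number of successful transmissions.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int
main(void)
{
	char payload[18] = { 0 };	/* 18-byte UDP payload */
	struct sockaddr_in dst;
	struct timespec t0, t1;
	unsigned long sent = 0, tries = 0;
	double ns;
	int s;

	s = socket(AF_INET, SOCK_DGRAM, 0);
	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(5555);
	inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	t1 = t0;
	do {
		if (sendto(s, payload, sizeof(payload), 0,
		    (struct sockaddr *)&dst, sizeof(dst)) ==
		    (ssize_t)sizeof(payload))
			sent++;
		/* read the clock only every 64k attempts to keep the
		 * timing overhead out of the per-packet cost */
		if ((++tries & 0xffff) == 0)
			clock_gettime(CLOCK_MONOTONIC, &t1);
	} while (t1.tv_sec - t0.tv_sec < 5);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%lu packets, %.1f ns/pkt\n", sent, ns / sent);
	return (0);
}

With the kernel patched to return early at a given probe point, the
same loop reports the cost of the path up to that point.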
The results show a few interesting things:

- the packet-sending application is reasonably fast and certainly not
  a bottleneck (over 100 Mpps before calling the system call);

- the system call is somewhat expensive, about 100ns. I am not sure
  where the time is spent (the amd64 code does a few pushes on the
  stack and then runs "syscall", followed by a "sysret"), and I am
  not sure how much room for improvement there is in this area.
  The relevant code is in lib/libc/i386/SYS.h and
  lib/libc/i386/sys/syscall.S (KERNCALL translates to "syscall" on
  amd64, and "int 0x80" on i386);

- the next expensive operation, consuming another 100ns, is the mbuf
  allocation in m_uiotombuf(). Nevertheless, the allocator seems to
  scale decently, at least with 4 cores. The copyin() is relatively
  inexpensive (not reported in the data below, but disabling it saves
  only 15-20ns for a short packet). I have not followed the details,
  but the allocator calls the zone allocator, there is at least one
  critical_enter()/critical_exit() pair, and the highly modular
  architecture invokes long chains of indirect function calls both on
  allocation and release. It might make sense to keep a small pool of
  mbufs attached to the socket buffer instead of going to the zone
  allocator, or to defer the actual encapsulation to the
  (*so->so_proto->pr_usrreqs->pru_send)(), which is called inline
  anyway;

- another big bottleneck is the route lookup in ip_output() (between
  entries 51 and 56). Not only does it eat another 100ns+ on an empty
  routing table, it also causes huge contention when multiple cores
  are involved. There is other bad stuff occurring in if_output() and
  below (on this system it takes about 1300ns to send one packet even
  with one core, and only 500-550 are consumed before the call to
  if_output()), but I don't have detailed information yet.

 POS  CPU  clock  pktlen    ns/pkt       --- EXIT POINT ----
                           min    max
 ---------------------------------------------------------------------
  U    1   2934     18       8      8   userspace, before the send() call

       [ syscall ]

 20    1   2934     18     103    107   sys_sendto(): begin
 20    4   2934     18     104    107
 21    1   2934     18     110    113   sendit(): begin
 21    4   2934     18     111    116
 22    1   2934     18     110    114   sendit() after getsockaddr(&to, ...)
 22    4   2934     18     111    124
 23    1   2934     18     111    115   sendit() before kern_sendit
 23    4   2934     18     112    120
 24    1   2934     18     117    120   kern_sendit() after AUDIT_ARG_FD
 24    4   2934     18     117    121
 25    1   2934     18     134    140   kern_sendit() before sosend()
 25    4   2934     18     134    146
 40    1   2934     18     144    149   sosend_dgram(): start
 40    4   2934     18     144    151
 41    1   2934     18     157    166   sosend_dgram() before m_uiotombuf()
 41    4   2934     18     157    168

       [ mbuf allocation and copy. The copy is relatively cheap ]

 42    1   2934     18     264    268   sosend_dgram() after m_uiotombuf()
 42    4   2934     18     265    269
 30    1   2934     18     273    276   udp_send() begin
 30    4   2934     18     274    278

       [ here we start seeing some contention with multiple threads ]

 31    1   2934     18     323    324   udp_output() before ip_output()
 31    4   2934     18     344    348
 50    1   2934     18     326    331   ip_output() beginning
 50    4   2934     18     356    367
 51    1   2934     18     343    349   ip_output() before "if (opt) { ..."
 51    4   2934     18     366    373

       [ rtalloc() is sequential so multiple clients contend heavily ]

 56    1   2934     18     470    480   ip_output() after rtalloc*()
 56    4   2934     18    1310   1378
 52    1   2934     18     472    488   ip_output() at sendit:
 52    4   2934     18    1252   1286
 53    1   2934     18                  ip_output() before pfil_run_hooks()
 53    4   2934     18
 54    1   2934     18     476    477   ip_output() at passout:
 54    4   2934     18    1249   1286
 55    1   2934     18     509    526   ip_output() before if_output
 55    4   2934     18    1268   1278
 ---------------------------------------------------------------------

cheers
luigi
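P.S.: for anyone who wants to replicate the experiments, the
early-return instrumentation amounts to roughly the following (a
simplified sketch, not the actual patch; the sysctl name and the
PROBE_RETURN macro are illustrative). A sysctl selects the probe
point at which the send path drops the packet and returns, and the
userspace loop above then reports the cost of the path up to that
point:

/*
 * Simplified sketch of the early-return instrumentation (names are
 * illustrative, not the actual patch).
 */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

static int probe_exit_point = 0;	/* 0 = run the full path */
SYSCTL_INT(_net, OID_AUTO, probe_exit_point, CTLFLAG_RW,
    &probe_exit_point, 0, "drop the packet and return at this point");

#define	PROBE_RETURN(pos, bailout) do {				\
	if (probe_exit_point == (pos)) {			\
		bailout;					\
	}							\
} while (0)

/*
 * Example use, e.g. in sosend_dgram() just before m_uiotombuf()
 * (this would correspond to entry 41 in the table; nothing has been
 * allocated yet at that point, so bailing out is just an early return):
 *
 *	PROBE_RETURN(41, { error = 0; goto out; });
 */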