From: Takuya ASADA
Date: Thu, 18 Aug 2011 01:11:13 +0900
To: Vlad Galu
Cc: net@freebsd.org
Subject: Re: Multiqueue support for bpf

2011/8/16 Vlad Galu :
> On Aug 16, 2011, at 11:50 AM, Vlad Galu wrote:
>> On Aug 16, 2011, at 11:13 AM, Takuya ASADA wrote:
>>> Hi all,
>>>
>>> I implemented multiqueue support for bpf; I'd like to present it for review.
>>> This is a Google Summer of Code project. The project goal is to support
>>> multiqueue network interfaces in BPF and to provide interfaces for
>>> multithreaded packet processing using BPF.
>>> Modern high-performance NICs have multiple receive/send queues and an RSS
>>> feature, which allows packets to be processed concurrently on multiple
>>> processors.
>>> The main purpose of the project is to support such hardware and to benefit
>>> from that parallelism.
>>>
>>> This provides the following new APIs:
>>> - queue filter for each bpf descriptor (bpf ioctl)
>>>   - BIOCENAQMASK     Enable the multiqueue filter on the descriptor
>>>   - BIOCDISQMASK     Disable the multiqueue filter on the descriptor
>>>   - BIOCSTRXQMASK    Set the mask bit on the specified RX queue
>>>   - BIOCCRRXQMASK    Clear the mask bit on the specified RX queue
>>>   - BIOCGTRXQMASK    Get the mask bit on the specified RX queue
>>>   - BIOCSTTXQMASK    Set the mask bit on the specified TX queue
>>>   - BIOCCRTXQMASK    Clear the mask bit on the specified TX queue
>>>   - BIOCGTTXQMASK    Get the mask bit on the specified TX queue
>>>   - BIOCSTOTHERMASK  Set the mask bit for packets not tied to any queue
>>>   - BIOCCROTHERMASK  Clear the mask bit for packets not tied to any queue
>>>   - BIOCGTOTHERMASK  Get the mask bit for packets not tied to any queue
>>>
>>> - generic interface for getting hardware queue information from the NIC
>>>   driver (socket ioctl)
>>>   - SIOCGIFQLEN         Get the interface RX/TX queue length
>>>   - SIOCGIFRXQAFFINITY  Get the interface RX queue affinity
>>>   - SIOCGIFTXQAFFINITY  Get the interface TX queue affinity
>>>
>>> The patch for -CURRENT is here; right now it only supports igb(4),
>>> ixgbe(4) and mxge(4):
>>> http://www.dokukino.com/mq_bpf_20110813.diff
>>>
>>> And below is a performance benchmark:
>>>
>>> ====
>>> I implemented benchmark programs based on
>>> bpfnull (//depot/projects/zcopybpf/utils/bpfnull/).
>>>
>>> test_sqbpf measures bpf throughput on one thread, without using the
>>> multiqueue APIs.
>>> http://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/soc2011/mq_bpf/src/tools/regression/bpf/mq_bpf/test_sqbpf/test_sqbpf.c
>>>
>>> test_mqbpf is a multithreaded version of test_sqbpf, using the
>>> multiqueue APIs.
>>> http://p4db.freebsd.org/fileViewer.cgi?FSPC=//depot/projects/soc2011/mq_bpf/src/tools/regression/bpf/mq_bpf/test_mqbpf/test_mqbpf.c
>>>
>>> I benchmarked with six conditions:
>>> - benchmark1 only reads bpf, doesn't write packets anywhere
>>> - benchmark2 writes packets to memory (mfs)
>>> - benchmark3 writes packets to hdd (zfs)
>>> - benchmark4 only reads bpf, doesn't write packets anywhere, with zerocopy
>>> - benchmark5 writes packets to memory (mfs), with zerocopy
>>> - benchmark6 writes packets to hdd (zfs), with zerocopy
>>>
>>> From the benchmark results, I can say the performance is increased using
>>> mq_bpf on 10GbE, but not on GbE.
>>>
>>> * Throughput benchmark
>>> - Test environment
>>>  - FreeBSD node
>>>   CPU: Core i7 X980 (12 threads)
>>>   MB: ASUS P6X58D Premium (Intel X58)
>>>   NIC1: Intel Gigabit ET Dual Port Server Adapter (82576)
>>>   NIC2: Intel Ethernet X520-DA2 Server Adapter (82599)
>>>  - Linux node
>>>   CPU: Core 2 Quad (4 threads)
>>>   MB: GIGABYTE GA-G33-DS3R (Intel G33)
>>>   NIC1: Intel Gigabit ET Dual Port Server Adapter (82576)
>>>   NIC2: Intel Ethernet X520-DA2 Server Adapter (82599)
>>>
>>> iperf was used to generate network traffic, with the following options:
>>>  - Linux node: iperf -c [IP] -i 10 -t 100000 -P12
>>>  - FreeBSD node: iperf -s
>>>  # 12 threads, TCP
>>>
>>> The following sysctl parameter was changed:
>>>  sysctl -w net.bpf.maxbufsize=1048576
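[A note on the buffer sizing above: net.bpf.maxbufsize only caps what a
descriptor may ask for; each reader still has to request its own buffer with
BIOCSBLEN before binding the descriptor. Below is a minimal sketch using only
the stock bpf(4) ioctls; it is generic code, not taken from the patch or from
the test programs, and "ix0" is just an example interface name.]

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/sysctl.h>

#include <net/bpf.h>
#include <net/if.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        struct ifreq ifr;
        u_int blen, maxbuf;
        size_t len = sizeof(maxbuf);
        int fd;

        /* The global cap set with "sysctl -w net.bpf.maxbufsize=...". */
        if (sysctlbyname("net.bpf.maxbufsize", &maxbuf, &len, NULL, 0) == -1)
                err(1, "sysctlbyname(net.bpf.maxbufsize)");

        if ((fd = open("/dev/bpf", O_RDONLY)) == -1)
                err(1, "open(/dev/bpf)");

        /*
         * Ask for the largest buffer the cap allows.  BIOCSBLEN must be
         * issued before the descriptor is bound with BIOCSETIF.
         */
        blen = maxbuf;
        if (ioctl(fd, BIOCSBLEN, &blen) == -1)
                err(1, "BIOCSBLEN");

        memset(&ifr, 0, sizeof(ifr));
        strlcpy(ifr.ifr_name, "ix0", sizeof(ifr.ifr_name)); /* example NIC */
        if (ioctl(fd, BIOCSETIF, &ifr) == -1)
                err(1, "BIOCSETIF");

        /* Report the buffer size actually granted. */
        if (ioctl(fd, BIOCGBLEN, &blen) == -1)
                err(1, "BIOCGBLEN");
        printf("bpf buffer: %u bytes (net.bpf.maxbufsize = %u)\n",
            blen, maxbuf);

        close(fd);
        return (0);
}

[This builds against the base system headers alone; nothing in it depends on
the multiqueue patch.]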
>>
>>
>> Thank you for your work! You may want to increase that (4x/8x) and rerun
>> the test, though.
>
> More, actually. Your current buffer is easily filled.

Hi,

I measured performance again with maxbufsize = 268435456 and multiple
CPU configurations; here are the results.

It seems the performance on 10GbE is a bit unstable and does not scale
linearly when adding CPUs/queues. Maybe it depends on some system
parameter, but I haven't figured out the cause yet. Still, multithreaded
BPF performance is higher than single-threaded BPF in all cases.

* Test environment
- FreeBSD node
 CPU: Core i7 X980 (12 threads)
 # Tested in 1-core, 2-core, 4-core and 6-core configurations
 # (each core has 2 threads using HT)
 MB: ASUS P6X58D Premium (Intel X58)
 NIC: Intel Ethernet X520-DA2 Server Adapter (82599)
- Linux node
 CPU: Core 2 Quad (4 threads)
 MB: GIGABYTE GA-G33-DS3R (Intel G33)
 NIC: Intel Ethernet X520-DA2 Server Adapter (82599)
- iperf
 Linux node: iperf -c [IP] -i 10 -t 100000 -P16
 FreeBSD node: iperf -s
 # 16 threads, TCP
- system parameters
 net.bpf.maxbufsize=268435456
 hw.ixgbe.num_queues=[n queues]

* 2 threads, 2 queues
- iperf throughput
 iperf only: 8.845Gbps
 test_mqbpf: 5.78Gbps
 test_sqbpf: 6.89Gbps
- test program throughput
 test_mqbpf: 4526.863414 Mbps
 test_sqbpf: 762.452475 Mbps
- received/dropped
 test_mqbpf: 45315011 packets received (BPF)
             9646958 packets dropped (BPF)
 test_sqbpf: 56216145 packets received (BPF)
             49765127 packets dropped (BPF)

* 4 threads, 4 queues
- iperf throughput
 iperf only: 3.03Gbps
 test_mqbpf: 2.49Gbps
 test_sqbpf: 2.57Gbps
- test program throughput
 test_mqbpf: 2420.195051 Mbps
 test_sqbpf: 430.774870 Mbps
- received/dropped
 test_mqbpf: 19601503 packets received (BPF)
             0 packets dropped (BPF)
 test_sqbpf: 22803778 packets received (BPF)
             18869653 packets dropped (BPF)

* 8 threads, 8 queues
- iperf throughput
 iperf only: 5.80Gbps
 test_mqbpf: 4.42Gbps
 test_sqbpf: 4.30Gbps
- test program throughput
 test_mqbpf: 4242.314913 Mbps
 test_sqbpf: 1291.719866 Mbps
- received/dropped
 test_mqbpf: 34996953 packets received (BPF)
             361947 packets dropped (BPF)
 test_sqbpf: 35738058 packets received (BPF)
             24749546 packets dropped (BPF)

* 12 threads, 12 queues
- iperf throughput
 iperf only: 9.31Gbps
 test_mqbpf: 8.06Gbps
 test_sqbpf: 5.67Gbps
- test program throughput
 test_mqbpf: 8089.242472 Mbps
 test_sqbpf: 5754.910665 Mbps
- received/dropped
 test_mqbpf: 73783957 packets received (BPF)
             9938 packets dropped (BPF)
 test_sqbpf: 49434479 packets received (BPF)
             0 packets dropped (BPF)
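[To make the intended use of the new ioctls a bit more concrete, here is a
rough sketch of a one-thread-per-RX-queue reader along the lines of
test_mqbpf. It is illustrative only: the calling conventions are assumptions
(BIOCENAQMASK is assumed to take no argument and BIOCSTRXQMASK a uint32_t
queue index, and the ordering relative to BIOCSETIF is a guess), and the
queue count and interface name are hardcoded placeholders for what
SIOCGIFQLEN and the real setup would provide. The patch above is
authoritative for the actual definitions.]

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

#include <net/bpf.h>
#include <net/if.h>

#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NQUEUES 12      /* placeholder; would come from SIOCGIFQLEN */
#define IFNAME  "ix0"   /* example interface name */

static void *
queue_reader(void *arg)
{
        uint32_t qidx = (uint32_t)(uintptr_t)arg;
        struct ifreq ifr;
        u_int blen;
        ssize_t n;
        char *buf, *p;
        int fd;

        if ((fd = open("/dev/bpf", O_RDONLY)) == -1)
                err(1, "open(/dev/bpf)");

        memset(&ifr, 0, sizeof(ifr));
        strlcpy(ifr.ifr_name, IFNAME, sizeof(ifr.ifr_name));
        if (ioctl(fd, BIOCSETIF, &ifr) == -1)
                err(1, "BIOCSETIF");

        /* Switch this descriptor into multiqueue-filter mode ... */
        if (ioctl(fd, BIOCENAQMASK, NULL) == -1)   /* no argument: assumption */
                err(1, "BIOCENAQMASK");
        /* ... and accept traffic from a single RX queue only. */
        if (ioctl(fd, BIOCSTRXQMASK, &qidx) == -1) /* uint32_t index: assumption */
                err(1, "BIOCSTRXQMASK");

        if (ioctl(fd, BIOCGBLEN, &blen) == -1)
                err(1, "BIOCGBLEN");
        if ((buf = malloc(blen)) == NULL)
                err(1, "malloc");

        /* Each thread now sees only the packets of its own queue. */
        while ((n = read(fd, buf, blen)) > 0) {
                for (p = buf; p < buf + n; ) {
                        struct bpf_hdr *bh = (struct bpf_hdr *)p;

                        /* payload: bh->bh_caplen bytes at p + bh->bh_hdrlen */
                        p += BPF_WORDALIGN(bh->bh_hdrlen + bh->bh_caplen);
                }
        }
        free(buf);
        close(fd);
        return (NULL);
}

int
main(void)
{
        pthread_t tid[NQUEUES];
        uintptr_t q;

        for (q = 0; q < NQUEUES; q++)
                if (pthread_create(&tid[q], NULL, queue_reader,
                    (void *)q) != 0)
                        errx(1, "pthread_create");
        for (q = 0; q < NQUEUES; q++)
                pthread_join(tid[q], NULL);
        return (0);
}

[A fuller version would query SIOCGIFQLEN for the real queue count and could
pin each thread to the CPU reported by SIOCGIFRXQAFFINITY for its queue, e.g.
with cpuset_setaffinity(2), which is the point of exposing the affinity
information alongside the per-queue filters.]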