From owner-freebsd-net@FreeBSD.ORG Mon Jul 21 15:34:43 2014
Message-ID: <53CD330E.6090407@gmail.com>
Date: Mon, 21 Jul 2014 11:34:38 -0400
From: John Jasen
To: FreeBSD Net, Navdeep Parhar
Subject: packet forwarding and possible mitigation of Intel QuickPath Interconnect ugliness in multi cpu systems
List-Id: Networking and TCP/IP with FreeBSD

Executive Summary:

Appropriate use of cpuset(1) can mitigate performance bottlenecks across the Intel QPI processor interconnect and improve the packets-per-second forwarding rate by over 100%.

Test Environment:

My test system is a Dell dual-CPU R820, populated with evaluation cards graciously provided by Chelsio. Currently, each dual-port Chelsio card sits in a x16 slot, one physically attached to each CPU. My load generators are 20 CentOS-based Linux systems, using Mellanox VPI ConnectX-3 cards in Ethernet mode.

The test environment divides the load generators into 4 distinct subnets of 5 systems, with each subnet using one Chelsio interface as its route to the other networks. I use iperf3 on the Linux systems to generate packets. Each test run selects two systems on each subnet as senders and three on each as receivers; each sending system establishes 4 UDP streams to each receiver.

Results:

I ran "netstat -w 1 -q 100 -d" during each run and summarized the results with the following awk script:
awk '{ipackets+=$1} {idrops+=$3} {opackets+=$5} {odrops+=$9} END {print "input " ipackets/NR, "idrops " idrops/NR, "opackets " opackets/NR, "odrops " odrops/NR}'

Without any cpuset tuning at all:
    input 7.25464e+06  idrops 5.89939e+06  opackets 1.34888e+06  odrops 947.409

With cpuset assigning interrupts equally to each physical processor:
    input 1.10886e+07  idrops 9.85347e+06  opackets 1.22887e+06  odrops 3384.86

With cpuset assigning interrupts across the cores of the first physical processor:
    input 1.14046e+07  idrops 8.6674e+06   opackets 2.73365e+06  odrops 2420.75

With cpuset assigning interrupts across the cores of the second physical processor:
    input 1.16746e+07  idrops 8.96412e+06  opackets 2.70652e+06  odrops 3076.52

I will follow this up with both cards in PCIe slots physically connected to the first CPU, but as a rule-of-thumb comparison, with the interrupts cpuset appropriately, that configuration was usually about 10-15% higher than either of the single-processor cpuset cases above.

Conclusion:

The best solution for highest performance is still to avoid QPI as much as possible, by appropriate physical placement of the PCIe cards. However, in cases where that is not possible or desirable, using cpuset to assign all the interrupt affinity to one processor will help mitigate the performance loss.

Credits:

Thanks to Dell for the loan of the R820 used for testing; thanks to Chelsio for the loan of the two T580-CR cards; and thanks to the cxgbe(4) maintainer, Navdeep Parhar, for his assistance and patience during debugging and testing.

Feedback is always welcome, and I can provide detailed results upon request. The test scripts were provided by a vendor, so I need their permission to redistribute/publish them, but I do not think that's a problem.
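For anyone who wants to reuse the summarization step, here it is as a self-contained pipeline. The two input lines are fabricated placeholders (not my test data), in the "netstat -w 1 -d" column order: ipackets errs idrops ibytes opackets errs obytes colls odrops.

```shell
# Average the per-second netstat samples with the same awk one-liner
# used above. Sample lines are made-up placeholders for illustration.
printf '%s\n' \
  '100 0 10 1000 50 0 500 0 5' \
  '200 0 20 2000 150 0 1500 0 15' |
awk '{ipackets+=$1} {idrops+=$3} {opackets+=$5} {odrops+=$9}
     END {print "input " ipackets/NR, "idrops " idrops/NR,
                "opackets " opackets/NR, "odrops " odrops/NR}'
# -> input 150 idrops 15 opackets 100 odrops 10
```

Note that in a live capture NR also counts netstat's header lines, so the averages are slightly diluted; the comparisons above are all diluted the same way, so the relative numbers hold.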
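For the cpuset part, a minimal sketch of the interrupt pinning is below. The IRQ numbers and the CPU count are hypothetical placeholders: list the real Chelsio interrupt vectors with "vmstat -ai | grep t5nex" and check the CPU topology with "cpuset -g" before adapting it.

```shell
# Round-robin a list of IRQs over the cores of one physical package.
# Placeholder values throughout; needs root on the FreeBSD box.
pin_irqs_round_robin() {
    ncpus=$1; shift          # number of cores in the target package
    cpu=0
    for irq in "$@"; do
        # Bind this interrupt source to a single core:
        cpuset -x "$irq" -l "$cpu"
        cpu=$(( (cpu + 1) % ncpus ))
    done
}

# Example (hypothetical t5nex IRQs 300-307 spread over CPUs 0-7,
# i.e. the cores of the first package on this particular box):
#   pin_irqs_round_robin 8 300 301 302 303 304 305 306 307
```

Keeping all the NIC interrupts on one package is what avoids bouncing packet buffers across QPI; which package you pick matters less than not splitting them, per the numbers above.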