From: Andre Oppermann
Date: Wed, 21 Aug 2013 20:40:36 +0200
To: Luigi Rizzo
Cc: Lawrence Stewart, FreeBSD Net
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
Message-ID: <521509A4.8020601@freebsd.org>
In-Reply-To: <20130814102109.GA63246@onelab2.iet.unipi.it>
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org> <520B24A0.4000706@freebsd.org> <520B3056.1000804@freebsd.org> <20130814102109.GA63246@onelab2.iet.unipi.it>
List-Id: Networking and TCP/IP with FreeBSD

On 14.08.2013 12:21, Luigi Rizzo wrote:
> On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
>> I think (check the driver code in question as I'm not
>> sure) that if you
>> "ifconfig lro" and the driver has hardware support or has been made
>> aware of our software implementation, it should DTRT.
>
> The "lower throughput than linux" that julian was seeing is either
> because of a slow (CPU-bound) sender or slow receiver. Given that
> the FreeBSD tx path is quite expensive (redoing route and arp lookups
> on every packet, etc.) I highly suspect the sender side is at fault.
>
> Then the problem remains that we should keep a copy of route and
> arp information in the socket instead of redoing the lookups on
> every single transmission, as they consume some 25% of the time of
> a sendto(), and probably even more when it comes to large tcp
> segments, sendfile() and the like.

It's the locking and ref-counting overhead in the routing table and
ARP table causing a lot of cache thrashing and bus lock cycles.

The fix is rather simple. The routing table gets protected by a
rm_lock instead of a normal lock. Individual routes no longer have
their own lock and no more ref-counting. All pointers to routes and
into the routing table are prohibited. Upon lookup the sought
information is copied out (ifp, ifaddr, nexthop) without retaining
any reference to the routing entry. Ditto for the ARP table.

Because changes to the routing and ARP tables are very infrequent
compared to the number of lookups performed on them, this exhibits
very good cache behavior across multiple cores and cpus. No shared
routing table memory is dirtied during lookup.

Approaches that do NOT work (well):

 - flow caching where a separate entry is generated for every active
   connection containing direct pointers to the rtentry, arp entry
   and interface. Besides the pointer validity and refcounting issues
   it scales very poorly for a large number of "flows" exhibiting a
   large lookup overhead. The routing table (default and interface
   routes) and ARP table (a few hosts) stay at the same size and have
   a "constant" lookup time.
 - per cpu copies of routing and arp table have increased memory
   consumption and synchronization issues on updates especially with
   high core counts.

 - storing the rtentry and arp entry pointers in the inpcb has
   similar issues as the flow table approach while periodically
   having to check if the route or arp entry changed.

The rm_lock is the fastest, cheapest and most SMP scalable approach
shown so far.

I have patches against a roughly 12 month old current lying around
if someone wants to brush them up and work out the final kinks. The
speedup and reduction in overhead are significant.

--
Andre