From owner-freebsd-net@FreeBSD.ORG Fri Sep 13 22:44:58 2013
Date: Fri, 13 Sep 2013 18:43:23 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: George Neville-Neil
Cc: "Alexander V. Chernikov", Luigi Rizzo, Andre Oppermann,
 freebsd-hackers@freebsd.org, FreeBSD Net, Adrian Chadd,
 "Andrey V. Elsukov", freebsd-arch@freebsd.org
Subject: Re: Network stack changes

George Neville-Neil wrote:
> 
> On Aug 29, 2013, at 7:49, Adrian Chadd wrote:
> 
> > Hi,
> > 
> > There's a lot of good stuff to review here, thanks!
> > 
> > Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless
> > to keep locking things like that on a per-packet basis. We should be
> > able to do this in a cleaner way - we can defer RX into a CPU-pinned
> > taskqueue and convert the interrupt handler to a fast handler that
> > just schedules that taskqueue. We can ignore the ithread entirely
> > here.
> > 
> > What do you think?
> > 
> > Totally pie in the sky handwaving at this point:
> > 
> > * create an array of mbuf pointers for completed mbufs;
> > * populate the mbuf array;
> > * pass the array up to ether_demux().
> > 
> > For vlan handling, it may end up populating its own list of mbufs
> > to push up to ether_demux(). So maybe we should extend the API to
> > have a bitmap of packets to actually handle from the array, so we
> > can pass up a larger array of mbufs, note which ones are for the
> > destination and then the upcall can mark which frames it has
> > consumed.
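
A purely illustrative sketch (plain C, all names hypothetical, not
actual kernel code) of the completed-mbuf array plus consumed-frame
bitmap described above might look something like this:

    /*
     * Sketch only: a batch of completed receive frames plus a bitmap
     * recording which entries the upcall (e.g. ether_demux() or a vlan
     * layer) has consumed.  The names rx_batch, rb_* and
     * rx_batch_mark_consumed() are hypothetical.
     */
    #include <stdint.h>

    #define RX_BATCH_MAX    64              /* frames completed per RX pass */

    struct mbuf;                            /* opaque here; real mbufs in-kernel */

    struct rx_batch {
            struct mbuf *rb_pkts[RX_BATCH_MAX]; /* completed mbuf pointers */
            uint64_t     rb_valid;          /* bit i set: rb_pkts[i] not yet consumed */
            int          rb_count;          /* number of populated slots */
    };

    /* The upcall marks each frame it took ownership of. */
    static inline void
    rx_batch_mark_consumed(struct rx_batch *rb, int idx)
    {
            rb->rb_valid &= ~((uint64_t)1 << idx);
    }

    /* Frames the driver still owns once the upcall returns. */
    static inline int
    rx_batch_remaining(const struct rx_batch *rb)
    {
            return (__builtin_popcountll(rb->rb_valid));
    }

The driver would populate rb_pkts[0..rb_count-1], set the matching bits
in rb_valid, hand the whole batch to the demux upcall in one call, and
then free or re-queue whatever is still marked valid when the upcall
returns.
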
> > I specifically wonder how much work/benefit we may see by doing:
> > 
> > * batching packets into lists so various steps can batch process
> >   things rather than run to completion;
> > * batching the processing of a list of frames under a single lock
> >   instance - eg, if the forwarding code could do the forwarding
> >   lookup for 'n' packets under a single lock, then pass that list
> >   of frames up to inet_pfil_hook() to do the work under one lock,
> >   etc, etc.
> > 
> > Here, the processing would look less like "grab lock and process to
> > completion" and more like "mark and sweep" - ie, we have a list of
> > frames that we mark as needing processing and mark as having been
> > processed at each layer, so we know where to next dispatch them.
> 
> One quick note here. Every time you increase batching you may
> increase bandwidth, but you will also increase per-packet latency for
> the last packet in a batch. That is fine so long as we remember that
> and treat batching as a tuning knob to balance the two.
> 
And any time you increase latency, that will have a negative impact on
NFS performance. NFS RPCs are usually small messages (except Write
requests and Read replies), and the RTT for these (mostly small,
bidirectional) messages can have a significant impact on NFS
performance.

rick

> > I still have some tool coding to do with PMC before I even think
> > about tinkering with this, as I'd like to measure stuff like
> > per-packet latency as well as top-level processing overhead (ie,
> > CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC
> > interrupts on that core, etc.)
> 
> This would be very useful in identifying the actual hot spots, and
> would be helpful to anyone who can generate a decent stream of
> packets with, say, an IXIA.
> 
> Best,
> George
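
As a footnote, a similarly hand-wavy sketch of the "forwarding lookup
for 'n' packets under a single lock" / mark-and-sweep step discussed
above, reusing the hypothetical rx_batch structure from the earlier
sketch. rtable_lock(), rtable_unlock() and route_lookup() are stand-ins
for the real routing-table lock and lookup, not existing interfaces:

    /*
     * Sketch only: take the routing-table lock once, do the lookup for
     * every still-live frame in the batch, sweep (mark) the ones that
     * fail, and report how many survive for the next stage (e.g. a
     * single pfil pass).  All names are hypothetical.
     */
    extern void rtable_lock(void);
    extern void rtable_unlock(void);
    extern int  route_lookup(struct mbuf *);    /* 0 == route found */

    static int
    forward_batch(struct rx_batch *rb)
    {
            uint64_t bit;
            int i, live;

            live = 0;
            rtable_lock();                      /* one lock for 'n' lookups */
            for (i = 0; i < rb->rb_count; i++) {
                    bit = (uint64_t)1 << i;
                    if ((rb->rb_valid & bit) == 0)
                            continue;           /* consumed at an earlier layer */
                    if (route_lookup(rb->rb_pkts[i]) != 0)
                            rb->rb_valid &= ~bit;   /* no route: swept, to be dropped */
                    else
                            live++;
            }
            rtable_unlock();

            return (live);                      /* frames left for the pfil stage, etc. */
    }

The same pattern would repeat at each later stage (pfil, output
queueing): take a lock once, walk the bitmap, and mark off frames as
each layer disposes of them.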