From owner-freebsd-current@FreeBSD.ORG Sat Jun 11 23:02:50 2011
Date: Sun, 12 Jun 2011 01:02:48 +0200
From: Jilles Tjoelker
To: Luigi Rizzo
Cc: freebsd-current@freebsd.org
Subject: Re: any place to look at for PCI-express performance issues ?
Message-ID: <20110611230248.GA20040@stack.nl>
References: <20110609035952.GA30464@onelab2.iet.unipi.it> <201106090911.51817.jhb@freebsd.org> <20110611004150.GB59898@onelab2.iet.unipi.it>
In-Reply-To: <20110611004150.GB59898@onelab2.iet.unipi.it>

On Sat, Jun 11, 2011 at 02:41:50AM +0200, Luigi Rizzo wrote:
> just for the records: the AMD motherboard works fine and can reach
> 14.88Mpps, i was just doing a couple of mistakes in my AMD tests,
> including the use of a slot with 16x form factor but only 4 lanes
> connected.
> This said, the i7-870 is about twice as fast as the Athlon II X4-635
> in generating packets for the same clock speed.
> I think the different cache size might have some impact on the
> result given the Athlon has no L3 cache and the test program surely
> overflows the 512k L2 cache (i am using a total of 8k packet buffers,
> touching 64 bytes each for the payload, plus 24 bytes each for
> descriptors).
> Unfortunately at these speeds even small things matter a lot!

It may help to use non-temporal stores to fill the packet buffers.
Because this data will never be read again by the CPU, caching it is
useless. Also, non-temporal stores may help avoid reading a cache line
only to overwrite it completely.

With SSE, this could be done with a loop of four MOVUPS and four MOVNTPS
instructions, transferring 64 bytes per iteration, and an SFENCE at the
end (or the corresponding intrinsics from <xmmintrin.h>: _mm_loadu_ps(),
_mm_stream_ps(), _mm_sfence()).

For the receive side, there are also various non-temporal loads and
prefetch instructions.

On the other hand, because generating small packets only writes to 64
bytes of each 2048-byte aligned block, only a small portion of the cache
will be polluted. This is because caches are usually not fully
associative. This small portion could contain other important data,
however. When generating full 1500-byte packets, most of the cache will
be polluted.

Because caching is not useful for the ring buffers, it is probably not a
problem that they are laid out in such a way that they cannot be cached
efficiently.

-- 
Jilles Tjoelker
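
For illustration, the SSE copy loop described in the message (four
MOVUPS loads and four MOVNTPS stores per 64 bytes, then one SFENCE)
might look roughly like the sketch below. The function name copy_nt64
and its preconditions are assumptions made for this example, not part of
the original test program. It only shows the intrinsics named in the
message; whether it actually helps would have to be measured.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/*
 * Hypothetical helper: copy len bytes from src to dst using
 * non-temporal stores, 64 bytes per loop iteration.
 *
 * Assumptions for this sketch:
 *   - dst is 16-byte aligned (MOVNTPS requires an aligned destination),
 *   - len is a multiple of 64,
 *   - src may be unaligned (loaded with MOVUPS).
 */
static void
copy_nt64(void *dst, const void *src, size_t len)
{
    float *d = (float *)dst;
    const float *s = (const float *)src;
    size_t i;

    for (i = 0; i < len / 64; i++) {
        /* Four unaligned 16-byte loads (MOVUPS). */
        __m128 a = _mm_loadu_ps(s + 0);
        __m128 b = _mm_loadu_ps(s + 4);
        __m128 c = _mm_loadu_ps(s + 8);
        __m128 e = _mm_loadu_ps(s + 12);

        /* Four non-temporal 16-byte stores (MOVNTPS). */
        _mm_stream_ps(d + 0, a);
        _mm_stream_ps(d + 4, b);
        _mm_stream_ps(d + 8, c);
        _mm_stream_ps(d + 12, e);

        s += 16;    /* 16 floats = 64 bytes */
        d += 16;
    }

    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
}
```

A packet generator could, under these assumptions, prepare one 64-byte
template payload and call copy_nt64(buf, template, 64) for each packet
buffer before handing the buffer to the NIC.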