From: Sam Leffler <sam@freebsd.org>
Date: Tue, 09 Sep 2008 08:33:07 -0700
To: Jacques Fourie
Cc: freebsd-arm@freebsd.org
Subject: Re: Routing benchmarks

Jacques Fourie wrote:
> On Tue, Sep 9, 2008 at 5:02 PM, Sam Leffler wrote:
>
>> Jacques Fourie wrote:
>>
>>> On Tue, Sep 9, 2008 at 3:55 PM, Stanislav Sedov wrote:
>>>
>>>> On Tue, 9 Sep 2008 15:33:30 +0200
>>>> "Jacques Fourie" mentioned:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've performed some benchmark tests on my Gumstix Connex 400 (Intel
>>>>> XScale PXA255 CPU clocked at 400MHz) with a netDuo expansion board.
>>>>> This board has two smc network interfaces. I configure the Gumstix
>>>>> as a router and measure network throughput with netperf running on
>>>>> separate boxes on either side of the Gumstix. My initial tests
>>>>> showed a TCP throughput of 2Mbit/s. After adapting the smc driver
>>>>> to use DMA this figure went up to 7Mbit/s. Although this is a
>>>>> significant improvement, it still seems a bit slow. Does anyone
>>>>> have any tips on how I can go about figuring out where the
>>>>> bottleneck lies? Initial profiling showed that a significant
>>>>> amount of time was spent doing memory-to-memory copies of data,
>>>>> but after the DMA change profiling does not show any obvious
>>>>> culprits.
>>>>>
>>>> Have you tried checking the speed of the interface itself, without
>>>> routing involved? Could it be the interfaces themselves that are so
>>>> slow?
>>>>
>>>> --
>>>> Stanislav Sedov
>>>> ST4096-RIPE
>>>>
>>> Running netserver on the Gumstix shows a throughput of 2.4Mbit/s. At
>>> the moment I can't get if_bridge to work - will try to figure out
>>> what is going on. A bridging benchmark may be more informative.
>>>
>> You said you did profiling but you didn't provide the data to inspect.
>> It's possible kernel profiling has never been tried on your platform;
>> did you sanity-check the results? (e.g. run a known test load and
>> check the results; verify all routines that should execute appear in
>> the profile.)
>>
>> Also, if copy overhead shows up as significant, look to see why those
>> copies are being done; it's often possible to avoid a copy.
>>
>> My experience in working with architectures like this is that cache
>> handling can be a significant cost that doesn't always show up in a
>> profile.
>>
>> Also, you may find useful information by timestamping mbufs with the
>> h/w clock at important places along the "fast path", then looking at
>> whether the overhead for each step is reasonable. I did this for
>> bridged traffic by forcing the rx dma to go to an mbuf+cluster, then
>> used the free storage in the mbuf header to store timestamps. At the
>> end of the processing path I sorted the data into buckets by the
>> sample points and added a sysctl to dump the histogram to see
>> min/max/avg.
>>
>> Sam
>>
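The idea, roughly (a minimal from-scratch sketch, not the code actually
used for the bridging measurements: the sample points, names, and the
sysctl are invented, and get_cyclecount() stands in for whatever h/w
clock the platform provides):

/*
 * Per-mbuf fast-path timestamping.  When rx DMA is forced into an
 * external cluster, the part of the mbuf's internal data area past
 * struct m_ext is unused, so it can carry scratch timestamps.
 * 32-bit samples suffice since only deltas are examined.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/mbuf.h>
#include <sys/sysctl.h>
#include <machine/cpu.h>

enum { TS_RX = 0, TS_IP_IN, TS_IP_OUT, TS_TX, TS_COUNT };

/* Scratch space in the header mbuf; valid only when M_EXT is set. */
#define	MB_TS(m)	((uint32_t *)(&(m)->m_ext + 1))

/* Record the clock at one sample point along the fast path. */
static __inline void
ts_stamp(struct mbuf *m, int point)
{
	if (m->m_flags & M_EXT)		/* packet data is in the cluster */
		MB_TS(m)[point] = (uint32_t)get_cyclecount();
}

/* Power-of-two buckets: fls() of a 32-bit delta is 0..32. */
#define	TS_NBUCKETS	33
static uint32_t ts_hist[TS_COUNT - 1][TS_NBUCKETS];

/* At the end of the path, bin the delta for each step. */
static void
ts_collect(struct mbuf *m)
{
	uint32_t delta;
	int i;

	if ((m->m_flags & M_EXT) == 0)
		return;
	for (i = 0; i < TS_COUNT - 1; i++) {
		delta = MB_TS(m)[i + 1] - MB_TS(m)[i];
		ts_hist[i][fls(delta)]++;
	}
}

/* Dump the raw buckets; min/max/avg fall out in userland. */
static int
ts_sysctl(SYSCTL_HANDLER_ARGS)
{
	return (SYSCTL_OUT(req, ts_hist, sizeof(ts_hist)));
}
SYSCTL_PROC(_debug, OID_AUTO, fastpath_hist,
    CTLTYPE_OPAQUE | CTLFLAG_RD, NULL, 0, ts_sysctl, "",
    "fast-path latency histogram (raw power-of-two buckets)");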
> Thanks for the nice idea - will try something similar. At the moment
> I'm also suspecting that cache handling has a lot to do with the
> performance figures I'm seeing. The PXA255 has a 32KB data cache and a
> 32KB instruction cache.
>
I was thinking more of cases where you must flush the d-cache because a
memory object is treated r/w (e.g. packet data). bus_dmamap_sync ops can
do cache flushes that may not be required or may be overly expensive.
Also, sometimes you can get away with treating objects as read-only and
avoiding the cache flush.
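Roughly, the pattern on an rx path (again a minimal sketch: the rxsoft
structure and names are invented, while bus_dmamap_sync() and the
BUS_DMASYNC_* ops are the standard busdma interface):

/*
 * Where busdma sync ops touch the cache on a typical rx completion.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

struct rxsoft {
	bus_dmamap_t	rxs_map;
	struct mbuf	*rxs_m;
};

static struct mbuf *
rx_complete(bus_dma_tag_t tag, struct rxsoft *rxs, int len)
{
	struct mbuf *m = rxs->rxs_m;

	/*
	 * Make the DMA'd data visible to the CPU.  On a write-back
	 * d-cache this invalidates every line covering the buffer;
	 * syncing the whole cluster when only the headers will be
	 * looked at is the "overly expensive" case.
	 */
	bus_dmamap_sync(tag, rxs->rxs_map, BUS_DMASYNC_POSTREAD);
	bus_dmamap_unload(tag, rxs->rxs_map);

	m->m_pkthdr.len = m->m_len = len;

	/*
	 * If the rest of the path only reads the packet (no in-place
	 * header rewrite), the cache lines stay clean and the next
	 * PREREAD sync of this buffer has nothing to write back; that
	 * is the "treat it read-only" saving.
	 */
	return (m);
}

Sam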