From owner-freebsd-arm@FreeBSD.ORG  Mon Mar  8 02:16:51 2010
Return-Path: <owner-freebsd-arm@FreeBSD.ORG>
Delivered-To: freebsd-arm@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0CAB11065673
	for <freebsd-arm@freebsd.org>; Mon,  8 Mar 2010 02:16:51 +0000 (UTC)
	(envelope-from ticso@cicely7.cicely.de)
Received: from raven.bwct.de (raven.bwct.de [85.159.14.73])
	by mx1.freebsd.org (Postfix) with ESMTP id 8DC318FC16
	for <freebsd-arm@freebsd.org>; Mon,  8 Mar 2010 02:16:50 +0000 (UTC)
Received: from mail.cicely.de ([10.1.1.37])
	by raven.bwct.de (8.13.4/8.13.4) with ESMTP id o282Gj23076897
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
	Mon, 8 Mar 2010 03:16:46 +0100 (CET)
	(envelope-from ticso@cicely7.cicely.de)
Received: from cicely7.cicely.de (cicely7.cicely.de [10.1.1.9])
	by mail.cicely.de (8.14.3/8.14.3) with ESMTP id o282GgQo096116
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Mon, 8 Mar 2010 03:16:42 +0100 (CET)
	(envelope-from ticso@cicely7.cicely.de)
Received: from cicely7.cicely.de (localhost [127.0.0.1])
	by cicely7.cicely.de (8.14.2/8.14.2) with ESMTP id o282GgnY014523;
	Mon, 8 Mar 2010 03:16:42 +0100 (CET)
	(envelope-from ticso@cicely7.cicely.de)
Received: (from ticso@localhost)
	by cicely7.cicely.de (8.14.2/8.14.2/Submit) id o282GgqM014522;
	Mon, 8 Mar 2010 03:16:42 +0100 (CET) (envelope-from ticso)
Date: Mon, 8 Mar 2010 03:16:42 +0100
From: Bernd Walter <ticso@cicely7.cicely.de>
To: Mark Tinguely <tinguely@casselton.net>
Message-ID: <20100308021642.GQ11192@cicely7.cicely.de>
References: <FB81E027-0CCC-4DF6-A29F-88920A39556B@semihalf.com>
	<201003072125.o27LPfFb000968@casselton.net>
	<20100308002704.GL11192@cicely7.cicely.de>
	<20100308013105.GP11192@cicely7.cicely.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100308013105.GP11192@cicely7.cicely.de>
X-Operating-System: FreeBSD cicely7.cicely.de 7.0-STABLE i386
User-Agent: Mutt/1.5.11
X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED=-1.8, AWL=0.001,
	BAYES_00=-2.599 autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on spamd.cicely.de
Cc: freebsd-arm@freebsd.org
Subject: Re: Performance of SheevaPlug on 8-stable
X-BeenThere: freebsd-arm@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: ticso@cicely.de
List-Id: Porting FreeBSD to the StrongARM Processor <freebsd-arm.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arm>,
	<mailto:freebsd-arm-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arm>
List-Post: <mailto:freebsd-arm@freebsd.org>
List-Help: <mailto:freebsd-arm-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arm>,
	<mailto:freebsd-arm-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 08 Mar 2010 02:16:51 -0000

On Mon, Mar 08, 2010 at 02:31:05AM +0100, Bernd Walter wrote:
> On Mon, Mar 08, 2010 at 01:27:04AM +0100, Bernd Walter wrote:
> > On Sun, Mar 07, 2010 at 03:25:41PM -0600, Mark Tinguely wrote:
> > > 
> > > FreeBSD-current has kernel and user witness turned on. Witness is for
> > > locks, so it should not change the performance of a tight arithmetic loop
> > > like this.
> > 
> > I have no kernel debugging enabled.
> > I have no malloc.conf on current, but I have on the 8.0-current system,
> > so malloc debugging is enabled on one machine, but it shouldn't hurt in
> > this case since it is not allocating anything.
> > 
> > > I don't know the Marvell internals, and from what I can tell, their
> > > technical docs require an NDA. That said, many of the ARM processors also
> > > have an internal instruction cache (instruction prefetch) in addition to
> > > the instruction cache. I don't think the prefetch has an enable/disable.
> > > 
> > > From the CPU identification it looks like branch prediction is
> > > turned on. Branch prediction compensates for the longer pipelines.
> > > I can't see how that could go astray in a tight loop.
> > > 
> > > Thus says the ARM ARM:
> > > 
> > > 	ARM implementations are free to choose how far ahead of the
> > > 	current point of execution they prefetch instructions; either
> > > 	a fixed or a dynamically varying number of instructions. As well
> > > 	as being free to choose how many instructions to prefetch, an ARM
> > > 	implementation can choose which possible future execution path to
> > > 	prefetch along. For example, after a branch instruction, it can
> > > 	choose to prefetch either the instruction following the branch
> > > 	or the instruction at the branch target. This is known as branch
> > > 	prediction.
> > > 
> > > There are a few data dangling allocations that I would like to see
> > > closed from the multiple kernel allocation fix. *IN THEORY, IF* a page
> > > is allocated via the arm_nocache (DMA COHERENT) or a sendfile, then
> > > it is never marked as unallocated. *IN THEORY*, if that page is used
> > > again, then we could falsely believe that page is being shared and
> > > we turn off the cache, even though it is not shared.
> > > 
> > > 	http://www.casselton.net/~tinguely/arm_pmap_unmanaged.diff
> > > 
> > > * Disclaimer: I am not sure whether DMA COHERENT or sendfile is used in
> > > the Sheeva implementation. This is a theoretical observation of a side
> > > effect of the multiple kernel mapping patch that we did just before
> > > FreeBSD 8-release.
> 
> This sounds possible.
> My 8.0-current system should be before that change and it is much faster
> than my current system.
> It is still slower than the calculated ~80s and the difference looks
> a bit large to just think it is a stalled pipeline because of the branch.
> Has anyone access to a RM9200 system running Linux?

With your patch my current system is faster as well.
[55]chipmunk.cicely.de# ./test 
207.000u 0.000s 4:01.13 86.0%   46+1516k 0+0io 0pf+0w
[56]chipmunk.cicely.de# ./test
207.000u 0.000s 3:55.66 87.9%   45+1516k 0+0io 0pf+0w

It still puzzles me why it is not near 80 seconds.
That would mean it is losing something like 5-6 cycles.
Well, OK, the pipeline might be that long, and real loops are
mostly a few instructions longer.
But I would still be interested to see Linux results on an RM9200.
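
The arithmetic behind the "5-6 cycles" estimate can be sketched roughly as
follows. This is only an illustration under stated assumptions: the 207 s
user time and the ~80 s expectation come from the numbers above, but the
180 MHz clock and the 2**32 iteration count are guesses made for this
sketch and are not stated anywhere in this thread. Substitute the real
values for your board and test program.

```python
def lost_cycles_per_iteration(measured_s, expected_s, clock_hz, iterations):
    """Return the extra CPU cycles per loop iteration beyond the expected runtime."""
    extra_seconds = measured_s - expected_s
    return extra_seconds * clock_hz / iterations


MEASURED_USER_S = 207.0    # user time reported by ./test above
EXPECTED_S = 80.0          # the calculated ~80 s mentioned above
CLOCK_HZ = 180_000_000     # ASSUMPTION: 180 MHz core clock, not stated in the thread
ITERATIONS = 2 ** 32       # ASSUMPTION: hypothetical loop count, not stated in the thread

lost = lost_cycles_per_iteration(MEASURED_USER_S, EXPECTED_S, CLOCK_HZ, ITERATIONS)
print(f"extra cycles per iteration: {lost:.2f}")
```

With those assumed values the result lands between 5 and 6 extra cycles per
iteration; with a different clock or loop count it scales accordingly.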

-- 
B.Walter <bernd@bwct.de> http://www.bwct.de
Modbus/TCP Ethernet I/O Baugruppen, ARM basierte FreeBSD Rechner uvm.