From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 01:49:57 2006
Date: Fri, 22 Dec 2006 14:49:53 +1300
From: Mark Kirkwood <markir@paradise.net.nz>
To: freebsd-performance@freebsd.org
Message-id: <458B39C1.2080906@paradise.net.nz>
In-reply-to: <458B3651.8090601@paradise.net.nz>
Subject: Re: Cached file read performance

Mark Kirkwood wrote:

> Anyway on to the results: I used the attached program to read a cached

Silly bug in the attached program: the lseek failure test has 1 instead
of -1 (finger trouble).

*** readtest.c.orig	Fri Dec 22 14:43:42 2006
--- readtest.c	Fri Dec 22 14:43:24 2006
***************
*** 103,109 ****
      }
    } else {
      offset = (off_t) (random() % (numblocks - 1)) * blocksz;
!     if (lseek(fd, offset, SEEK_SET) == 1) {
        perror("seek failed");
        exit(1);
      }
--- 103,109 ----
      }
    } else {
      offset = (off_t) (random() % (numblocks - 1)) * blocksz;
!     if (lseek(fd, offset, SEEK_SET) == -1) {
        perror("seek failed");
        exit(1);
      }

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 01:50:18 2006
Date: Fri, 22 Dec 2006 14:35:13 +1300
From: Mark Kirkwood <markir@paradise.net.nz>
To: freebsd-performance@freebsd.org
Message-id: <458B3651.8090601@paradise.net.nz>
Subject: Cached file read performance

I recently did some testing on the performance of cached reads using two
(almost identical) systems, one running FreeBSD 6.2PRE and the other
running Gentoo Linux - the latter acting as a control. I initially
started a thread of the same name on -stable, but it was suggested I
submit a mail here.

My background for wanting to examine this is that I work on developing
database software (postgres internals related), and cached read
performance is pretty important - we typically try hard to encourage
cached access whenever possible.

Anyway, on to the results: I used the attached program to read a cached
781MB file sequentially and randomly with a specified block size (see
below). The conclusion I came to was that our (i.e. FreeBSD) cached read
performance (particularly for smaller block sizes) could perhaps be
improved... now I'm happy to help in any way - the machine I've got
running STABLE can be upgraded to CURRENT in order to try out patches
(or in fact to see if CURRENT is faster at this already!)...
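For readers who don't have the attachment handy, a minimal sketch of the
kind of harness described above (not the actual readtest.c; the argument
handling, timing method, and output format are approximations) might look
like this:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/stat.h>

/* Minimal cached-read timer: read every block of the file, either
 * sequentially or at random offsets, and report the byte rate. */
int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s file blocksz sequential(0|1)\n", argv[0]);
		exit(1);
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd == -1) {
		perror("open failed");
		exit(1);
	}
	long blocksz = atol(argv[2]);
	int sequential = atoi(argv[3]);
	struct stat st;
	fstat(fd, &st);
	long numblocks = st.st_size / blocksz;
	char *buf = malloc(blocksz);
	struct timeval start, end;
	gettimeofday(&start, NULL);
	for (long i = 0; i < numblocks; i++) {
		if (!sequential) {
			off_t offset = (off_t) (random() % (numblocks - 1)) * blocksz;
			if (lseek(fd, offset, SEEK_SET) == -1) {	/* note: -1, per the patch */
				perror("seek failed");
				exit(1);
			}
		}
		if (read(fd, buf, blocksz) == -1) {
			perror("read failed");
			exit(1);
		}
	}
	gettimeofday(&end, NULL);
	double elapsed = (end.tv_sec - start.tv_sec) +
	    (end.tv_usec - start.tv_usec) / 1e6;
	printf("%s reads: %ld of: %ld bytes elapsed: %.4fs io rate: %.0f bytes/s\n",
	    sequential ? "sequential" : "random", numblocks, blocksz,
	    elapsed, numblocks * (double)blocksz / elapsed);
	return 0;
}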
Best wishes

Mark

----------------------results-etc---------------------------------

Machines
========

FreeBSD (6.2-PRERELEASE #7: Mon Nov 27 19:32:33 NZDT 2006):
- Supermicro P3TDER
- 2x SL5QL 1.26 GHz PIII
- 2x Kingston PC133 RCC Registered 1GB DIMMs
- 3Ware 7506 4x Maxtor Plus 9 ATA-133 7200 80G
- Kernel GENERIC + SMP
- /etc/malloc.conf -> >aj
- ufs2 32k blocksize, 4K fragments
- RAID0 256K stripe using twe driver

Gentoo (2.6.18-gentoo-r3):
- Supermicro P3TDER
- 2x SL5QL 1.26 GHz PIII
- 2x Kingston PC133 RCC Registered 1GB DIMMs
- Promise TX4000 4x Maxtor Plus 8 ATA-133 7200 40G
- default make CFLAGS (-O2 -march=i686)
- xfs stripe width 2
- RAID0 256K stripe using md driver (software RAID)

Given the tests were about cached I/O, the differences in RAID controller
and the disks themselves were seen as not significant (indeed, booting
the FreeBSD box with the Gentoo livecd and running the tests there
confirmed this).

Results
=======

FreeBSD:
--------

$ ./readtest /data0/dump/file 8192 0
random reads: 100000 of: 8192 bytes elapsed: 4.4477s io rate: 184186327 bytes/s
$ ./readtest /data0/dump/file 8192 1
sequential reads: 100000 of: 8192 bytes elapsed: 1.9797s io rate: 413804878 bytes/s
$ ./readtest /data0/dump/file 32768 0
random reads: 25000 of: 32768 bytes elapsed: 2.0076s io rate: 408040469 bytes/s
$ ./readtest /data0/dump/file 32768 1
sequential reads: 25000 of: 32768 bytes elapsed: 1.7068s io rate: 479965034 bytes/s
$ ./readtest /data0/dump/file 65536 0
random reads: 12500 of: 65536 bytes elapsed: 1.7856s io rate: 458778279 bytes/s
$ ./readtest /data0/dump/file 65536 1
sequential reads: 12500 of: 65536 bytes elapsed: 1.6611s io rate: 493158866 bytes/s

Gentoo:
-------

$ ./readtest /data0/dump/file 8192 0
random reads: 100000 of: 8192 bytes elapsed: 1.2698s io rate: 645155193 bytes/s
$ ./readtest /data0/dump/file 8192 1
sequential reads: 100000 of: 8192 bytes elapsed: 1.1329s io rate: 723129371 bytes/s
$ ./readtest /data0/dump/file 32768 0
random reads: 25000 of: 32768 bytes elapsed: 1.1583s io rate: 707244595 bytes/s
$ ./readtest /data0/dump/file 32768 1
sequential reads: 25000 of: 32768 bytes elapsed: 1.1178s io rate: 732838631 bytes/s
$ ./readtest /data0/dump/file 65536 0
random reads: 12500 of: 65536 bytes elapsed: 1.1478s io rate: 713742417 bytes/s
$ ./readtest /data0/dump/file 65536 1
sequential reads: 12500 of: 65536 bytes elapsed: 1.1012s io rate: 743921133 bytes/s

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 02:08:10 2006
Date: Fri, 22 Dec 2006 10:08:12 +0800
From: David Xu <davidxu@freebsd.org>
To: Mark Kirkwood
Cc: freebsd-performance@freebsd.org
Message-ID: <458B3E0C.6090104@freebsd.org>
In-Reply-To: <458B3651.8090601@paradise.net.nz>
Subject: Re: Cached file read performance
Mark Kirkwood wrote:
> I recently did some testing on the performance of cached reads using
> two (almost identical) systems, one running FreeBSD 6.2PRE and the
> other running Gentoo Linux - the latter acting as a control.
[...]

I suspect that in such a test, memory copying speed will be a key factor.
I don't have numbers to back up my idea, but I think Linux has lots of
tweaks, such as using MMX instructions to copy data.

Regards,
David Xu

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 02:31:42 2006
Date: Fri, 22 Dec 2006 15:31:35 +1300
From: Mark Kirkwood <markir@paradise.net.nz>
To: David Xu
Cc: freebsd-performance@freebsd.org
Message-id: <458B4387.9090409@paradise.net.nz>
In-reply-to: <458B3E0C.6090104@freebsd.org>
Subject: Re: Cached file read performance

David Xu wrote:
> I suspect that in such a test, memory copying speed will be a key
> factor. I don't have numbers to back up my idea, but I think Linux has
> lots of tweaks, such as using MMX instructions to copy data.

David - very interesting - checking the 2.6.18 sources I see
(arch/i386/lib/memcpy.c:7):

void *memcpy(void *to, const void *from, size_t n)
{
#ifdef CONFIG_X86_USE_3DNOW
	return __memcpy3d(to, from, n);
#else
	return __memcpy(to, from, n);
#endif
}

If I understand this correctly, I need CONFIG_X86_USE_3DNOW (or perhaps
CONFIG_M586MMX) set in my Linux kernel config to be using these... which
I don't appear to have (I'll do some more digging and see if maybe
profiling tells us anything useful).

Cheers

Mark

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 04:35:34 2006
Date: Fri, 22 Dec 2006 13:09:48 +0900
From: "Adrian Chadd" <adrian.chadd@gmail.com>
To: "David Xu"
Cc: freebsd-performance@freebsd.org, Mark Kirkwood
Subject: Re: Cached file read performance

On 22/12/06, David Xu <davidxu@freebsd.org> wrote:

> I suspect that in such a test, memory copying speed will be a key
> factor. I don't have numbers to back up my idea, but I think Linux has
> lots of tweaks, such as using MMX instructions to copy data.
I had the opportunity to study the AMD Athlon XP optimisation guide and
noted that their example copy routine, optimised for the chipset, was a
hell of a lot faster than a straight block copy.

Has anyone here done any similar modifications to optimise
copyin/copyout? I can't imagine it'd be a bad thing to have.

Adrian

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 05:44:51 2006
Date: Thu, 21 Dec 2006 23:16:41 -0600
From: Eric Anderson <anderson@centtech.com>
To: Mark Kirkwood
Cc: freebsd-performance@freebsd.org
Message-ID: <458B6A39.5040902@centtech.com>
In-Reply-To: <458B3651.8090601@paradise.net.nz>
Subject: Re: Cached file read performance

On 12/21/06 19:35, Mark Kirkwood wrote:
> I recently did some testing on the performance of cached reads using
> two (almost identical) systems, one running FreeBSD 6.2PRE and the
> other running Gentoo Linux - the latter acting as a control.
[...]
> Given the tests were about cached I/O, the differences in RAID
> controller and the disks themselves were seen as not significant
> (indeed, booting the FreeBSD box with the Gentoo livecd and running
> the tests there confirmed this).

[..snip of useful results..]

Aren't you also slightly testing parts of the file system code? Why not
(since it is only read-only you are interested in) use FreeBSD's xfs
support (only in -CURRENT however) and run the tests there as well? I'm
just curious whether it would make any difference - I would bet not much
of any though.

Eric

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 06:52:21 2006
Date: Fri, 22 Dec 2006 01:26:14 -0500
From: Mike Jakubik <mikej@rogers.com>
To: performance@freebsd.org
Message-ID: <458B7A86.5060908@rogers.com>
Subject: Re: Cached file read performance with 6.2-PRERELEASE

Has anyone tried these tests with 4.x? Well, I did, and I was surprised
how good the performance is: it gave me the highest numbers of all the
tests, even compared to much faster hardware. Although this is all
different hardware, it seems like the performance drops the higher the
version of FreeBSD is, specifically right after 6.1. Is there a
possibility that some performance problem was introduced around that
time?

All tests used a 234M file created with
"dd if=/dev/zero of=/tmp/file bs=8k count=30000".

---

FreeBSD 4.11-STABLE, Pentium(R) 4 CPU 2.60GHz, 512MB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.298992 secs (821962015 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.221009 secs (1111990834 bytes/sec)

FreeBSD 6.1-STABLE, Dual Core AMD Opteron(tm) Processor 170
(2009.27-MHz K8-class CPU), 1GB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.289550 secs (848765132 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.243281 secs (1010190329 bytes/sec)

FreeBSD 6.1-STABLE, Intel(R) Pentium(R) D CPU 3.20GHz
(3118.91-MHz 686-class CPU), 1GB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.354899 secs (692478377 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.285909 secs (859574388 bytes/sec)

FreeBSD 6.2-PRERELEASE, AMD Athlon(tm) 64 Processor 3000+
(2002.58-MHz K8-class CPU), 512MB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.354382 secs (693488872 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.356816 secs (688758249 bytes/sec)

FreeBSD 6.2-PRERELEASE, Intel(R) Pentium(R) 4 CPU 1.80GHz
(1796.94-MHz 686-class CPU), 512MB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.483906 secs (507867448 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.390824 secs (628825123 bytes/sec)
FreeBSD 7.0-CURRENT (all debugging off), AMD Athlon(tm) Processor
(1410.21-MHz 686-class CPU), 512MB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.846895 secs (290189464 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.794950 secs (309151516 bytes/sec)

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 11:18:09 2006
Date: Fri, 22 Dec 2006 22:18:04 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Mark Kirkwood
Cc: freebsd-performance@FreeBSD.org
Message-ID: <20061222171431.L18486@delplex.bde.org>
In-Reply-To: <458B3651.8090601@paradise.net.nz>
Subject: Re: Cached file read performance

On Fri, 22 Dec 2006, Mark Kirkwood wrote:

> I recently did some testing on the performance of cached reads using two
> (almost identical) systems, one running FreeBSD 6.2PRE and the other
> running Gentoo Linux - the latter acting as a control. I initially
> started a thread of the same name on -stable, but it was suggested I
> submit a mail here.

Linux has less bloat in the file system and (cached, at least) block i/o
paths, so it won't be competed with fully any time soon. However, the
differences shouldn't be more than a factor of 2.

> conclusion I came to was that our (i.e FreeBSD) cached read performance
> (particularly for smaller block sizes) could perhaps be improved... now I'm

None was attached.

> Machines
> ========
>
> FreeBSD (6.2-PRERELEASE #7: Mon Nov 27 19:32:33 NZDT 2006):
> - Supermicro P3TDER
> - 2x SL5QL 1.26 GHz PIII
> - 2x Kingston PC133 RCC Registered 1GB DIMMs
> - 3Ware 7506 4x Maxtor Plus 9 ATA-133 7200 80G
> - Kernel GENERIC + SMP
> - /etc/malloc.conf -> >aj
> - ufs2 32k blocksize, 4K fragments
    ^^^^^^^^^^^^^^^^^^

Try using an unpessimized block size. Block sizes larger than BKVASIZE
(default 16K) fragment the buffer cache virtual memory. However, I
couldn't see much difference between block sizes of 16, 32 and 64K for a
small (32MB) md-malloced file system with a simple test program. All
versions got nearly 1/4 of the bandwidth of main memory (800MB/s +-10%
on an AthlonXP with ~PC3200 memory).
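(As a rough sanity check on the "fraction of main memory bandwidth"
framing: PC3200 peaks around 3200 MB/s, so 800 MB/s is about 1/4. One can
estimate the copy bandwidth ceiling by timing plain memcpy() over a
buffer much larger than the caches; a minimal sketch, with the size and
iteration count chosen arbitrarily:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Rough main-memory copy bandwidth: copy 64MB (well beyond L2) a few
 * times and report MB/s, counting each byte once as readtest does. */
int main(void)
{
	size_t len = 64 * 1024 * 1024;
	int iters = 8;
	char *src = malloc(len), *dst = malloc(len);
	memset(src, 1, len);	/* fault the pages in first */
	memset(dst, 0, len);
	struct timeval t0, t1;
	gettimeofday(&t0, NULL);
	for (int i = 0; i < iters; i++)
		memcpy(dst, src, len);
	gettimeofday(&t1, NULL);
	double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%.0f MB/s\n", iters * (len / 1048576.0) / s);
	free(src);
	free(dst);
	return 0;
}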
On this system, half of the bandwidth of main memory is (apparently)
unavailable for reads because it has to go through the CPU caches (only
nontemporal writes go at full speed), and another 1/2 of the bandwidth is
lost to system overheads, so 800MB/s is within a factor of 2 of the best
possible.

> - RAID0 256K stripe using twe driver
>
> Gentoo (2.6.18-gentoo-r3):
> - Supermicro P3TDER
> - 2x SL5QL 1.26 GHz PIII
> - 2x Kingston PC133 RCC Registered 1GB DIMMs
> - Promise TX4000 4x Maxtor Plus 8 ATA-133 7200 40G
> - default make CFLAGS (-O2 -march=i686)
> - xfs stripe width 2
> - RAID0 256K stripe using md driver (software RAID)

PIII's and PC133 are very slow these days. I could never get more than a
couple of hundred MB/s main memory copy bandwidth out of PC100. PC133 and
the read bandwidth are not much faster. The read bandwidth on freefall
(800 MHz PIII) with a block size of 4MB is now 500MB/s for my best read
methods.

> Given the tests were about cached I/O, the differences in RAID
> controller and the disks themselves were seen as not significant
> (indeed, booting the FreeBSD box with the Gentoo livecd and running the
> tests there confirmed this).

Yes, if the disk LED blinks then the test is invalid.

> --------
>
> $ ./readtest /data0/dump/file 8192 0
> random reads: 100000 of: 8192 bytes elapsed: 4.4477s io rate: 184186327 bytes/s
> $ ./readtest /data0/dump/file 8192 1
> sequential reads: 100000 of: 8192 bytes elapsed: 1.9797s io rate: 413804878 bytes/s

The speed seems to be limited mainly by main memory bandwidth for
sequential reads and by system overheads for random reads.

> $ ./readtest /data0/dump/file 32768 0
> random reads: 25000 of: 32768 bytes elapsed: 2.0076s io rate: 408040469 bytes/s
> $ ./readtest /data0/dump/file 32768 1
> sequential reads: 25000 of: 32768 bytes elapsed: 1.7068s io rate: 479965034 bytes/s

Now the difference is acceptably small. This also indicates that the
system overhead for random accesses with non-large blocks is too large.

> Gentoo:
> -------
>
> $ ./readtest /data0/dump/file 8192 0
> random reads: 100000 of: 8192 bytes elapsed: 1.2698s io rate: 645155193 bytes/s
> $ ./readtest /data0/dump/file 8192 1
> sequential reads: 100000 of: 8192 bytes elapsed: 1.1329s io rate: 723129371 bytes/s

:-(. I thought that PC133 couldn't go that fast even for a pure memory
benchmark.
Bruce

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 12:38:05 2006
Date: Fri, 22 Dec 2006 23:37:53 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Adrian Chadd
Cc: freebsd-performance@FreeBSD.org, David Xu, Mark Kirkwood
Message-ID: <20061222222757.G18486@delplex.bde.org>
Subject: Re: Cached file read performance

On Fri, 22 Dec 2006, Adrian Chadd wrote:

> On 22/12/06, David Xu <davidxu@freebsd.org> wrote:
>
>> I suspect that in such a test, memory copying speed will be a key
>> factor. I don't have numbers to back up my idea, but I think Linux has
>> lots of tweaks, such as using MMX instructions to copy data.
>
> I had the opportunity to study the AMD Athlon XP optimisation guide and
> noted that their example copy routine, optimised for the chipset, was a
> hell of a lot faster than a straight block copy.
>
> Has anyone here done any similar modifications to optimise
> copyin/copyout? I can't imagine it'd be a bad thing to have.

Sure. It's a larger win mainly in benchmarks. It's a twisty MD maze.
It's a small loss for the MMX method used in Linux (2.6.10 at least) on
the original poster's machine (PIII). The main win is from using
nontemporal writes, but that requires SSE2, and the kernel already uses
these in the most important case (sse2_pagezero(); other cases have
tradeoffs).
Times for some copy methods:

freefall (800MHz PIII), block size 4K (fully cached):
%%%
copy0: 3448148133 B/s ( 29001 us) ( 25037077 tsc) (movsl)
copy1: 1840531252 B/s ( 54332 us) ( 46333183 tsc) (unroll 16)
copy2: 1571211313 B/s ( 63645 us) ( 52383615 tsc) (unroll 16 prefetch)
copy3: 2246932794 B/s ( 44505 us) ( 36824018 tsc) (unroll 64 i586-opt)
copy4: 1970554791 B/s ( 50747 us) ( 43268191 tsc) (unroll 64 i586-opt prefetch)
copy5: 2117741296 B/s ( 47220 us) ( 38651415 tsc) (unroll 64 i586-opx prefetch)
copy6: 1684092760 B/s ( 59379 us) ( 48916219 tsc) (unroll 32 prefetch 2)
copy7: 1506746384 B/s ( 66368 us) ( 54357751 tsc) (unroll 64 fp++)
copy8: 1574228925 B/s ( 63523 us) ( 52241051 tsc) (unroll 128 fp i-prefetch)
copy9: 1579051367 B/s ( 63329 us) ( 51821088 tsc) (unroll 64 fp reordered)
copyA: 1625298552 B/s ( 61527 us) ( 51037242 tsc) (unroll 256 fp reordered++)
copyB: 1633849261 B/s ( 61205 us) ( 50367459 tsc) (unroll 512 fp reordered++)
copyC:  452936367 B/s ( 220781 us) (181557329 tsc) (Terje cksum)
copyD: 1449124640 B/s ( 69007 us) ( 56524152 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE: 1525968138 B/s ( 65532 us) ( 53339199 tsc) (unroll 64 fp i-prefetch++)
copyF: 1513634002 B/s ( 66066 us) ( 54251674 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG: 3389821831 B/s ( 29500 us) ( 23686951 tsc) (memcpy (movsl))
copyK: 3522482088 B/s ( 28389 us) ( 23081104 tsc) (movq)
copyL: 3018586815 B/s ( 33128 us) ( 27671714 tsc) (movq with prefetchnta)
copyM: 3057441649 B/s ( 32707 us) ( 27641525 tsc) (movq with block prefetch)
copya: 2584306603 B/s ( 38695 us) ( 31756341 tsc) (~i686_memcpy (movaps))
%%%

movsl/memcpy is simplest and best here. copyL is like the Linux-2.6.10
_mmx_memcpy(). The latter uses `prefetch', which is older than
prefetchnta and differently unportable. I've never noticed much
difference between these, but the older instruction might work better on
older CPUs like PIII's.

Note that prefetchnta is slower than explicit block prefetch. This
happens on AthlonXP's too, and IIRC the XP optimisation guide points this
out and uses block prefetch for its last and biggest copy optimization.
Note that prefetching is just a loss for the fully cached case. The main
point of interest here is that block prefetch still beats prefetchnta by
an insignificant amount (it might be expected to lose because it takes
more instructions, and the bottleneck in the fully cached case is
instruction execution).

There aren't many methods using XMM registers here because the methods
are limited to ones that work on machines with only plain SSE, and I
couldn't find any such machines where using either MMX or XMM was any
use. copyL uses MMX registers; copya uses XMM registers. copya loses
significantly to movsl/memcpy and copyL.
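(For concreteness: the "movsl" rows in these tables are essentially the
classic rep-string copy. A minimal sketch in GCC-style inline asm for
i386 follows; the kernel's real bcopy is in assembler and also handles
overlap and trailing bytes, and the function name here is illustrative.)

#include <stddef.h>

/* "movsl": copy len bytes as 32-bit words with a rep-string instruction.
 * Assumes len is a multiple of 4 and that the direction flag is clear,
 * as the ABI guarantees. */
static void copy_movsl(void *dst, const void *src, size_t len)
{
	size_t nwords = len / 4;

	__asm__ volatile("rep; movsl"
	    : "+D" (dst), "+S" (src), "+c" (nwords)
	    :
	    : "memory");
}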
freefall, block size 4096K (fully uncached):
%%%
copy0: 199343794 B/s ( 493138 us) (613912221 tsc) (movsl)
copy1: 185455801 B/s ( 530067 us) (636100521 tsc) (unroll 16)
copy2: 181088365 B/s ( 542851 us) (474134548 tsc) (unroll 16 prefetch)
copy3: 183647620 B/s ( 535286 us) (456166441 tsc) (unroll 64 i586-opt)
copy4: 177010464 B/s ( 555357 us) (466214836 tsc) (unroll 64 i586-opt prefetch)
copy5: 176540627 B/s ( 556835 us) (465979821 tsc) (unroll 64 i586-opx prefetch)
copy6: 181682761 B/s ( 541075 us) (457801523 tsc) (unroll 32 prefetch 2)
copy7: 174978240 B/s ( 561807 us) (486332757 tsc) (unroll 64 fp++)
copy8: 192576224 B/s ( 510468 us) (429718012 tsc) (unroll 128 fp i-prefetch)
copy9: 177291074 B/s ( 554478 us) (473659591 tsc) (unroll 64 fp reordered)
copyA: 179384243 B/s ( 548008 us) (476730841 tsc) (unroll 256 fp reordered++)
copyB: 182308792 B/s ( 539217 us) (455082354 tsc) (unroll 512 fp reordered++)
copyC: 132747808 B/s ( 740532 us) (621009558 tsc) (Terje cksum)
copyD: 191875581 B/s ( 512332 us) (434236713 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE: 192663787 B/s ( 510236 us) (430394085 tsc) (unroll 64 fp i-prefetch++)
copyF: 192714776 B/s ( 510101 us) (431859413 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG: 184343619 B/s ( 533265 us) (451905971 tsc) (memcpy (movsl))
copyK: 182133150 B/s ( 539737 us) (479260121 tsc) (movq)
copyL: 185353345 B/s ( 530360 us) (449688860 tsc) (movq with prefetchnta)
copyM: 187979371 B/s ( 522951 us) (442852446 tsc) (movq with block prefetch)
copya: 185523701 B/s ( 529873 us) (465249860 tsc) (~i686_memcpy (movaps))
%%%

movsl/memcpy is still simplest and best. Other methods are only slightly
slower (except copyC, which does a checksum in parallel with read/write;
extra operations combined with copying are free on some machines, but not
here, even in the fully uncached case).
freefall's times may be inaccurate since freefall is loaded, and the
tsc's may be very inaccurate because freefall is SMP, but the following
are very accurate since the machine is unloaded and !SMP.

Athlon XP2600, 193MHz FSB, 8-3-3-2.5 memory (not quite PC3200), block
size 4K:
%%%
copy0:  6492646669 B/s ( 15402 us) ( 34282451 tsc) (movsl)
copy1:  5815290998 B/s ( 17196 us) ( 38282332 tsc) (unroll 16)
copy2:  5099686063 B/s ( 19609 us) ( 44504640 tsc) (unroll 16 prefetch)
copy3:  6580229256 B/s ( 15197 us) ( 33837406 tsc) (unroll 64 i586-opt)
copy4:  6608931597 B/s ( 15131 us) ( 33685684 tsc) (unroll 64 i586-opt prefetch)
copy5:  6620745763 B/s ( 15104 us) ( 33624302 tsc) (unroll 64 i586-opx prefetch)
copy6:  5371132452 B/s ( 18618 us) ( 41448096 tsc) (unroll 32 prefetch 2)
copy7:  7544303584 B/s ( 13255 us) ( 29523386 tsc) (unroll 64 fp++)
copy8:  8178600147 B/s ( 12227 us) ( 28765763 tsc) (unroll 128 fp i-prefetch)
copy9:  9280718701 B/s ( 10775 us) ( 25879055 tsc) (unroll 64 fp reordered)
copyA:  8625128860 B/s ( 11594 us) ( 25817196 tsc) (unroll 256 fp reordered++)
copyB:  8883338723 B/s ( 11257 us) ( 25161370 tsc) (unroll 512 fp reordered++)
copyC:  2927478673 B/s ( 34159 us) ( 76030527 tsc) (Terje cksum)
copyD:  7751918140 B/s ( 12900 us) ( 28727306 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE:  7834514572 B/s ( 12764 us) ( 28403840 tsc) (unroll 64 fp i-prefetch++)
copyF:  7818588272 B/s ( 12790 us) ( 28475409 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG:  6419292849 B/s ( 15578 us) ( 34670640 tsc) (memcpy (movsl))
copyH:  2950106027 B/s ( 33897 us) ( 75467527 tsc) (movntps)
copyI:  2939094286 B/s ( 34024 us) ( 77103399 tsc) (movntps with prefetchnta)
copyJ:  2940477064 B/s ( 34008 us) ( 77144512 tsc) (movntps with block prefetch)
copyK: 11064366453 B/s (  9038 us) ( 20691582 tsc) (movq)
copyL:  9832816519 B/s ( 10170 us) ( 22685200 tsc) (movq with prefetchnta)
copyM:  9853162282 B/s ( 10149 us) ( 22599677 tsc) (movq with block prefetch)
copyN:  2950018998 B/s ( 33898 us) ( 75452984 tsc) (movntq)
copyO:  2933576156 B/s ( 34088 us) ( 77122605 tsc) (movntq with prefetchnta)
copyP:  2885246083 B/s ( 34659 us) ( 77147363 tsc) (movntq with block prefetch)
copyQ:  6749442765 B/s ( 14816 us) ( 32985677 tsc) (movdqa)
copya:  7504108059 B/s ( 13326 us) ( 29680371 tsc) (~i686_memcpy (movaps))
%%%

Now movq is best. It is almost twice as fast as movsl. This is because
movsl only issues 32-bit accesses, and the number of those per cycle has
the same limit as 64-bit accesses, at least for read/write in parallel
(AXP's have some asymmetry for read/write that gets in the way of other
access mixes; A64's are better here). Even the old PI FPU method easily
beats movsl. It was turned off because it was a large loss on PII's.
There are now some SSE+ extensions (movnt*). These use an AthlonXP
extension of SSE. These are just a loss in the fully cached case (and in
all cases for small data, unless you know that the target shouldn't be
cached).
AthlonXP, block size 4096K:
%%%
copy0:  636873680 B/s ( 154354 us) (344356579 tsc) (movsl)
copy1:  649887944 B/s ( 151263 us) (337326810 tsc) (unroll 16)
copy2:  582949855 B/s ( 168632 us) (376274011 tsc) (unroll 16 prefetch)
copy3:  736911544 B/s ( 133400 us) (315267117 tsc) (unroll 64 i586-opt)
copy4:  683944313 B/s ( 143731 us) (320308617 tsc) (unroll 64 i586-opt prefetch)
copy5:  684006179 B/s ( 143718 us) (320114790 tsc) (unroll 64 i586-opx prefetch)
copy6:  656704054 B/s ( 149693 us) (333513466 tsc) (unroll 32 prefetch 2)
copy7:  675350371 B/s ( 145560 us) (324722661 tsc) (unroll 64 fp++)
copy8:  793971554 B/s ( 123813 us) (276326666 tsc) (unroll 128 fp i-prefetch)
copy9:  679120150 B/s ( 144752 us) (322757764 tsc) (unroll 64 fp reordered)
copyA:  650429743 B/s ( 151137 us) (336686142 tsc) (unroll 256 fp reordered++)
copyB:  686849773 B/s ( 143123 us) (318835219 tsc) (unroll 512 fp reordered++)
copyC:  656370811 B/s ( 149769 us) (333826275 tsc) (Terje cksum)
copyD:  777715366 B/s ( 126401 us) (282197950 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE:  779930499 B/s ( 126042 us) (280900317 tsc) (unroll 64 fp i-prefetch++)
copyF:  773888810 B/s ( 127026 us) (283770359 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG:  636189490 B/s ( 154520 us) (344278918 tsc) (memcpy (movsl))
copyH: 1056702749 B/s (  93029 us) (207224289 tsc) (movntps)
copyI: 1072590588 B/s (  91651 us) (204188841 tsc) (movntps with prefetchnta)
copyJ: 1395630138 B/s (  70437 us) (156912756 tsc) (movntps with block prefetch)
copyK:  708242075 B/s ( 138800 us) (309879060 tsc) (movq)
copyL:  706770485 B/s ( 139089 us) (311075317 tsc) (movq with prefetchnta)
copyM:  814300625 B/s ( 120722 us) (269160923 tsc) (movq with block prefetch)
copyN: 1076549051 B/s (  91314 us) (203659502 tsc) (movntq)
copyO: 1066898198 B/s (  92140 us) (205514511 tsc) (movntq with prefetchnta)
copyP: 1413551133 B/s (  69544 us) (155496730 tsc) (movntq with block prefetch)
copyQ:  680954822 B/s ( 144362 us) (321945223 tsc) (movdqa)
copya:  710699826 B/s ( 138320 us) (308106574 tsc) (~i686_memcpy (movaps))
%%%

Now the movnt* methods win easily. Block prefetch wins easily over
prefetchnta. (Unlike for PIII's, I know that it is preferred to plain
"prefetch".)

Athlon64's behave significantly differently here (details not shown):
- movsl is still quite slow
- movsq/memcpy has the same speed as movq (MMX) and movq (64-bit integer)
- the memory system is better relative to the CPU, so the fully cached
  case is not so much faster, especially with DDR2
- prefetchnta now wins over block prefetch, since the memory system now
  actually understands prefetchnta
- movnt* is a larger win.

Memcpy (movsq) is simplest and best again unless movnt* is used. amd64
already uses simplest and best methods except for large copyin/copyout's,
where it should probably use movnt*. It is unclear whether a block size
of 8K is large -- in cases where the application actually uses the data,
it may be best not to use movnt*. movnt* for 8K writes is more likely to
be right, since in many cases the kernel's only "use" of the data is to
DMA it to a disk drive, and for that it should never be put in the CPU's
caches.
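(The movntq variants that win here correspond roughly to the sketch
below, written with SSE intrinsics rather than the raw assembler actually
benchmarked; the prefetch distance, alignment handling, and the kernel's
save/restore of FPU state are all glossed over, and the function name is
illustrative.)

#include <xmmintrin.h>	/* SSE intrinsics: movntq, prefetchnta, sfence */

/* Nontemporal 64-bit copy ("movntq with prefetchnta", roughly copyO):
 * the stores bypass the caches, so the destination does not evict other
 * cached data. Assumes 8-byte alignment and len a multiple of 32. */
static void copy_movntq(void *dst, const void *src, size_t len)
{
	__m64 *d = dst;
	const __m64 *s = src;
	size_t i;

	for (i = 0; i < len / 8; i += 4) {
		_mm_prefetch((const char *)(s + i) + 256, _MM_HINT_NTA);
		_mm_stream_pi(d + i + 0, s[i + 0]);
		_mm_stream_pi(d + i + 1, s[i + 1]);
		_mm_stream_pi(d + i + 2, s[i + 2]);
		_mm_stream_pi(d + i + 3, s[i + 3]);
	}
	_mm_sfence();	/* drain the write-combining buffers */
	_mm_empty();	/* emms: release the MMX/FPU state */
}

(Per the tables, whether bypassing the caches wins depends on whether the
destination is read again soon: for data the CPU will consume
immediately, the plain movq loop is the better choice.)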
Bruce

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 20:22:30 2006
Date: Fri, 22 Dec 2006 20:29:33 +0100
From: Alexander Leidinger <alexander@leidinger.net>
To: Bruce Evans
Cc: Adrian Chadd, rookie@gufi.org, freebsd-performance@FreeBSD.org, Mark Kirkwood, David Xu
Message-ID: <20061222202933.709d2279@Magellan.Leidinger.net>
In-Reply-To: <20061222222757.G18486@delplex.bde.org>
Subject: Re: Cached file read performance

Quoting Bruce Evans (Fri, 22 Dec 2006 23:37:53 +1100 (EST)):

> On Fri, 22 Dec 2006, Adrian Chadd wrote:
>
>> Has anyone here done any similar modifications to optimise
>> copyin/copyout? I can't imagine it'd be a bad thing to have.
>
> Sure. It's a larger win mainly in benchmarks. It's a twisty MD maze.

I want to point out http://www.freebsd.org/projects/ideas/#p-memcpy
here. Just in case someone wants to play around a little bit.

Bye,
Alexander.

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 01:00:45 2006
Date: Sat, 23 Dec 2006 14:00:33 +1300
From: Mark Kirkwood <markir@paradise.net.nz>
To: Bruce Evans
Cc: freebsd-performance@FreeBSD.org
Message-id: <458C7FB1.9020002@paradise.net.nz>
In-reply-to: <20061222171431.L18486@delplex.bde.org>
Subject: Re: Cached file read performance

Bruce Evans wrote:
>
> None was attached.
>

(meaning the c prog, yes?) I notice that it is stripped out from the web
archive... so here's a link:

http://homepages.paradise.net.nz/markir/download/freebsd/readtest.c

>> Machines
>> ========
>> - ufs2 32k blocksize, 4K fragments
>
> Try using an unpessimized block size. Block sizes larger than BKVASIZE
> (default 16K) fragment the buffer cache virtual memory.

Right - I should have said: I saw a comment to that effect in
src/sys/sys/param.h, so I tested with 8K and 16K too. Interestingly, on
my system 32K seemed to be faster, even for the bigger files (of course,
it is hard to know whether that was really significant...).

> However, I
> couldn't see much difference between block sizes of 16, 32 and 64K for
> a small (32MB) md-malloced file system with a simple test program.
> All versions got nearly 1/4 of the bandwidth of main memory (800MB/s
> +-10% on an AthlonXP with ~PC3200 memory).
Cheers

Mark

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 05:21:57 2006
Date: Sat, 23 Dec 2006 13:21:51 +0800
From: David Xu <davidxu@freebsd.org>
To: freebsd-performance@freebsd.org
Cc: Alexander Leidinger, Adrian Chadd, Mark Kirkwood, Bruce Evans, rookie@gufi.org
Message-Id: <200612231321.52178.davidxu@freebsd.org>
In-Reply-To: <20061222202933.709d2279@Magellan.Leidinger.net>
Subject: Re: Cached file read performance

On Saturday 23 December 2006 03:29, Alexander Leidinger wrote:
> I want to point out http://www.freebsd.org/projects/ideas/#p-memcpy
> here. Just in case someone wants to play around a little bit.

I have read the code; if a buffer is not aligned on a 16-byte boundary,
it will not use the FPU to copy data, but user buffers are not always
16-byte aligned.

David Xu

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 09:38:10 2006
Date: Sat, 23 Dec 2006 20:38:04 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Mark Kirkwood
Cc: freebsd-performance@freebsd.org
Message-ID: <20061223175413.W1116@epsplex.bde.org>
In-Reply-To: <458C7FB1.9020002@paradise.net.nz>
Subject: Re: Cached file read performance

On Sat, 23 Dec 2006, Mark Kirkwood wrote:

> Bruce Evans wrote:
>>
>> None was attached.
>>
>
> (meaning the c prog, yes?) I notice that it is stripped out from the web
> archive...
> so here's a link:
>
> http://homepages.paradise.net.nz/markir/download/freebsd/readtest.c
>
>> However, I couldn't see much difference between block sizes of 16, 32
>> and 64K for a small (32MB) md-malloced file system with a simple test
>> program. All versions got nearly 1/4 of the bandwidth of main memory
>> (800MB/s +-10% on an AthlonXP with ~PC3200 memory).

Now I see the problem with a normal file system. The main difference in
my quick test was probably that 32MB is too small to show the problem:
32MB fits in the buffer cache, but slightly larger files only fit in the
VMIO cache, and the main problem is the interaction of these caches.
This behaviour is easy to understand using kernel profiling.

Part of a profile for the random case (reading 400MB with a block size of
4K -- smaller block sizes make larger differences):

%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.70 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 22.5      0.608     0.608   102466     5933     5933  copyout [13]
 21.2      1.180     0.573        0  100.00%           mcount [14]
 10.2      1.457     0.277        0  100.00%           mexitcount [17]
 10.2      1.733     0.276   450823      612      612  buf_splay [18]
  9.7      1.995     0.262   348917      751      751  vm_page_splay [20]
  5.7      2.148     0.153        0  100.00%           cputime [22]
  2.0      2.202     0.054   348017      154      179  vm_page_unwire [26]
  1.8      2.252     0.050    87127      573     3487  getnewbuf [16]
  1.7      2.298     0.047        0  100.00%           user [29]
  1.3      2.332     0.034    87132      388      388  pmap_qremove [31]
  1.1      2.363     0.031    87127      351     4025  allocbuf [15]
  1.0      2.388     0.026   348505       74      117  vm_page_wire [30]
%%%

Sequential case:

%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.35 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 39.3      0.530     0.530   102443     5178     5178  copyout [11]
 23.7      0.850     0.320        0  100.00%           mcount [12]
 11.5      1.004     0.154        0  100.00%           mexitcount [13]
  6.3      1.090     0.085        0  100.00%           cputime [16]
  3.0      1.130     0.040   102816      389      389  vm_page_splay [19]
  1.6      1.151     0.021   409846       52       59  _lockmgr [22]
  1.3      1.168     0.017        0  100.00%           user [23]
  0.9      1.180     0.012   102617      117      117  buf_splay [26]
...
  0.7      1.200     0.009    25603      356     1553  getnewbuf [20]
...
  0.6      1.208     0.009    25603      337     2197  allocbuf [17]
...
  0.6      1.224     0.008   101915       78       96  vm_page_unwire [29]
...
  0.5      1.239     0.007    25608      274      274  pmap_qremove [32]
...
  0.2      1.316     0.002   102409       20       35  vm_page_wire [44]
%%%

It is a buffer-cache/vm problem like I suspected. The file system block
size is 16K, so with a read size of 4K, random reads allocate a new
buffer about 16K/4K = 4 times more often than sequential reads.
Allocation involves vm stuff which is very expensive (it takes about 1.25
times as long as the actual copying). I believe it was even more
expensive before it used splay trees.
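(To put numbers on that ratio from the profiles above: both runs issue
102,400 reads of 4K, matching the ~102,466 copyout() calls. Sequentially,
a new 16K buffer is needed only once per 16K/4K = 4 reads, matching the
~25,603 getnewbuf()/allocbuf() calls in the sequential profile; randomly,
nearly every read misses, giving ~87,127 calls, i.e. about 3.4 times as
many trips through the expensive vm path.)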
More details from separate runs:

Random:
%%%
-----------------------------------------------
                0.00    0.00       5/102404      breadn [237]
                0.01    0.84  102399/102404      cluster_read [10]
[11]    31.3    0.01    0.84  102404         getblk [11]
                0.03    0.32   87126/87126       allocbuf [15]
                0.05    0.25   87126/87126       getnewbuf [16]
                0.01    0.12  189530/189530      gbincore [23]
                0.00    0.06   87126/87126       bgetvp [27]
                0.00    0.00   15278/409852      _lockmgr [32]
                0.00    0.00   15278/15278       bremfree [144]
-----------------------------------------------
%%%

Sequential:
%%%
-----------------------------------------------
                0.00    0.00       6/102404      breadn [371]
                0.01    0.12  102398/102404      cluster_read [14]
[15]     9.5    0.01    0.12  102404         getblk [15]
                0.01    0.05   25603/25605       allocbuf [17]
                0.01    0.03   25603/25603       getnewbuf [18]
                0.00    0.01  128007/128007      gbincore [31]
                0.00    0.00   76801/409846      _lockmgr [22]
                0.00    0.00   25603/25603       bgetvp [39]
                0.00    0.00   76801/76801       bremfree [66]
-----------------------------------------------
%%%

getblk() is called the same number of times for each. In the sequential
case it uses a previously allocated buffer (almost always one allocated
just before) with a probability of almost exactly 0.75, but in the random
case it uses a previously allocated buffer with a probability of about
0.13. The second probability is only larger than epsilon because there is
a buffer pool with a size of a few thousand. Sometimes you get a hit in
this pool, but for large working data sets mostly you don't; then the
buffer must be constituted from vm (or the disk).

This problem is (now) fairly small because most working data sets aren't
large compared with the buffer pool. It was much larger 10 years ago when
the size of the buffer pool was only a few hundred. It was much larger
still more than about 12 years ago in FreeBSD, before the buffer cache
was merged with vm. Then there was only the buffer pool with nothing
between it and the disk, and it was too small.

Linux might not have this problem because it is still using a simple and
better buffer cache. At least 10-15 years ago, its buffer cache had a
fixed block size of 1K where FreeBSD's buffer cache had a variable block
size with the usual size equal to the ffs file system block size of 4K or
8K. With a block size of 1K, at least 4 times as many buffers are needed
to compete on storage with a block size of 4K, and the buffer allocation
routines need to be at least 4 times as efficient to compete on
efficiency. Linux actually had a much larger multiple than 4 for the
storage. I'm not sure about the efficiency factor, but it wasn't too bad
(any in-memory buffer management is better than waiting for the disk, and
the small fixed size of 1K is much easier to manage than larger, variable
sizes).

The FreeBSD buffer management was and is especially unsuited to file
systems with small block sizes like msdosfs floppies (512-blocks) and the
original version of Linux's extfs (1K-blocks). With a buffer cache (pool)
size of 256, you could manage a whole 128KB comprised of 512-blocks, and
got enormous thrashing when accessing a 1200KB floppy. With vm backing
and a buffer cache size of a few thousand, the thrashing only occurs in
memory, and a 1200KB floppy now barely fits in the buffer cache (pool).
Also, no one uses 1200KB floppies. More practically, this problem makes
msdosfs on hard disks (normally 4K-blocks) and ext2fs on hard disks (1K
or 4K blocks) slower than they should be under FreeBSD. vm backing and
clustering masks only some of the slowness.
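(For contrast with the splay-based lookups profiled above, the classic
fixed-block buffer cache described here indexes buffers with a simple
hash on (device, block number). A schematic sketch -- not actual Linux or
FreeBSD code; the names, sizes, and hash function are all illustrative:)

#include <stddef.h>
#include <stdint.h>

/* Schematic old-style buffer cache: fixed 1K buffers hashed by
 * (device, block number). Lookup is a short chain walk, O(1) expected,
 * versus ~1us per buf_splay() call in the profiles above. */
#define BUFHASH_SIZE	1024	/* power of two, illustrative */

struct buf {
	uint32_t	b_dev;
	uint32_t	b_blkno;
	struct buf	*b_hashnext;
	char		b_data[1024];	/* fixed 1K block */
};

static struct buf *bufhash[BUFHASH_SIZE];

static struct buf *
incore_lookup(uint32_t dev, uint32_t blkno)
{
	unsigned h = (dev ^ (blkno * 2654435761u)) & (BUFHASH_SIZE - 1);
	struct buf *bp;

	for (bp = bufhash[h]; bp != NULL; bp = bp->b_hashnext)
		if (bp->b_dev == dev && bp->b_blkno == blkno)
			return bp;
	return NULL;	/* caller must constitute the buffer from vm/disk */
}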
The problem becomes smaller as the read block size approaches the file
system block size, and vanishes when the sizes are identical. Then
there is apparently a different (smaller) problem:

Read size 16K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 49.1      0.565     0.565    25643    22037    22037  copyout [11]
 12.6      0.710     0.145        0  100.00%           mcount [14]
  8.8      0.811     0.101    87831     1153     1153  vm_page_splay [17]
  7.0      0.892     0.081   112906      715      715  buf_splay [19]
  6.1      0.962     0.070        0  100.00%           mexitcount [20]
  3.4      1.000     0.039        0  100.00%           cputime [22]
  1.2      1.013     0.013    86883      153      181  vm_page_unwire [28]
  1.1      1.027     0.013        0  100.00%           user [29]
  1.1      1.040     0.013    21852      595     3725  getnewbuf [18]
%%%

Read size 16K, sequential:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 57.1      0.550     0.550    25643    21464    21464  copyout [11]
 14.2      0.687     0.137        0  100.00%           mcount [12]
  6.9      0.754     0.066        0  100.00%           mexitcount [15]
  4.2      0.794     0.040   102830      391      391  vm_page_splay [19]
  3.8      0.830     0.037        0  100.00%           cputime [20]
  1.4      0.844     0.013   102588      130      130  buf_splay [22]
  1.3      0.856     0.012    25603      488     1920  getnewbuf [17]
  1.0      0.866     0.009    25606      368      368  pmap_qremove [24]
%%%

Now the splay routines are called almost the same number of times, but
take much longer in the random case. buf_splay() seems to be unrelated
to vm -- it is called from gbincore() even if the buffer is already in
the buffer cache. It seems quite slow for that -- almost 1 uS just to
look up a buffer, compared with 21 uS to copyout a 16K buffer. Copying
out a Linux-sized (1K) buffer would take only about 1.5 uS, and then
1 uS to look it up is clearly too much. Another benchmark shows
gbincore() taking 501 nS per call to look up the 64 in-buffer-cache
buffers of a 1MB file -- this must be the best case for it (all these
times are for -current on an Athlon XP2700 overclocked to 2025MHz). The
generic hash function used in my compiler takes 40 nS to hash a 16-byte
string on this machine.

The merged vm/buffer cache is clearly implemented suboptimally. Direct
access to VMIO pages might be better, but it isn't clear how to
implement it without getting the slowest parts of vm for all accesses.
The buffer cache is now essentially just a cache of vm mappings, with
vm mapping being so slow that it needs to be cached. The last thing you
want to do is throw away this cache and have to do a slow mapping for
every access. I think the correct method is to wait for larger virtual
address spaces (already here) and use sparse mappings more.
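For comparison, a hash-style in-core lookup of the kind those numbers
argue for might look like the sketch below (hypothetical types and
names -- this is not the kernel's gbincore()):

%%%
/*
 * Sketch of an O(1) hashed lookup for in-core buffers.  Unlike a
 * splay-tree lookup, it performs no restructuring writes, so random
 * lookups stay cheap and cache-friendly.
 */
#include <stddef.h>
#include <stdint.h>

#define	BUFHASH_SIZE	4096		/* power of 2; an assumption */

struct xvnode;				/* stand-in for the vnode type */

struct xbuf {
	struct xbuf	*b_hashnext;	/* hash chain link */
	struct xvnode	*b_vp;		/* vnode the buffer belongs to */
	long		 b_lblkno;	/* logical block number */
};

static struct xbuf *bufhash[BUFHASH_SIZE];

static size_t
bufhashfn(struct xvnode *vp, long lblkno)
{
	/* Any cheap mix of vnode and block number will do. */
	return (((uintptr_t)vp / 64 + (size_t)lblkno) &
	    (BUFHASH_SIZE - 1));
}

static struct xbuf *
incore_lookup(struct xvnode *vp, long lblkno)
{
	struct xbuf *bp;

	for (bp = bufhash[bufhashfn(vp, lblkno)]; bp != NULL;
	    bp = bp->b_hashnext)
		if (bp->b_vp == vp && bp->b_lblkno == lblkno)
			return (bp);
	return (NULL);			/* not in core */
}
%%%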
Bruce

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 10:27:42 2006
Return-Path:
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id A876A16A416
	for ; Sat, 23 Dec 2006 10:27:42 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226])
	by mx1.freebsd.org (Postfix) with ESMTP id 464D913C448
	for ; Sat, 23 Dec 2006 10:27:42 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162])
	by mailout2.pacific.net.au (Postfix) with ESMTP id 9055010994D;
	Sat, 23 Dec 2006 21:27:39 +1100 (EST)
Received: from epsplex.bde.org (katana.zip.com.au [61.8.7.246])
	by mailproxy1.pacific.net.au (Postfix) with ESMTP id 567BC8C0A;
	Sat, 23 Dec 2006 21:27:39 +1100 (EST)
Date: Sat, 23 Dec 2006 21:27:38 +1100 (EST)
From: Bruce Evans
X-X-Sender: bde@epsplex.bde.org
To: Mark Kirkwood
In-Reply-To: <20061223175413.W1116@epsplex.bde.org>
Message-ID: <20061223205324.B1533@epsplex.bde.org>
References: <458B3651.8090601@paradise.net.nz>
	<20061222171431.L18486@delplex.bde.org>
	<458C7FB1.9020002@paradise.net.nz>
	<20061223175413.W1116@epsplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-performance@freebsd.org
Subject: Re: Cached file read performance
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 23 Dec 2006 10:27:42 -0000

On Sat, 23 Dec 2006, I wrote:

> The problem becomes smaller as the read block size approaches the file
> system block size, and vanishes when the sizes are identical. Then
> there is apparently a different (smaller) problem:
>
> Read size 16K, random:
> %%%
> granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ns/call  ns/call  name
>  49.1      0.565     0.565    25643    22037    22037  copyout [11]
>  12.6      0.710     0.145        0  100.00%           mcount [14]
>   8.8      0.811     0.101    87831     1153     1153  vm_page_splay [17]
>   7.0      0.892     0.081   112906      715      715  buf_splay [19]
>   6.1      0.962     0.070        0  100.00%           mexitcount [20]
>   3.4      1.000     0.039        0  100.00%           cputime [22]
>   1.2      1.013     0.013    86883      153      181  vm_page_unwire [28]
>   1.1      1.027     0.013        0  100.00%           user [29]
>   1.1      1.040     0.013    21852      595     3725  getnewbuf [18]
> %%%
>
> Read size 16K, sequential:
> %%%
> granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ns/call  ns/call  name
>  57.1      0.550     0.550    25643    21464    21464  copyout [11]
>  14.2      0.687     0.137        0  100.00%           mcount [12]
>   6.9      0.754     0.066        0  100.00%           mexitcount [15]
>   4.2      0.794     0.040   102830      391      391  vm_page_splay [19]
>   3.8      0.830     0.037        0  100.00%           cputime [20]
>   1.4      0.844     0.013   102588      130      130  buf_splay [22]
>   1.3      0.856     0.012    25603      488     1920  getnewbuf [17]
>   1.0      0.866     0.009    25606      368      368  pmap_qremove [24]
> %%%
>
> Now the splay routines are called almost the same number of times, but
> take much longer in the random case. buf_splay() seems to be unrelated
> to vm -- it is called from gbincore() even if the buffer is already in
> the buffer cache. It seems quite slow for that -- almost 1 uS just to
> look up a buffer, compared with 21 uS to copyout a 16K buffer.
> Copying out a Linux-sized (1K) buffer would take only about 1.5 uS,
> and then 1 uS to look it up is clearly too much. Another benchmark
> shows gbincore() taking 501 nS per call to look up the 64
> in-buffer-cache buffers of a 1MB file -- this must be the best case
> for it (all these times are for -current on an Athlon XP2700
> overclocked to 2025MHz). The generic hash function used in my
> compiler takes 40 nS to hash a 16-byte string on this machine.

FreeBSD-~4.10 is faster. The difference is especially noticeable when
the read size is the same as the fs block size (16K, as above). Then I
get the following speeds:

    ~4.10, random:      580MB/S
    ~4.10, sequential:  580MB/S
    ~5.2,  random:      575MB/S
    ~5.2,  sequential:  466MB/S

All with kernel profiling not configured, and no INVARIANTS etc. ~5.2
is quite different from -current, but it has buf_splay() and
vm_page_splay(), and behaves similarly in this benchmark.

With profiling, ~4.10, read size 16K, sequential (the numbers in
parentheses are from the corresponding random run):
%%%
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 51.1      0.547     0.547    25643    21323    21323  generic_copyout [9]
 17.3      0.732     0.185        0  100.00%           mcount [10]
  7.9      0.817     0.085        0  100.00%           mexitcount [13]
  5.0      0.870     0.053        0  100.00%           cputime [16]
  1.9      0.891     0.020    51207      395      395  gbincore [20] (424 for random)
  1.4      0.906     0.015   102418      150      253  vm_page_wire [18] (322)
  1.3      0.920     0.014   231218       62       62  splvm [23]
  1.3      0.934     0.014    25603      541     2092  allocbuf [15] (2642)
  1.0      0.945     0.010   566947       18       18  splx [25]
  1.0      0.955     0.010   102122      100      181  vm_page_unwire [21]
  0.9      0.964     0.009    25606      370      370  pmap_qremove [27]
  0.9      0.973     0.009    25603      359     2127  getnewbuf [14] (2261)
%%%

There is little difference for the sequential case, but the old
gbincore() and buffer allocation routines are much faster for the
random case.

With profiling, ~4.10, read size 4K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.63 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 27.3      0.720     0.720        0  100.00%           mcount [8]
 22.5      1.312     0.592   102436     5784     5784  generic_copyout [10]
 12.6      1.643     0.331        0  100.00%           mexitcount [13]
  7.9      1.850     0.207        0  100.00%           cputime [15]
  2.9      1.926     0.076   189410      402      402  gbincore [20]
  2.3      1.988     0.061   348029      176      292  vm_page_wire [18]
  2.2      2.045     0.058    87010      662     2500  allocbuf [14]
  2.0      2.099     0.053   783280       68       68  splvm [22]
  1.6      2.142     0.043        0   99.33%           user [24]
  1.6      2.184     0.042  2041759       20       20  splx [26]
  1.3      2.217     0.034   347298       97      186  vm_page_unwire [21]
  1.2      2.249     0.032    86895      370      370  pmap_qremove [28]
  1.1      2.279     0.029    87006      337     2144  getnewbuf [16]
  0.9      2.303     0.024    86891      280     1617  vfs_vmio_release [17]
%%%

Now the result is little different from -current: the random case is
almost as slow as in -current according to the total time, although
this may be an artifact of profiling (allocbuf takes 2500 nS total in
~4.10 vs 4025 nS in -current).
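For reference, MB/S figures like the above come from simple wall-clock
timing of the read loop; a minimal sketch of such a harness (assumed
file name -- this is not the exact readtest.c code):

%%%
/* Time cached sequential reads and report throughput in MB/S. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
	struct timeval t0, t1;
	static char buf[16384];		/* read size == fs block size */
	double secs, total = 0;
	ssize_t n;
	int fd;

	if ((fd = open("testfile", O_RDONLY)) == -1) {
		perror("open failed");
		exit(1);
	}
	gettimeofday(&t0, NULL);
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;
	gettimeofday(&t1, NULL);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%.0f MB/S\n", total / 1048576.0 / secs);
	return (0);
}
%%%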
Bruce

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 17:07:33 2006
Return-Path:
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 2CFD416A403
	for ; Sat, 23 Dec 2006 17:07:33 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.187])
	by mx1.freebsd.org (Postfix) with ESMTP id BB7A413C44B
	for ; Sat, 23 Dec 2006 17:07:30 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: by nf-out-0910.google.com with SMTP id x37so3577588nfc
	for ; Sat, 23 Dec 2006 09:07:29 -0800 (PST)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com;
	h=received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth;
	b=DJC2zEUnGP9IJOajhK19FGfpnDEVFG4WHD6YVNwC665gX9xmadNOKWw1M0BKJC5F2ZzCnt1WwxKtvPEBmfegm7BVEpIDlzX6S6Zwva94R3RXIVl9lcoQTH2VP2EiYmrZk9+sFhARWIin97KriasCgZ0mGb7Ax0gsA6UqjJzXndY=
Received: by 10.82.136.4 with SMTP id j4mr2296203bud.1166892025490;
	Sat, 23 Dec 2006 08:40:25 -0800 (PST)
Received: by 10.82.178.4 with HTTP; Sat, 23 Dec 2006 08:40:25 -0800 (PST)
Message-ID: <3bbf2fe10612230840u7ffb2855y8d6151d2f24ace4@mail.gmail.com>
Date: Sat, 23 Dec 2006 17:40:25 +0100
From: "Attilio Rao"
Sender: asmrookie@gmail.com
To: "David Xu"
In-Reply-To: <200612231321.52178.davidxu@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <458B3651.8090601@paradise.net.nz>
	<20061222222757.G18486@delplex.bde.org>
	<20061222202933.709d2279@Magellan.Leidinger.net>
	<200612231321.52178.davidxu@freebsd.org>
X-Google-Sender-Auth: 463bc17a13ab91f1
X-Mailman-Approved-At: Sat, 23 Dec 2006 19:22:26 +0000
Cc: Mark Kirkwood , Alexander Leidinger , Adrian Chadd ,
	freebsd-performance@freebsd.org, Bruce Evans
Subject: Re: Cached file read performance
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 23 Dec 2006 17:07:33 -0000

2006/12/23, David Xu :
> On Saturday 23 December 2006 03:29, Alexander Leidinger wrote:
>
> > I want to point out http://www.freebsd.org/projects/ideas/#p-memcpy
> > here. Just in case someone wants to play around a little bit.
> >
> > Bye,
> > Alexander.
>
> I have read the code: if a buffer is not aligned on a 16-byte
> boundary, it will not use the FPU to copy the data, but user buffers
> are not always 16-byte aligned.

If the buffer is not aligned, the speedup is so small as to be nearly
0%.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
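The alignment constraint under discussion can be made concrete with a
small sketch (illustrative only -- this is not the projects/ideas
patch; sse_copy_aligned() is a hypothetical stand-in):

%%%
/*
 * Dispatch between an SSE fast path and a plain copy.  Aligned SSE
 * loads/stores (e.g. movdqa) fault on addresses that are not 16-byte
 * aligned, so the fast path requires both pointers to be aligned.
 */
#include <stdint.h>
#include <string.h>

/* Stand-in for a real SSE copy loop; hypothetical. */
static void *
sse_copy_aligned(void *dst, const void *src, size_t len)
{
	return (memcpy(dst, src, len));	/* real code would use SSE */
}

void *
copy_dispatch(void *dst, const void *src, size_t len)
{
	if ((((uintptr_t)dst | (uintptr_t)src) & 15) == 0)
		return (sse_copy_aligned(dst, src, len));
	/*
	 * Unaligned: fall back to the ordinary copy.  This is the case
	 * described above, where the FPU/SSE path is skipped and the
	 * speedup is therefore close to 0%.
	 */
	return (memcpy(dst, src, len));
}
%%%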