From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 01:49:57 2006
Date: Fri, 22 Dec 2006 14:49:53 +1300
From: Mark Kirkwood <markir@paradise.net.nz>
To: freebsd-performance@freebsd.org
Message-id: <458B39C1.2080906@paradise.net.nz>
In-reply-to: <458B3651.8090601@paradise.net.nz>
Subject: Re: Cached file read performance

Mark Kirkwood wrote:

> Anyway on to the results: I used the attached program to read a cached

Silly bug in the attached program: the lseek failure test has 1 instead
of -1 (finger trouble).

*** readtest.c.orig	Fri Dec 22 14:43:42 2006
--- readtest.c	Fri Dec 22 14:43:24 2006
***************
*** 103,109 ****
      }
    } else {
      offset = (off_t) (random() % (numblocks - 1)) * blocksz;
!     if (lseek(fd, offset, SEEK_SET) == 1) {
        perror("seek failed");
        exit(1);
      }
--- 103,109 ----
      }
    } else {
      offset = (off_t) (random() % (numblocks - 1)) * blocksz;
!     if (lseek(fd, offset, SEEK_SET) == -1) {
        perror("seek failed");
        exit(1);
      }

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 01:50:18 2006
Date: Fri, 22 Dec 2006 14:35:13 +1300
From: Mark Kirkwood <markir@paradise.net.nz>
To: freebsd-performance@freebsd.org
Message-id: <458B3651.8090601@paradise.net.nz>
Subject: Cached file read performance

I recently did some testing on the performance of cached reads using two
(almost identical) systems, one running FreeBSD 6.2PRE and the other
running Gentoo Linux - the latter acting as a control. I initially
started a thread of the same name on -stable, but it was suggested I
submit a mail here.

My background for wanting to examine this is that I work on developing
database software (postgres internals related), and cached read
performance is pretty important - we typically try hard to encourage
cached access whenever possible.

Anyway, on to the results: I used the attached program to read a cached
781MB file sequentially and randomly with a specified block size (see
below). The conclusion I came to was that our (i.e. FreeBSD) cached read
performance (particularly for smaller block sizes) could perhaps be
improved... now I'm happy to help in any way - the machine I've got
running STABLE can be upgraded to CURRENT in order to try out patches
(or in fact to see if CURRENT is faster at this already!)...
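For readers who don't have the attachment handy, a minimal sketch of the
kind of harness described above (not the actual readtest.c; the argument
handling, timing method, and output format are approximations) might look
like this:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/stat.h>

/* Minimal cached-read timer: read every block of the file, either
 * sequentially or at random offsets, and report the byte rate. */
int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s file blocksz sequential(0|1)\n", argv[0]);
		exit(1);
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd == -1) {
		perror("open failed");
		exit(1);
	}
	long blocksz = atol(argv[2]);
	int sequential = atoi(argv[3]);
	struct stat st;
	fstat(fd, &st);
	long numblocks = st.st_size / blocksz;
	char *buf = malloc(blocksz);
	struct timeval start, end;
	gettimeofday(&start, NULL);
	for (long i = 0; i < numblocks; i++) {
		if (!sequential) {
			off_t offset = (off_t) (random() % (numblocks - 1)) * blocksz;
			if (lseek(fd, offset, SEEK_SET) == -1) {	/* note: -1, per the patch */
				perror("seek failed");
				exit(1);
			}
		}
		if (read(fd, buf, blocksz) == -1) {
			perror("read failed");
			exit(1);
		}
	}
	gettimeofday(&end, NULL);
	double elapsed = (end.tv_sec - start.tv_sec) +
	    (end.tv_usec - start.tv_usec) / 1e6;
	printf("%s reads: %ld of: %ld bytes elapsed: %.4fs io rate: %.0f bytes/s\n",
	    sequential ? "sequential" : "random", numblocks, blocksz,
	    elapsed, numblocks * (double)blocksz / elapsed);
	return 0;
}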
Best wishes

Mark

----------------------results-etc---------------------------------

Machines
========

FreeBSD (6.2-PRERELEASE #7: Mon Nov 27 19:32:33 NZDT 2006):
- Supermicro P3TDER
- 2x SL5QL 1.26 GHz PIII
- 2x Kingston PC133 RCC Registered 1GB DIMMs
- 3Ware 7506 4x Maxtor Plus 9 ATA-133 7200 80G
- Kernel GENERIC + SMP
- /etc/malloc.conf -> >aj
- ufs2 32k blocksize, 4K fragments
- RAID0 256K stripe using twe driver

Gentoo (2.6.18-gentoo-r3):
- Supermicro P3TDER
- 2x SL5QL 1.26 GHz PIII
- 2x Kingston PC133 RCC Registered 1GB DIMMs
- Promise TX4000 4x Maxtor Plus 8 ATA-133 7200 40G
- default make CFLAGS (-O2 -march=i686)
- xfs stripe width 2
- RAID0 256K stripe using md driver (software RAID)

Given the tests were about cached I/O, the differences in RAID controller
and the disks themselves were seen as not significant (indeed, booting
the FreeBSD box with the Gentoo livecd and running the tests there
confirmed this).

Results
=======

FreeBSD:
--------

$ ./readtest /data0/dump/file 8192 0
random reads: 100000 of: 8192 bytes elapsed: 4.4477s io rate: 184186327 bytes/s
$ ./readtest /data0/dump/file 8192 1
sequential reads: 100000 of: 8192 bytes elapsed: 1.9797s io rate: 413804878 bytes/s
$ ./readtest /data0/dump/file 32768 0
random reads: 25000 of: 32768 bytes elapsed: 2.0076s io rate: 408040469 bytes/s
$ ./readtest /data0/dump/file 32768 1
sequential reads: 25000 of: 32768 bytes elapsed: 1.7068s io rate: 479965034 bytes/s
$ ./readtest /data0/dump/file 65536 0
random reads: 12500 of: 65536 bytes elapsed: 1.7856s io rate: 458778279 bytes/s
$ ./readtest /data0/dump/file 65536 1
sequential reads: 12500 of: 65536 bytes elapsed: 1.6611s io rate: 493158866 bytes/s

Gentoo:
-------

$ ./readtest /data0/dump/file 8192 0
random reads: 100000 of: 8192 bytes elapsed: 1.2698s io rate: 645155193 bytes/s
$ ./readtest /data0/dump/file 8192 1
sequential reads: 100000 of: 8192 bytes elapsed: 1.1329s io rate: 723129371 bytes/s
$ ./readtest /data0/dump/file 32768 0
random reads: 25000 of: 32768 bytes elapsed: 1.1583s io rate: 707244595 bytes/s
$ ./readtest /data0/dump/file 32768 1
sequential reads: 25000 of: 32768 bytes elapsed: 1.1178s io rate: 732838631 bytes/s
$ ./readtest /data0/dump/file 65536 0
random reads: 12500 of: 65536 bytes elapsed: 1.1478s io rate: 713742417 bytes/s
$ ./readtest /data0/dump/file 65536 1
sequential reads: 12500 of: 65536 bytes elapsed: 1.1012s io rate: 743921133 bytes/s

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 02:08:10 2006
Date: Fri, 22 Dec 2006 10:08:12 +0800
From: David Xu <davidxu@freebsd.org>
To: Mark Kirkwood
Cc: freebsd-performance@freebsd.org
Message-ID: <458B3E0C.6090104@freebsd.org>
In-Reply-To: <458B3651.8090601@paradise.net.nz>
Subject: Re: Cached file read performance
Mark Kirkwood wrote:
> I recently did some testing on the performance of cached reads using
> two (almost identical) systems, one running FreeBSD 6.2PRE and the
> other running Gentoo Linux - the latter acting as a control.
[...]

I suspect that in such a test, memory copying speed will be a key factor.
I don't have numbers to back up my idea, but I think Linux has lots of
tweaks, such as using MMX instructions to copy data.

Regards,
David Xu

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 02:31:42 2006
Date: Fri, 22 Dec 2006 15:31:35 +1300
From: Mark Kirkwood <markir@paradise.net.nz>
To: David Xu
Cc: freebsd-performance@freebsd.org
Message-id: <458B4387.9090409@paradise.net.nz>
In-reply-to: <458B3E0C.6090104@freebsd.org>
Subject: Re: Cached file read performance

David Xu wrote:
> I suspect that in such a test, memory copying speed will be a key
> factor. I don't have numbers to back up my idea, but I think Linux has
> lots of tweaks, such as using MMX instructions to copy data.

David - very interesting - checking the 2.6.18 sources I see
(arch/i386/lib/memcpy.c:7):

void *memcpy(void *to, const void *from, size_t n)
{
#ifdef CONFIG_X86_USE_3DNOW
	return __memcpy3d(to, from, n);
#else
	return __memcpy(to, from, n);
#endif
}

If I understand this correctly, I need CONFIG_X86_USE_3DNOW (or perhaps
CONFIG_M586MMX) set in my Linux kernel config to be using these... which
I don't appear to have (I'll do some more digging and see if maybe
profiling tells us anything useful).

Cheers

Mark

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 04:35:34 2006
Date: Fri, 22 Dec 2006 13:09:48 +0900
From: "Adrian Chadd" <adrian.chadd@gmail.com>
To: "David Xu"
Cc: freebsd-performance@freebsd.org, Mark Kirkwood
Subject: Re: Cached file read performance

On 22/12/06, David Xu <davidxu@freebsd.org> wrote:

> I suspect that in such a test, memory copying speed will be a key
> factor. I don't have numbers to back up my idea, but I think Linux has
> lots of tweaks, such as using MMX instructions to copy data.
I had the opportunity to study the AMD Athlon XP optimisation guide and
noted that their example copy routine, optimised for the chipset, was a
hell of a lot faster than a straight block copy.

Has anyone here done any similar modifications to optimise
copyin/copyout? I can't imagine it'd be a bad thing to have.

Adrian

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 05:44:51 2006
Date: Thu, 21 Dec 2006 23:16:41 -0600
From: Eric Anderson <anderson@centtech.com>
To: Mark Kirkwood
Cc: freebsd-performance@freebsd.org
Message-ID: <458B6A39.5040902@centtech.com>
In-Reply-To: <458B3651.8090601@paradise.net.nz>
Subject: Re: Cached file read performance

On 12/21/06 19:35, Mark Kirkwood wrote:
> I recently did some testing on the performance of cached reads using
> two (almost identical) systems, one running FreeBSD 6.2PRE and the
> other running Gentoo Linux - the latter acting as a control.
[...]
> Given the tests were about cached I/O, the differences in RAID
> controller and the disks themselves were seen as not significant
> (indeed, booting the FreeBSD box with the Gentoo livecd and running
> the tests there confirmed this).

[..snip of useful results..]

Aren't you also slightly testing parts of the file system code? Why not
(since it is only read-only you are interested in) use FreeBSD's xfs
support (only in -CURRENT however) and run the tests there as well? I'm
just curious whether it would make any difference - I would bet not much
of any though.

Eric

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 06:52:21 2006
Date: Fri, 22 Dec 2006 01:26:14 -0500
From: Mike Jakubik <mikej@rogers.com>
To: performance@freebsd.org
Message-ID: <458B7A86.5060908@rogers.com>
Subject: Re: Cached file read performance with 6.2-PRERELEASE

Has anyone tried these tests with 4.x? Well, I did, and I was surprised
how good the performance is: it gave me the highest numbers of all the
tests, even compared to much faster hardware. Although this is all
different hardware, it seems like the performance drops the higher the
version of FreeBSD is, specifically right after 6.1. Is there a
possibility that some performance problem was introduced around that
time?

All tests used a 234M file created with
"dd if=/dev/zero of=/tmp/file bs=8k count=30000".

---

FreeBSD 4.11-STABLE, Pentium(R) 4 CPU 2.60GHz, 512MB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.298992 secs (821962015 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.221009 secs (1111990834 bytes/sec)

FreeBSD 6.1-STABLE, Dual Core AMD Opteron(tm) Processor 170
(2009.27-MHz K8-class CPU), 1GB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.289550 secs (848765132 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.243281 secs (1010190329 bytes/sec)

FreeBSD 6.1-STABLE, Intel(R) Pentium(R) D CPU 3.20GHz
(3118.91-MHz 686-class CPU), 1GB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.354899 secs (692478377 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.285909 secs (859574388 bytes/sec)

FreeBSD 6.2-PRERELEASE, AMD Athlon(tm) 64 Processor 3000+
(2002.58-MHz K8-class CPU), 512MB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.354382 secs (693488872 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.356816 secs (688758249 bytes/sec)

FreeBSD 6.2-PRERELEASE, Intel(R) Pentium(R) 4 CPU 1.80GHz
(1796.94-MHz 686-class CPU), 512MB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.483906 secs (507867448 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.390824 secs (628825123 bytes/sec)
FreeBSD 7.0-CURRENT (all debugging off), AMD Athlon(tm) Processor
(1410.21-MHz 686-class CPU), 512MB

# dd of=/dev/null if=/tmp/file bs=8k
30000+0 records in
30000+0 records out
245760000 bytes transferred in 0.846895 secs (290189464 bytes/sec)
# dd of=/dev/null if=/tmp/file bs=32k
7500+0 records in
7500+0 records out
245760000 bytes transferred in 0.794950 secs (309151516 bytes/sec)

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 11:18:09 2006
Date: Fri, 22 Dec 2006 22:18:04 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Mark Kirkwood
Cc: freebsd-performance@FreeBSD.org
Message-ID: <20061222171431.L18486@delplex.bde.org>
In-Reply-To: <458B3651.8090601@paradise.net.nz>
Subject: Re: Cached file read performance

On Fri, 22 Dec 2006, Mark Kirkwood wrote:

> I recently did some testing on the performance of cached reads using two
> (almost identical) systems, one running FreeBSD 6.2PRE and the other
> running Gentoo Linux - the latter acting as a control. I initially
> started a thread of the same name on -stable, but it was suggested I
> submit a mail here.

Linux has less bloat in the file system and (cached, at least) block i/o
paths, so it won't be competed with fully any time soon. However, the
differences shouldn't be more than a factor of 2.

> conclusion I came to was that our (i.e FreeBSD) cached read performance
> (particularly for smaller block sizes) could perhaps be improved... now I'm

None was attached.

> Machines
> ========
>
> FreeBSD (6.2-PRERELEASE #7: Mon Nov 27 19:32:33 NZDT 2006):
> - Supermicro P3TDER
> - 2x SL5QL 1.26 GHz PIII
> - 2x Kingston PC133 RCC Registered 1GB DIMMs
> - 3Ware 7506 4x Maxtor Plus 9 ATA-133 7200 80G
> - Kernel GENERIC + SMP
> - /etc/malloc.conf -> >aj
> - ufs2 32k blocksize, 4K fragments
    ^^^^^^^^^^^^^^^^^^

Try using an unpessimized block size. Block sizes larger than BKVASIZE
(default 16K) fragment the buffer cache virtual memory. However, I
couldn't see much difference between block sizes of 16, 32 and 64K for a
small (32MB) md-malloced file system with a simple test program. All
versions got nearly 1/4 of the bandwidth of main memory (800MB/s +-10%
on an AthlonXP with ~PC3200 memory).
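(As a rough sanity check on the "fraction of main memory bandwidth"
framing: PC3200 peaks around 3200 MB/s, so 800 MB/s is about 1/4. One can
estimate the copy bandwidth ceiling by timing plain memcpy() over a
buffer much larger than the caches; a minimal sketch, with the size and
iteration count chosen arbitrarily:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Rough main-memory copy bandwidth: copy 64MB (well beyond L2) a few
 * times and report MB/s, counting each byte once as readtest does. */
int main(void)
{
	size_t len = 64 * 1024 * 1024;
	int iters = 8;
	char *src = malloc(len), *dst = malloc(len);
	memset(src, 1, len);	/* fault the pages in first */
	memset(dst, 0, len);
	struct timeval t0, t1;
	gettimeofday(&t0, NULL);
	for (int i = 0; i < iters; i++)
		memcpy(dst, src, len);
	gettimeofday(&t1, NULL);
	double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%.0f MB/s\n", iters * (len / 1048576.0) / s);
	free(src);
	free(dst);
	return 0;
}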
On this system, half of the bandwidth of main memory is (apparently)
unavailable for reads because it has to go through the CPU caches (only
nontemporal writes go at full speed), and another 1/2 of the bandwidth is
lost to system overheads, so 800MB/s is within a factor of 2 of the best
possible.

> - RAID0 256K stripe using twe driver
>
> Gentoo (2.6.18-gentoo-r3):
> - Supermicro P3TDER
> - 2x SL5QL 1.26 GHz PIII
> - 2x Kingston PC133 RCC Registered 1GB DIMMs
> - Promise TX4000 4x Maxtor Plus 8 ATA-133 7200 40G
> - default make CFLAGS (-O2 -march=i686)
> - xfs stripe width 2
> - RAID0 256K stripe using md driver (software RAID)

PIII's and PC133 are very slow these days. I could never get more than a
couple of hundred MB/s main memory copy bandwidth out of PC100. PC133 and
the read bandwidth are not much faster. The read bandwidth on freefall
(800 MHz PIII) with a block size of 4MB is now 500MB/s for my best read
methods.

> Given the tests were about cached I/O, the differences in RAID
> controller and the disks themselves were seen as not significant
> (indeed, booting the FreeBSD box with the Gentoo livecd and running the
> tests there confirmed this).

Yes, if the disk LED blinks then the test is invalid.

> --------
>
> $ ./readtest /data0/dump/file 8192 0
> random reads: 100000 of: 8192 bytes elapsed: 4.4477s io rate: 184186327 bytes/s
> $ ./readtest /data0/dump/file 8192 1
> sequential reads: 100000 of: 8192 bytes elapsed: 1.9797s io rate: 413804878 bytes/s

The speed seems to be limited mainly by main memory bandwidth for
sequential reads and by system overheads for random reads.

> $ ./readtest /data0/dump/file 32768 0
> random reads: 25000 of: 32768 bytes elapsed: 2.0076s io rate: 408040469 bytes/s
> $ ./readtest /data0/dump/file 32768 1
> sequential reads: 25000 of: 32768 bytes elapsed: 1.7068s io rate: 479965034 bytes/s

Now the difference is acceptably small. This also indicates that the
system overhead for random accesses with non-large blocks is too large.

> Gentoo:
> -------
>
> $ ./readtest /data0/dump/file 8192 0
> random reads: 100000 of: 8192 bytes elapsed: 1.2698s io rate: 645155193 bytes/s
> $ ./readtest /data0/dump/file 8192 1
> sequential reads: 100000 of: 8192 bytes elapsed: 1.1329s io rate: 723129371 bytes/s

:-(. I thought that PC133 couldn't go that fast even for a pure memory
benchmark.
Bruce

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 12:38:05 2006
Date: Fri, 22 Dec 2006 23:37:53 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Adrian Chadd
Cc: freebsd-performance@FreeBSD.org, David Xu, Mark Kirkwood
Message-ID: <20061222222757.G18486@delplex.bde.org>
Subject: Re: Cached file read performance

On Fri, 22 Dec 2006, Adrian Chadd wrote:

> On 22/12/06, David Xu <davidxu@freebsd.org> wrote:
>
>> I suspect that in such a test, memory copying speed will be a key
>> factor. I don't have numbers to back up my idea, but I think Linux has
>> lots of tweaks, such as using MMX instructions to copy data.
>
> I had the opportunity to study the AMD Athlon XP optimisation guide and
> noted that their example copy routine, optimised for the chipset, was a
> hell of a lot faster than a straight block copy.
>
> Has anyone here done any similar modifications to optimise
> copyin/copyout? I can't imagine it'd be a bad thing to have.

Sure. It's a larger win mainly in benchmarks. It's a twisty MD maze.
It's a small loss for the MMX method used in Linux (2.6.10 at least) on
the original poster's machine (PIII). The main win is from using
nontemporal writes, but that requires SSE2, and the kernel already uses
these in the most important case (sse2_pagezero(); other cases have
tradeoffs).
Times for some copy methods:

freefall (800MHz PIII), block size 4K (fully cached):
%%%
copy0: 3448148133 B/s ( 29001 us) ( 25037077 tsc) (movsl)
copy1: 1840531252 B/s ( 54332 us) ( 46333183 tsc) (unroll 16)
copy2: 1571211313 B/s ( 63645 us) ( 52383615 tsc) (unroll 16 prefetch)
copy3: 2246932794 B/s ( 44505 us) ( 36824018 tsc) (unroll 64 i586-opt)
copy4: 1970554791 B/s ( 50747 us) ( 43268191 tsc) (unroll 64 i586-opt prefetch)
copy5: 2117741296 B/s ( 47220 us) ( 38651415 tsc) (unroll 64 i586-opx prefetch)
copy6: 1684092760 B/s ( 59379 us) ( 48916219 tsc) (unroll 32 prefetch 2)
copy7: 1506746384 B/s ( 66368 us) ( 54357751 tsc) (unroll 64 fp++)
copy8: 1574228925 B/s ( 63523 us) ( 52241051 tsc) (unroll 128 fp i-prefetch)
copy9: 1579051367 B/s ( 63329 us) ( 51821088 tsc) (unroll 64 fp reordered)
copyA: 1625298552 B/s ( 61527 us) ( 51037242 tsc) (unroll 256 fp reordered++)
copyB: 1633849261 B/s ( 61205 us) ( 50367459 tsc) (unroll 512 fp reordered++)
copyC:  452936367 B/s ( 220781 us) (181557329 tsc) (Terje cksum)
copyD: 1449124640 B/s ( 69007 us) ( 56524152 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE: 1525968138 B/s ( 65532 us) ( 53339199 tsc) (unroll 64 fp i-prefetch++)
copyF: 1513634002 B/s ( 66066 us) ( 54251674 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG: 3389821831 B/s ( 29500 us) ( 23686951 tsc) (memcpy (movsl))
copyK: 3522482088 B/s ( 28389 us) ( 23081104 tsc) (movq)
copyL: 3018586815 B/s ( 33128 us) ( 27671714 tsc) (movq with prefetchnta)
copyM: 3057441649 B/s ( 32707 us) ( 27641525 tsc) (movq with block prefetch)
copya: 2584306603 B/s ( 38695 us) ( 31756341 tsc) (~i686_memcpy (movaps))
%%%

movsl/memcpy is simplest and best here. copyL is like the Linux-2.6.10
_mmx_memcpy(). The latter uses `prefetch', which is older than
prefetchnta and differently unportable. I've never noticed much
difference between these, but the older instruction might work better on
older CPUs like PIII's.

Note that prefetchnta is slower than explicit block prefetch. This
happens on AthlonXP's too, and IIRC the XP optimisation guide points this
out and uses block prefetch for its last and biggest copy optimization.
Note that prefetching is just a loss for the fully cached case. The main
point of interest here is that block prefetch still beats prefetchnta by
an insignificant amount (it might be expected to lose because it takes
more instructions, and the bottleneck in the fully cached case is
instruction execution).

There aren't many methods using XMM registers here because the methods
are limited to ones that work on machines with only plain SSE, and I
couldn't find any such machines where using either MMX or XMM was any
use. copyL uses MMX registers; copya uses XMM registers. copya loses
significantly to movsl/memcpy and copyL.
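(For concreteness: the "movsl" rows in these tables are essentially the
classic rep-string copy. A minimal sketch in GCC-style inline asm for
i386 follows; the kernel's real bcopy is in assembler and also handles
overlap and trailing bytes, and the function name here is illustrative.)

#include <stddef.h>

/* "movsl": copy len bytes as 32-bit words with a rep-string instruction.
 * Assumes len is a multiple of 4 and that the direction flag is clear,
 * as the ABI guarantees. */
static void copy_movsl(void *dst, const void *src, size_t len)
{
	size_t nwords = len / 4;

	__asm__ volatile("rep; movsl"
	    : "+D" (dst), "+S" (src), "+c" (nwords)
	    :
	    : "memory");
}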
freefall, block size 4096K (fully uncached):
%%%
copy0: 199343794 B/s ( 493138 us) (613912221 tsc) (movsl)
copy1: 185455801 B/s ( 530067 us) (636100521 tsc) (unroll 16)
copy2: 181088365 B/s ( 542851 us) (474134548 tsc) (unroll 16 prefetch)
copy3: 183647620 B/s ( 535286 us) (456166441 tsc) (unroll 64 i586-opt)
copy4: 177010464 B/s ( 555357 us) (466214836 tsc) (unroll 64 i586-opt prefetch)
copy5: 176540627 B/s ( 556835 us) (465979821 tsc) (unroll 64 i586-opx prefetch)
copy6: 181682761 B/s ( 541075 us) (457801523 tsc) (unroll 32 prefetch 2)
copy7: 174978240 B/s ( 561807 us) (486332757 tsc) (unroll 64 fp++)
copy8: 192576224 B/s ( 510468 us) (429718012 tsc) (unroll 128 fp i-prefetch)
copy9: 177291074 B/s ( 554478 us) (473659591 tsc) (unroll 64 fp reordered)
copyA: 179384243 B/s ( 548008 us) (476730841 tsc) (unroll 256 fp reordered++)
copyB: 182308792 B/s ( 539217 us) (455082354 tsc) (unroll 512 fp reordered++)
copyC: 132747808 B/s ( 740532 us) (621009558 tsc) (Terje cksum)
copyD: 191875581 B/s ( 512332 us) (434236713 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE: 192663787 B/s ( 510236 us) (430394085 tsc) (unroll 64 fp i-prefetch++)
copyF: 192714776 B/s ( 510101 us) (431859413 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG: 184343619 B/s ( 533265 us) (451905971 tsc) (memcpy (movsl))
copyK: 182133150 B/s ( 539737 us) (479260121 tsc) (movq)
copyL: 185353345 B/s ( 530360 us) (449688860 tsc) (movq with prefetchnta)
copyM: 187979371 B/s ( 522951 us) (442852446 tsc) (movq with block prefetch)
copya: 185523701 B/s ( 529873 us) (465249860 tsc) (~i686_memcpy (movaps))
%%%

movsl/memcpy is still simplest and best. Other methods are only slightly
slower (except copyC, which does a checksum in parallel with read/write;
extra operations combined with copying are free on some machines, but not
here, even in the fully uncached case).
freefall's times may be inaccurate since freefall is loaded, and the
tsc's may be very inaccurate because freefall is SMP, but the following
are very accurate since the machine is unloaded and !SMP.

Athlon XP2600, 193MHz FSB, 8-3-3-2.5 memory (not quite PC3200), block
size 4K:
%%%
copy0:  6492646669 B/s ( 15402 us) ( 34282451 tsc) (movsl)
copy1:  5815290998 B/s ( 17196 us) ( 38282332 tsc) (unroll 16)
copy2:  5099686063 B/s ( 19609 us) ( 44504640 tsc) (unroll 16 prefetch)
copy3:  6580229256 B/s ( 15197 us) ( 33837406 tsc) (unroll 64 i586-opt)
copy4:  6608931597 B/s ( 15131 us) ( 33685684 tsc) (unroll 64 i586-opt prefetch)
copy5:  6620745763 B/s ( 15104 us) ( 33624302 tsc) (unroll 64 i586-opx prefetch)
copy6:  5371132452 B/s ( 18618 us) ( 41448096 tsc) (unroll 32 prefetch 2)
copy7:  7544303584 B/s ( 13255 us) ( 29523386 tsc) (unroll 64 fp++)
copy8:  8178600147 B/s ( 12227 us) ( 28765763 tsc) (unroll 128 fp i-prefetch)
copy9:  9280718701 B/s ( 10775 us) ( 25879055 tsc) (unroll 64 fp reordered)
copyA:  8625128860 B/s ( 11594 us) ( 25817196 tsc) (unroll 256 fp reordered++)
copyB:  8883338723 B/s ( 11257 us) ( 25161370 tsc) (unroll 512 fp reordered++)
copyC:  2927478673 B/s ( 34159 us) ( 76030527 tsc) (Terje cksum)
copyD:  7751918140 B/s ( 12900 us) ( 28727306 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE:  7834514572 B/s ( 12764 us) ( 28403840 tsc) (unroll 64 fp i-prefetch++)
copyF:  7818588272 B/s ( 12790 us) ( 28475409 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG:  6419292849 B/s ( 15578 us) ( 34670640 tsc) (memcpy (movsl))
copyH:  2950106027 B/s ( 33897 us) ( 75467527 tsc) (movntps)
copyI:  2939094286 B/s ( 34024 us) ( 77103399 tsc) (movntps with prefetchnta)
copyJ:  2940477064 B/s ( 34008 us) ( 77144512 tsc) (movntps with block prefetch)
copyK: 11064366453 B/s (  9038 us) ( 20691582 tsc) (movq)
copyL:  9832816519 B/s ( 10170 us) ( 22685200 tsc) (movq with prefetchnta)
copyM:  9853162282 B/s ( 10149 us) ( 22599677 tsc) (movq with block prefetch)
copyN:  2950018998 B/s ( 33898 us) ( 75452984 tsc) (movntq)
copyO:  2933576156 B/s ( 34088 us) ( 77122605 tsc) (movntq with prefetchnta)
copyP:  2885246083 B/s ( 34659 us) ( 77147363 tsc) (movntq with block prefetch)
copyQ:  6749442765 B/s ( 14816 us) ( 32985677 tsc) (movdqa)
copya:  7504108059 B/s ( 13326 us) ( 29680371 tsc) (~i686_memcpy (movaps))
%%%

Now movq is best. It is almost twice as fast as movsl. This is because
movsl only issues 32-bit accesses, and the number of those per cycle has
the same limit as 64-bit accesses, at least for read/write in parallel
(AXP's have some asymmetry for read/write that gets in the way of other
access mixes; A64's are better here). Even the old PI FPU method easily
beats movsl. It was turned off because it was a large loss on PII's.
There are now some SSE+ extensions (movnt*). These use an AthlonXP
extension of SSE. These are just a loss in the fully cached case (and in
all cases for small data, unless you know that the target shouldn't be
cached).
AthlonXP, block size 4096K:
%%%
copy0:  636873680 B/s ( 154354 us) (344356579 tsc) (movsl)
copy1:  649887944 B/s ( 151263 us) (337326810 tsc) (unroll 16)
copy2:  582949855 B/s ( 168632 us) (376274011 tsc) (unroll 16 prefetch)
copy3:  736911544 B/s ( 133400 us) (315267117 tsc) (unroll 64 i586-opt)
copy4:  683944313 B/s ( 143731 us) (320308617 tsc) (unroll 64 i586-opt prefetch)
copy5:  684006179 B/s ( 143718 us) (320114790 tsc) (unroll 64 i586-opx prefetch)
copy6:  656704054 B/s ( 149693 us) (333513466 tsc) (unroll 32 prefetch 2)
copy7:  675350371 B/s ( 145560 us) (324722661 tsc) (unroll 64 fp++)
copy8:  793971554 B/s ( 123813 us) (276326666 tsc) (unroll 128 fp i-prefetch)
copy9:  679120150 B/s ( 144752 us) (322757764 tsc) (unroll 64 fp reordered)
copyA:  650429743 B/s ( 151137 us) (336686142 tsc) (unroll 256 fp reordered++)
copyB:  686849773 B/s ( 143123 us) (318835219 tsc) (unroll 512 fp reordered++)
copyC:  656370811 B/s ( 149769 us) (333826275 tsc) (Terje cksum)
copyD:  777715366 B/s ( 126401 us) (282197950 tsc) (kernel bcopy (unroll 64 fp i-prefetch))
copyE:  779930499 B/s ( 126042 us) (280900317 tsc) (unroll 64 fp i-prefetch++)
copyF:  773888810 B/s ( 127026 us) (283770359 tsc) (new kernel bcopy (unroll 64 fp i-prefetch++))
copyG:  636189490 B/s ( 154520 us) (344278918 tsc) (memcpy (movsl))
copyH: 1056702749 B/s (  93029 us) (207224289 tsc) (movntps)
copyI: 1072590588 B/s (  91651 us) (204188841 tsc) (movntps with prefetchnta)
copyJ: 1395630138 B/s (  70437 us) (156912756 tsc) (movntps with block prefetch)
copyK:  708242075 B/s ( 138800 us) (309879060 tsc) (movq)
copyL:  706770485 B/s ( 139089 us) (311075317 tsc) (movq with prefetchnta)
copyM:  814300625 B/s ( 120722 us) (269160923 tsc) (movq with block prefetch)
copyN: 1076549051 B/s (  91314 us) (203659502 tsc) (movntq)
copyO: 1066898198 B/s (  92140 us) (205514511 tsc) (movntq with prefetchnta)
copyP: 1413551133 B/s (  69544 us) (155496730 tsc) (movntq with block prefetch)
copyQ:  680954822 B/s ( 144362 us) (321945223 tsc) (movdqa)
copya:  710699826 B/s ( 138320 us) (308106574 tsc) (~i686_memcpy (movaps))
%%%

Now the movnt* methods win easily. Block prefetch wins easily over
prefetchnta. (Unlike for PIII's, I know that it is preferred to plain
"prefetch".)

Athlon64's behave significantly differently here (details not shown):
- movsl is still quite slow
- movsq/memcpy has the same speed as movq (MMX) and movq (64-bit integer)
- the memory system is better relative to the CPU, so the fully cached
  case is not so much faster, especially with DDR2
- prefetchnta now wins over block prefetch, since the memory system now
  actually understands prefetchnta
- movnt* is a larger win.

Memcpy (movsq) is simplest and best again unless movnt* is used. amd64
already uses simplest and best methods except for large copyin/copyout's,
where it should probably use movnt*. It is unclear whether a block size
of 8K is large -- in cases where the application actually uses the data,
it may be best not to use movnt*. movnt* for 8K writes is more likely to
be right, since in many cases the kernel's only "use" of the data is to
DMA it to a disk drive, and for that it should never be put in the CPU's
caches.
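(The movntq variants that win here correspond roughly to the sketch
below, written with SSE intrinsics rather than the raw assembler actually
benchmarked; the prefetch distance, alignment handling, and the kernel's
save/restore of FPU state are all glossed over, and the function name is
illustrative.)

#include <xmmintrin.h>	/* SSE intrinsics: movntq, prefetchnta, sfence */

/* Nontemporal 64-bit copy ("movntq with prefetchnta", roughly copyO):
 * the stores bypass the caches, so the destination does not evict other
 * cached data. Assumes 8-byte alignment and len a multiple of 32. */
static void copy_movntq(void *dst, const void *src, size_t len)
{
	__m64 *d = dst;
	const __m64 *s = src;
	size_t i;

	for (i = 0; i < len / 8; i += 4) {
		_mm_prefetch((const char *)(s + i) + 256, _MM_HINT_NTA);
		_mm_stream_pi(d + i + 0, s[i + 0]);
		_mm_stream_pi(d + i + 1, s[i + 1]);
		_mm_stream_pi(d + i + 2, s[i + 2]);
		_mm_stream_pi(d + i + 3, s[i + 3]);
	}
	_mm_sfence();	/* drain the write-combining buffers */
	_mm_empty();	/* emms: release the MMX/FPU state */
}

(Per the tables, whether bypassing the caches wins depends on whether the
destination is read again soon: for data the CPU will consume
immediately, the plain movq loop is the better choice.)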
Bruce

From owner-freebsd-performance@FreeBSD.ORG Fri Dec 22 20:22:30 2006
Date: Fri, 22 Dec 2006 20:29:33 +0100
From: Alexander Leidinger <alexander@leidinger.net>
To: Bruce Evans
Cc: Adrian Chadd, rookie@gufi.org, freebsd-performance@FreeBSD.org, Mark Kirkwood, David Xu
Message-ID: <20061222202933.709d2279@Magellan.Leidinger.net>
In-Reply-To: <20061222222757.G18486@delplex.bde.org>
Subject: Re: Cached file read performance

Quoting Bruce Evans (Fri, 22 Dec 2006 23:37:53 +1100 (EST)):

> On Fri, 22 Dec 2006, Adrian Chadd wrote:
>
>> Has anyone here done any similar modifications to optimise
>> copyin/copyout? I can't imagine it'd be a bad thing to have.
>
> Sure. It's a larger win mainly in benchmarks. It's a twisty MD maze.

I want to point out http://www.freebsd.org/projects/ideas/#p-memcpy
here. Just in case someone wants to play around a little bit.

Bye,
Alexander.

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 01:00:45 2006
Date: Sat, 23 Dec 2006 14:00:33 +1300
From: Mark Kirkwood <markir@paradise.net.nz>
To: Bruce Evans
Cc: freebsd-performance@FreeBSD.org
Message-id: <458C7FB1.9020002@paradise.net.nz>
In-reply-to: <20061222171431.L18486@delplex.bde.org>
Subject: Re: Cached file read performance

Bruce Evans wrote:
>
> None was attached.
>

(meaning the c prog, yes?) I notice that it is stripped out from the web
archive... so here's a link:

http://homepages.paradise.net.nz/markir/download/freebsd/readtest.c

>> Machines
>> ========
>> - ufs2 32k blocksize, 4K fragments
>
> Try using an unpessimized block size. Block sizes larger than BKVASIZE
> (default 16K) fragment the buffer cache virtual memory.

Right - I should have said: I saw a comment to that effect in
src/sys/sys/param.h, so I tested with 8K and 16K too. Interestingly, on
my system 32K seemed to be faster, even for the bigger files (of course,
it is hard to know whether that was really significant...).

> However, I
> couldn't see much difference between block sizes of 16, 32 and 64K for
> a small (32MB) md-malloced file system with a simple test program.
> All versions got nearly 1/4 of the bandwidth of main memory (800MB/s
> +-10% on an AthlonXP with ~PC3200 memory).
Cheers

Mark

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 05:21:57 2006
Date: Sat, 23 Dec 2006 13:21:51 +0800
From: David Xu <davidxu@freebsd.org>
To: freebsd-performance@freebsd.org
Cc: Alexander Leidinger, Adrian Chadd, Mark Kirkwood, Bruce Evans, rookie@gufi.org
Message-Id: <200612231321.52178.davidxu@freebsd.org>
In-Reply-To: <20061222202933.709d2279@Magellan.Leidinger.net>
Subject: Re: Cached file read performance

On Saturday 23 December 2006 03:29, Alexander Leidinger wrote:
> I want to point out http://www.freebsd.org/projects/ideas/#p-memcpy
> here. Just in case someone wants to play around a little bit.

I have read the code; if a buffer is not aligned on a 16-byte boundary,
it will not use the FPU to copy data, but user buffers are not always
16-byte aligned.

David Xu

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 09:38:10 2006
Date: Sat, 23 Dec 2006 20:38:04 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Mark Kirkwood
Cc: freebsd-performance@freebsd.org
Message-ID: <20061223175413.W1116@epsplex.bde.org>
In-Reply-To: <458C7FB1.9020002@paradise.net.nz>
Subject: Re: Cached file read performance

On Sat, 23 Dec 2006, Mark Kirkwood wrote:

> Bruce Evans wrote:
>>
>> None was attached.
>>
>
> (meaning the c prog, yes?) I notice that it is stripped out from the web
> archive...
> so here's a link:
>
> http://homepages.paradise.net.nz/markir/download/freebsd/readtest.c
>
>> However, I couldn't see much difference between block sizes of 16, 32
>> and 64K for a small (32MB) md-malloced file system with a simple test
>> program. All versions got nearly 1/4 of the bandwidth of main memory
>> (800MB/s +-10% on an AthlonXP with ~PC3200 memory).

Now I see the problem with a normal file system. The main difference in
my quick test was probably that 32MB is too small to show the problem:
32MB fits in the buffer cache, but slightly larger files only fit in the
VMIO cache, and the main problem is the interaction of these caches.
This behaviour is easy to understand using kernel profiling.

Part of a profile for the random case (reading 400MB with a block size of
4K -- smaller block sizes make larger differences):

%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.70 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 22.5      0.608     0.608   102466     5933     5933  copyout [13]
 21.2      1.180     0.573        0  100.00%           mcount [14]
 10.2      1.457     0.277        0  100.00%           mexitcount [17]
 10.2      1.733     0.276   450823      612      612  buf_splay [18]
  9.7      1.995     0.262   348917      751      751  vm_page_splay [20]
  5.7      2.148     0.153        0  100.00%           cputime [22]
  2.0      2.202     0.054   348017      154      179  vm_page_unwire [26]
  1.8      2.252     0.050    87127      573     3487  getnewbuf [16]
  1.7      2.298     0.047        0  100.00%           user [29]
  1.3      2.332     0.034    87132      388      388  pmap_qremove [31]
  1.1      2.363     0.031    87127      351     4025  allocbuf [15]
  1.0      2.388     0.026   348505       74      117  vm_page_wire [30]
%%%

Sequential case:

%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.35 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 39.3      0.530     0.530   102443     5178     5178  copyout [11]
 23.7      0.850     0.320        0  100.00%           mcount [12]
 11.5      1.004     0.154        0  100.00%           mexitcount [13]
  6.3      1.090     0.085        0  100.00%           cputime [16]
  3.0      1.130     0.040   102816      389      389  vm_page_splay [19]
  1.6      1.151     0.021   409846       52       59  _lockmgr [22]
  1.3      1.168     0.017        0  100.00%           user [23]
  0.9      1.180     0.012   102617      117      117  buf_splay [26]
...
  0.7      1.200     0.009    25603      356     1553  getnewbuf [20]
...
  0.6      1.208     0.009    25603      337     2197  allocbuf [17]
...
  0.6      1.224     0.008   101915       78       96  vm_page_unwire [29]
...
  0.5      1.239     0.007    25608      274      274  pmap_qremove [32]
...
  0.2      1.316     0.002   102409       20       35  vm_page_wire [44]
%%%

It is a buffer-cache/vm problem like I suspected. The file system block
size is 16K, so with a read size of 4K, random reads allocate a new
buffer about 16K/4K = 4 times more often than sequential reads.
Allocation involves vm stuff which is very expensive (it takes about 1.25
times as long as the actual copying). I believe it was even more
expensive before it used splay trees.
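(To put numbers on that ratio from the profiles above: both runs issue
102,400 reads of 4K, matching the ~102,466 copyout() calls. Sequentially,
a new 16K buffer is needed only once per 16K/4K = 4 reads, matching the
~25,603 getnewbuf()/allocbuf() calls in the sequential profile; randomly,
nearly every read misses, giving ~87,127 calls, i.e. about 3.4 times as
many trips through the expensive vm path.)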
More details from separate runs:

Random:
%%%
-----------------------------------------------
                0.00    0.00       5/102404      breadn [237]
                0.01    0.84  102399/102404      cluster_read [10]
[11]    31.3    0.01    0.84  102404         getblk [11]
                0.03    0.32   87126/87126       allocbuf [15]
                0.05    0.25   87126/87126       getnewbuf [16]
                0.01    0.12  189530/189530      gbincore [23]
                0.00    0.06   87126/87126       bgetvp [27]
                0.00    0.00   15278/409852      _lockmgr [32]
                0.00    0.00   15278/15278       bremfree [144]
-----------------------------------------------
%%%

Sequential:
%%%
-----------------------------------------------
                0.00    0.00       6/102404      breadn [371]
                0.01    0.12  102398/102404      cluster_read [14]
[15]     9.5    0.01    0.12  102404         getblk [15]
                0.01    0.05   25603/25605       allocbuf [17]
                0.01    0.03   25603/25603       getnewbuf [18]
                0.00    0.01  128007/128007      gbincore [31]
                0.00    0.00   76801/409846      _lockmgr [22]
                0.00    0.00   25603/25603       bgetvp [39]
                0.00    0.00   76801/76801       bremfree [66]
-----------------------------------------------
%%%

getblk() is called the same number of times for each. In the sequential
case it uses a previously allocated buffer (almost always one allocated
just before) with a probability of almost exactly 0.75, but in the random
case it uses a previously allocated buffer with a probability of about
0.13. The second probability is only larger than epsilon because there is
a buffer pool with a size of a few thousand. Sometimes you get a hit in
this pool, but for large working data sets mostly you don't; then the
buffer must be constituted from vm (or the disk).

This problem is (now) fairly small because most working data sets aren't
large compared with the buffer pool. It was much larger 10 years ago when
the size of the buffer pool was only a few hundred. It was much larger
still more than about 12 years ago in FreeBSD, before the buffer cache
was merged with vm. Then there was only the buffer pool with nothing
between it and the disk, and it was too small.

Linux might not have this problem because it is still using a simple and
better buffer cache. At least 10-15 years ago, its buffer cache had a
fixed block size of 1K where FreeBSD's buffer cache had a variable block
size with the usual size equal to the ffs file system block size of 4K or
8K. With a block size of 1K, at least 4 times as many buffers are needed
to compete on storage with a block size of 4K, and the buffer allocation
routines need to be at least 4 times as efficient to compete on
efficiency. Linux actually had a much larger multiple than 4 for the
storage. I'm not sure about the efficiency factor, but it wasn't too bad
(any in-memory buffer management is better than waiting for the disk, and
the small fixed size of 1K is much easier to manage than larger, variable
sizes).

The FreeBSD buffer management was and is especially unsuited to file
systems with small block sizes like msdosfs floppies (512-blocks) and the
original version of Linux's extfs (1K-blocks). With a buffer cache (pool)
size of 256, you could manage a whole 128KB comprised of 512-blocks, and
got enormous thrashing when accessing a 1200KB floppy. With vm backing
and a buffer cache size of a few thousand, the thrashing only occurs in
memory, and a 1200KB floppy now barely fits in the buffer cache (pool).
Also, no one uses 1200KB floppies. More practically, this problem makes
msdosfs on hard disks (normally 4K-blocks) and ext2fs on hard disks (1K
or 4K blocks) slower than they should be under FreeBSD. vm backing and
clustering masks only some of the slowness.
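(For contrast with the splay-based lookups profiled above, the classic
fixed-block buffer cache described here indexes buffers with a simple
hash on (device, block number). A schematic sketch -- not actual Linux or
FreeBSD code; the names, sizes, and hash function are all illustrative:)

#include <stddef.h>
#include <stdint.h>

/* Schematic old-style buffer cache: fixed 1K buffers hashed by
 * (device, block number). Lookup is a short chain walk, O(1) expected,
 * versus ~1us per buf_splay() call in the profiles above. */
#define BUFHASH_SIZE	1024	/* power of two, illustrative */

struct buf {
	uint32_t	b_dev;
	uint32_t	b_blkno;
	struct buf	*b_hashnext;
	char		b_data[1024];	/* fixed 1K block */
};

static struct buf *bufhash[BUFHASH_SIZE];

static struct buf *
incore_lookup(uint32_t dev, uint32_t blkno)
{
	unsigned h = (dev ^ (blkno * 2654435761u)) & (BUFHASH_SIZE - 1);
	struct buf *bp;

	for (bp = bufhash[h]; bp != NULL; bp = bp->b_hashnext)
		if (bp->b_dev == dev && bp->b_blkno == blkno)
			return bp;
	return NULL;	/* caller must constitute the buffer from vm/disk */
}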
The problem becomes smaller as the read block size approaches the file
system block size, and vanishes when the sizes are identical. Then
there is apparently a different (smaller) problem:

Read size 16K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 49.1      0.565     0.565    25643    22037    22037  copyout [11]
 12.6      0.710     0.145        0  100.00%           mcount [14]
  8.8      0.811     0.101    87831     1153     1153  vm_page_splay [17]
  7.0      0.892     0.081   112906      715      715  buf_splay [19]
  6.1      0.962     0.070        0  100.00%           mexitcount [20]
  3.4      1.000     0.039        0  100.00%           cputime [22]
  1.2      1.013     0.013    86883      153      181  vm_page_unwire [28]
  1.1      1.027     0.013        0  100.00%           user [29]
  1.1      1.040     0.013    21852      595     3725  getnewbuf [18]
%%%

Read size 16K, sequential:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 57.1      0.550     0.550    25643    21464    21464  copyout [11]
 14.2      0.687     0.137        0  100.00%           mcount [12]
  6.9      0.754     0.066        0  100.00%           mexitcount [15]
  4.2      0.794     0.040   102830      391      391  vm_page_splay [19]
  3.8      0.830     0.037        0  100.00%           cputime [20]
  1.4      0.844     0.013   102588      130      130  buf_splay [22]
  1.3      0.856     0.012    25603      488     1920  getnewbuf [17]
  1.0      0.866     0.009    25606      368      368  pmap_qremove [24]
%%%

Now the splay routines are called almost the same number of times, but
take much longer in the random case. buf_splay() seems to be unrelated
to vm -- it is called from gbincore() even if the buffer is already in
the buffer cache. It seems quite slow for that -- almost 1 uS just to
look up a buffer, compared with 21 uS to copyout a 16K buffer. Copying
out a Linux-sized (1K) buffer would take only about 1.5 uS, and then
1 uS to look it up is clearly too much. Another benchmark shows
gbincore() taking 501 nS per call to look up the 64 in-buffer-cache
buffers of a 1MB file -- this must be the best case for it (all these
times are for -current on an Athlon XP2700 overclocked to 2025MHz). The
generic hash function used in my compiler takes 40 nS to hash a 16-byte
string on this machine.

The merged vm/buffer cache is clearly implemented suboptimally. Direct
access to VMIO pages might be better, but it isn't clear how to
implement it without getting the slowest parts of vm for all accesses.
The buffer cache is now essentially just a cache of vm mappings, with
vm mapping being so slow that it needs to be cached. The last thing you
want to do is throw away this cache and have to do a slow mapping for
every access. I think the correct method is to wait for larger virtual
address spaces (already here) and use sparse mappings more.
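For comparison, a hash-style in-core lookup of the kind those numbers
argue for might look like the sketch below (hypothetical types and
names -- this is not the kernel's gbincore()):

%%%
/*
 * Sketch of an O(1) hashed lookup for in-core buffers.  Unlike a
 * splay-tree lookup, it performs no restructuring writes, so random
 * lookups stay cheap and cache-friendly.
 */
#include <stddef.h>
#include <stdint.h>

#define	BUFHASH_SIZE	4096		/* power of 2; an assumption */

struct xvnode;				/* stand-in for the vnode type */

struct xbuf {
	struct xbuf	*b_hashnext;	/* hash chain link */
	struct xvnode	*b_vp;		/* vnode the buffer belongs to */
	long		 b_lblkno;	/* logical block number */
};

static struct xbuf *bufhash[BUFHASH_SIZE];

static size_t
bufhashfn(struct xvnode *vp, long lblkno)
{
	/* Any cheap mix of vnode and block number will do. */
	return (((uintptr_t)vp / 64 + (size_t)lblkno) &
	    (BUFHASH_SIZE - 1));
}

static struct xbuf *
incore_lookup(struct xvnode *vp, long lblkno)
{
	struct xbuf *bp;

	for (bp = bufhash[bufhashfn(vp, lblkno)]; bp != NULL;
	    bp = bp->b_hashnext)
		if (bp->b_vp == vp && bp->b_lblkno == lblkno)
			return (bp);
	return (NULL);			/* not in core */
}
%%%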
Bruce

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 10:27:42 2006
Return-Path:
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id A876A16A416
	for ; Sat, 23 Dec 2006 10:27:42 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226])
	by mx1.freebsd.org (Postfix) with ESMTP id 464D913C448
	for ; Sat, 23 Dec 2006 10:27:42 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162])
	by mailout2.pacific.net.au (Postfix) with ESMTP id 9055010994D;
	Sat, 23 Dec 2006 21:27:39 +1100 (EST)
Received: from epsplex.bde.org (katana.zip.com.au [61.8.7.246])
	by mailproxy1.pacific.net.au (Postfix) with ESMTP id 567BC8C0A;
	Sat, 23 Dec 2006 21:27:39 +1100 (EST)
Date: Sat, 23 Dec 2006 21:27:38 +1100 (EST)
From: Bruce Evans
X-X-Sender: bde@epsplex.bde.org
To: Mark Kirkwood
In-Reply-To: <20061223175413.W1116@epsplex.bde.org>
Message-ID: <20061223205324.B1533@epsplex.bde.org>
References: <458B3651.8090601@paradise.net.nz>
	<20061222171431.L18486@delplex.bde.org>
	<458C7FB1.9020002@paradise.net.nz>
	<20061223175413.W1116@epsplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-performance@freebsd.org
Subject: Re: Cached file read performance
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 23 Dec 2006 10:27:42 -0000

On Sat, 23 Dec 2006, I wrote:

> The problem becomes smaller as the read block size approaches the file
> system block size, and vanishes when the sizes are identical. Then
> there is apparently a different (smaller) problem:
>
> Read size 16K, random:
> %%%
> granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ns/call  ns/call  name
>  49.1      0.565     0.565    25643    22037    22037  copyout [11]
>  12.6      0.710     0.145        0  100.00%           mcount [14]
>   8.8      0.811     0.101    87831     1153     1153  vm_page_splay [17]
>   7.0      0.892     0.081   112906      715      715  buf_splay [19]
>   6.1      0.962     0.070        0  100.00%           mexitcount [20]
>   3.4      1.000     0.039        0  100.00%           cputime [22]
>   1.2      1.013     0.013    86883      153      181  vm_page_unwire [28]
>   1.1      1.027     0.013        0  100.00%           user [29]
>   1.1      1.040     0.013    21852      595     3725  getnewbuf [18]
> %%%
>
> Read size 16K, sequential:
> %%%
> granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ns/call  ns/call  name
>  57.1      0.550     0.550    25643    21464    21464  copyout [11]
>  14.2      0.687     0.137        0  100.00%           mcount [12]
>   6.9      0.754     0.066        0  100.00%           mexitcount [15]
>   4.2      0.794     0.040   102830      391      391  vm_page_splay [19]
>   3.8      0.830     0.037        0  100.00%           cputime [20]
>   1.4      0.844     0.013   102588      130      130  buf_splay [22]
>   1.3      0.856     0.012    25603      488     1920  getnewbuf [17]
>   1.0      0.866     0.009    25606      368      368  pmap_qremove [24]
> %%%
>
> Now the splay routines are called almost the same number of times, but
> take much longer in the random case. buf_splay() seems to be unrelated
> to vm -- it is called from gbincore() even if the buffer is already in
> the buffer cache. It seems quite slow for that -- almost 1 uS just to
> look up a buffer, compared with 21 uS to copyout a 16K buffer.
> Copying out a Linux-sized (1K) buffer would take only about 1.5 uS,
> and then 1 uS to look it up is clearly too much. Another benchmark
> shows gbincore() taking 501 nS per call to look up the 64
> in-buffer-cache buffers of a 1MB file -- this must be the best case
> for it (all these times are for -current on an Athlon XP2700
> overclocked to 2025MHz). The generic hash function used in my
> compiler takes 40 nS to hash a 16-byte string on this machine.

FreeBSD-~4.10 is faster. The difference is especially noticeable when
the read size is the same as the fs block size (16K, as above). Then I
get the following speeds:

    ~4.10, random:      580MB/S
    ~4.10, sequential:  580MB/S
    ~5.2,  random:      575MB/S
    ~5.2,  sequential:  466MB/S

All with kernel profiling not configured, and no INVARIANTS etc. ~5.2
is quite different from -current, but it has buf_splay() and
vm_page_splay(), and behaves similarly in this benchmark.

With profiling, ~4.10, read size 16K, sequential (the numbers in
parentheses are from the corresponding random run):
%%%
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 51.1      0.547     0.547    25643    21323    21323  generic_copyout [9]
 17.3      0.732     0.185        0  100.00%           mcount [10]
  7.9      0.817     0.085        0  100.00%           mexitcount [13]
  5.0      0.870     0.053        0  100.00%           cputime [16]
  1.9      0.891     0.020    51207      395      395  gbincore [20] (424 for random)
  1.4      0.906     0.015   102418      150      253  vm_page_wire [18] (322)
  1.3      0.920     0.014   231218       62       62  splvm [23]
  1.3      0.934     0.014    25603      541     2092  allocbuf [15] (2642)
  1.0      0.945     0.010   566947       18       18  splx [25]
  1.0      0.955     0.010   102122      100      181  vm_page_unwire [21]
  0.9      0.964     0.009    25606      370      370  pmap_qremove [27]
  0.9      0.973     0.009    25603      359     2127  getnewbuf [14] (2261)
%%%

There is little difference for the sequential case, but the old
gbincore() and buffer allocation routines are much faster for the
random case.

With profiling, ~4.10, read size 4K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.63 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 27.3      0.720     0.720        0  100.00%           mcount [8]
 22.5      1.312     0.592   102436     5784     5784  generic_copyout [10]
 12.6      1.643     0.331        0  100.00%           mexitcount [13]
  7.9      1.850     0.207        0  100.00%           cputime [15]
  2.9      1.926     0.076   189410      402      402  gbincore [20]
  2.3      1.988     0.061   348029      176      292  vm_page_wire [18]
  2.2      2.045     0.058    87010      662     2500  allocbuf [14]
  2.0      2.099     0.053   783280       68       68  splvm [22]
  1.6      2.142     0.043        0   99.33%           user [24]
  1.6      2.184     0.042  2041759       20       20  splx [26]
  1.3      2.217     0.034   347298       97      186  vm_page_unwire [21]
  1.2      2.249     0.032    86895      370      370  pmap_qremove [28]
  1.1      2.279     0.029    87006      337     2144  getnewbuf [16]
  0.9      2.303     0.024    86891      280     1617  vfs_vmio_release [17]
%%%

Now the result is little different from -current: the random case is
almost as slow as in -current according to the total time, although
this may be an artifact of profiling (allocbuf takes 2500 nS total in
~4.10 vs 4025 nS in -current).
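For reference, MB/S figures like the above come from simple wall-clock
timing of the read loop; a minimal sketch of such a harness (assumed
file name -- this is not the exact readtest.c code):

%%%
/* Time cached sequential reads and report throughput in MB/S. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int
main(void)
{
	struct timeval t0, t1;
	static char buf[16384];		/* read size == fs block size */
	double secs, total = 0;
	ssize_t n;
	int fd;

	if ((fd = open("testfile", O_RDONLY)) == -1) {
		perror("open failed");
		exit(1);
	}
	gettimeofday(&t0, NULL);
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;
	gettimeofday(&t1, NULL);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%.0f MB/S\n", total / 1048576.0 / secs);
	return (0);
}
%%%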
Bruce

From owner-freebsd-performance@FreeBSD.ORG Sat Dec 23 17:07:33 2006
Return-Path:
X-Original-To: freebsd-performance@freebsd.org
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 2CFD416A403
	for ; Sat, 23 Dec 2006 17:07:33 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from nf-out-0910.google.com (nf-out-0910.google.com [64.233.182.187])
	by mx1.freebsd.org (Postfix) with ESMTP id BB7A413C44B
	for ; Sat, 23 Dec 2006 17:07:30 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: by nf-out-0910.google.com with SMTP id x37so3577588nfc
	for ; Sat, 23 Dec 2006 09:07:29 -0800 (PST)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com;
	h=received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth;
	b=DJC2zEUnGP9IJOajhK19FGfpnDEVFG4WHD6YVNwC665gX9xmadNOKWw1M0BKJC5F2ZzCnt1WwxKtvPEBmfegm7BVEpIDlzX6S6Zwva94R3RXIVl9lcoQTH2VP2EiYmrZk9+sFhARWIin97KriasCgZ0mGb7Ax0gsA6UqjJzXndY=
Received: by 10.82.136.4 with SMTP id j4mr2296203bud.1166892025490;
	Sat, 23 Dec 2006 08:40:25 -0800 (PST)
Received: by 10.82.178.4 with HTTP; Sat, 23 Dec 2006 08:40:25 -0800 (PST)
Message-ID: <3bbf2fe10612230840u7ffb2855y8d6151d2f24ace4@mail.gmail.com>
Date: Sat, 23 Dec 2006 17:40:25 +0100
From: "Attilio Rao"
Sender: asmrookie@gmail.com
To: "David Xu"
In-Reply-To: <200612231321.52178.davidxu@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <458B3651.8090601@paradise.net.nz>
	<20061222222757.G18486@delplex.bde.org>
	<20061222202933.709d2279@Magellan.Leidinger.net>
	<200612231321.52178.davidxu@freebsd.org>
X-Google-Sender-Auth: 463bc17a13ab91f1
X-Mailman-Approved-At: Sat, 23 Dec 2006 19:22:26 +0000
Cc: Mark Kirkwood , Alexander Leidinger , Adrian Chadd ,
	freebsd-performance@freebsd.org, Bruce Evans
Subject: Re: Cached file read performance
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 23 Dec 2006 17:07:33 -0000

2006/12/23, David Xu :
> On Saturday 23 December 2006 03:29, Alexander Leidinger wrote:
>
> > I want to point out http://www.freebsd.org/projects/ideas/#p-memcpy
> > here. Just in case someone wants to play around a little bit.
> >
> > Bye,
> > Alexander.
>
> I have read the code: if a buffer is not aligned on a 16-byte
> boundary, it will not use the FPU to copy the data, but user buffers
> are not always 16-byte aligned.

If the buffer is not aligned, the speedup is so small as to be nearly
0%.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
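The alignment constraint under discussion can be made concrete with a
small sketch (illustrative only -- this is not the projects/ideas
patch; sse_copy_aligned() is a hypothetical stand-in):

%%%
/*
 * Dispatch between an SSE fast path and a plain copy.  Aligned SSE
 * loads/stores (e.g. movdqa) fault on addresses that are not 16-byte
 * aligned, so the fast path requires both pointers to be aligned.
 */
#include <stdint.h>
#include <string.h>

/* Stand-in for a real SSE copy loop; hypothetical. */
static void *
sse_copy_aligned(void *dst, const void *src, size_t len)
{
	return (memcpy(dst, src, len));	/* real code would use SSE */
}

void *
copy_dispatch(void *dst, const void *src, size_t len)
{
	if ((((uintptr_t)dst | (uintptr_t)src) & 15) == 0)
		return (sse_copy_aligned(dst, src, len));
	/*
	 * Unaligned: fall back to the ordinary copy.  This is the case
	 * described above, where the FPU/SSE path is skipped and the
	 * speedup is therefore close to 0%.
	 */
	return (memcpy(dst, src, len));
}
%%%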