From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 08:00:46 2009
Date: Sun, 5 Jul 2009 10:00:44 +0200
From: Gary Jennejohn <gary.jennejohn@freenet.de>
To: Alexander Motin
Cc: freebsd-arch@freebsd.org
Message-ID: <20090705100044.4053e2f9@ernst.jennejohn.org>
In-Reply-To: <4A4FAA2D.3020409@FreeBSD.org>
Subject: Re: DFLTPHYS vs MAXPHYS

On Sat, 04 Jul 2009 22:14:53 +0300
Alexander Motin wrote:

> Can somebody explain to me the difference between the DFLTPHYS and
> MAXPHYS constants? As I understand it, the latter is the maximal
> amount of memory that can be mapped into the kernel, or passed to the
> hardware drivers. But why then is DFLTPHYS used in so many places, and
> what does it mean?

There's a pretty good comment on these in /sys/conf/NOTES.

> Isn't it time to review their values with an eye to increasing them?
> 64KB looks funny compared to modern memory sizes and data rates. It
> just increases interrupt rates, and I don't think it really needs to
> be so small to improve interactivity now.

Probably historical from the days when memory was scarce.

There's nothing preventing the user from upping these values in his
kernel config file. But note the warning in NOTES about possibly
making the kernel unbootable. It's not clear whether this warning is
still valid given today's larger memory footprints and the improved
VM system.

I wonder whether all drivers can correctly handle larger values for
DFLTPHYS.
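For instance, upping them might look like this in a kernel config file
(the option names are the ones used in /sys/conf/NOTES; the values
below are only illustrative):

    # Raise the physical I/O limits; heed the warning in /sys/conf/NOTES,
    # since not every driver copes with larger values.
    options     DFLTPHYS=(128*1024)     # default raw I/O transfer size
    options     MAXPHYS=(512*1024)      # maximum raw I/O transfer size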
--- Gary Jennejohn

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 08:38:38 2009
Date: Sun, 05 Jul 2009 11:38:23 +0300
From: Alexander Motin <mav@FreeBSD.org>
To: gary.jennejohn@freenet.de
Cc: freebsd-arch@freebsd.org
Message-ID: <4A50667F.7080608@FreeBSD.org>
In-Reply-To: <20090705100044.4053e2f9@ernst.jennejohn.org>
Subject: Re: DFLTPHYS vs MAXPHYS

Gary Jennejohn wrote:
> On Sat, 04 Jul 2009 22:14:53 +0300
> Alexander Motin wrote:
>
>> Can somebody explain to me the difference between the DFLTPHYS and
>> MAXPHYS constants? As I understand it, the latter is the maximal
>> amount of memory that can be mapped into the kernel, or passed to the
>> hardware drivers. But why then is DFLTPHYS used in so many places,
>> and what does it mean?
>
> There's a pretty good comment on these in /sys/conf/NOTES.

But it does not explain why.

>> Isn't it time to review their values with an eye to increasing them?
>> 64KB looks funny compared to modern memory sizes and data rates. It
>> just increases interrupt rates, and I don't think it really needs to
>> be so small to improve interactivity now.
>
> Probably historical from the days when memory was scarce.
>
> There's nothing preventing the user from upping these values in his
> kernel config file. But note the warning in NOTES about possibly
> making the kernel unbootable. It's not clear whether this warning is
> still valid given today's larger memory footprints and the improved
> VM system.
>
> I wonder whether all drivers can correctly handle larger values for
> DFLTPHYS.

There will always be drivers/devices with limitations. They should
just be able to report those limitations to the system. This is
possible with GEOM, but it doesn't look well tuned for all providers.
There are many places where DFLTPHYS is used just in the hope that it
will work. IMHO, if a driver is unable to adapt to any defined
DFLTPHYS value, it should not use it, but should instead announce the
specific value that it really supports.
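As a sketch of what announcing a real limit could look like for a disk
driver using the disk(9) interface -- the "xxd" name, the helper's
shape and the 128K limit are invented for illustration:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <geom/geom_disk.h>

    /*
     * Hypothetical attach path for a disk whose controller handles at
     * most 128K per transfer.  Announcing the real limit in d_maxsize
     * lets GEOM split larger requests to fit, instead of the driver
     * hoping that DFLTPHYS happens to be small enough.
     */
    static void
    xxd_create(int unit, off_t mediasize, disk_strategy_t *strategy)
    {
        struct disk *dp;

        dp = disk_alloc();
        dp->d_name = "xxd";
        dp->d_unit = unit;
        dp->d_strategy = strategy;
        dp->d_sectorsize = 512;
        dp->d_mediasize = mediasize;
        dp->d_maxsize = 128 * 1024;     /* hardware limit, not DFLTPHYS */
        disk_create(dp, DISK_VERSION);
    }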
-- Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 14:11:55 2009
Date: Mon, 6 Jul 2009 00:11:51 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin
Cc: gary.jennejohn@freenet.de, freebsd-arch@FreeBSD.org
Message-ID: <20090705223126.I42918@delplex.bde.org>
In-Reply-To: <4A50667F.7080608@FreeBSD.org>
Subject: Re: DFLTPHYS vs MAXPHYS

On Sun, 5 Jul 2009, Alexander Motin wrote:

> Gary Jennejohn wrote:
>> On Sat, 04 Jul 2009 22:14:53 +0300
>> Alexander Motin wrote:
>>
>>> Can somebody explain to me the difference between the DFLTPHYS and
>>> MAXPHYS constants? As I understand it, the latter is the maximal
>>> amount of memory that can be mapped into the kernel, or passed to
>>> the hardware drivers. But why then is DFLTPHYS used in so many
>>> places, and what does it mean?
>>
>> There's a pretty good comment on these in /sys/conf/NOTES.
>
> But it does not explain why.

DFLTPHYS is the default -- the size to be used when the correct size
is not known. However, this is mostly broken:
- the correct size should always be known at a low level. You have to
  know the maximum size for a device to know that this size is larger
  than the default, else using the default size won't work. Also, you
  have to know that the default size is a multiple of the minimum
  size. Both of these are usually true accidentally, so things sort of
  work.
- the default size is defaulted inconsistently. Geom hides the device
  maximum i/o size (d_maxsize, which is normally either 64K or
  DFLTPHYS, which happen to be the same) from the top level of devices
  (it reblocks if necessary so that sizes up to si_iosize_max, which
  is always MAXPHYS, work), so it is difficult to see the low-level
  size, or to use an i/o size that is a multiple of the device maximum
  i/o size if the latter is not a divisor of MAXPHYS. This means that
  hard-coding MAXPHYS would work best in most places above the driver
  level, but most places have a mess of buggy layering (mnt_iosize_max
  is supposed to default to DFLTPHYS and then be changed to
  si_iosize_max when the latter is known, but some file systems forget
  to do this).

>>> Isn't it time to review their values with an eye to increasing
>>> them? 64KB looks funny compared to modern memory sizes and data
>>> rates.
>>> It just increases interrupt rates, and I don't think it really
>>> needs to be so small to improve interactivity now.

64K is large enough to bust modern L1 caches and old L2 caches. Make
the size bigger and it busts modern L2 caches too. Interrupt rates
don't matter when you are transferring 64K items per interrupt.

>> I wonder whether all drivers can correctly handle larger values for
>> DFLTPHYS.

Most can't, since their hardware can't. They can fake it (ata used
to), but there is negative value in this for most drivers, since geom
already reblocks for disk devices, and reblocking would be wrong for
devices like tapes.

> There will always be drivers/devices with limitations. They should
> just be able to report those limitations to the system. This is
> possible with GEOM, but it doesn't look well tuned for all providers.
> There are many places where DFLTPHYS is used just in the hope that it
> will work. IMHO, if a driver is unable to adapt to any defined
> DFLTPHYS value, it should not use it, but should instead announce the
> specific value that it really supports.

cam scsi devices seem to be the only important ones that still
hard-code d_maxsize to DFLTPHYS.

Strangely, pre-cam scsi had the beginnings (or remnants) of more
sophisticated i/o size limiting. In FreeBSD-1, it had an xxminphys()
function for every scsi device. I think it was supposed to be possible
to ask any device for any i/o size, and minphys was used for
reblocking at a low level. minphys was only implemented for scsi
drivers and wasn't part of physio() as in Net/2 (?). For the aha1542
driver, minphys was:

% void
% ahaminphys(bp)
% 	struct buf *bp;
% {
% 	/* aha seems to explode with 17 segs (64k may require 17 segs) */
% 	/* on old boards so use a max of 16 segs if you have problems here */
% 	if (bp->b_bcount > ((AHA_NSEG - 1) * PAGESIZ)) {
% 		bp->b_bcount = ((AHA_NSEG - 1) * PAGESIZ);
% 	}
% }

FreeBSD-1 didn't have DFLTPHYS, and barely used MAXPHYS. MAXPHYS was
64K. I think MAXBSIZE = 64K limited most transfers. However, physio()
used a buffer of size 256K, larger than it does today!, so apparently
device drivers were responsible for lots of reblocking. In the wd
driver, the reblocking consisted of doing one 512-byte block at a time
(I think it didn't even do multiple sectors per interrupt then).
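For comparison, the modern analogue of minphys for a raw character
device is the cdev's si_iosize_max field, which physio() consults to
split oversized requests; a minimal sketch, with the entry points and
the 64K limit invented for illustration:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/conf.h>
    #include <sys/bio.h>
    #include <sys/uio.h>

    /*
     * Hypothetical raw read entry point: physio() splits the user's
     * request into chunks no larger than si_iosize_max, much as
     * minphys once clamped b_bcount.
     */
    static int
    xx_read(struct cdev *dev, struct uio *uio, int ioflag)
    {
        return (physio(dev, uio, ioflag));
    }

    static void
    xx_announce_limit(struct cdev *dev)
    {
        /* Announce the real per-transfer hardware limit (assumed 64K). */
        dev->si_iosize_max = 64 * 1024;
    }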
Bruce

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 14:16:13 2009
Date: Sun, 5 Jul 2009 16:16:10 +0200
From: Gary Jennejohn <gary.jennejohn@freenet.de>
To: Alexander Motin
Cc: freebsd-arch@freebsd.org
Message-ID: <20090705161610.52e01954@ernst.jennejohn.org>
In-Reply-To: <4A50667F.7080608@FreeBSD.org>
Subject: Re: DFLTPHYS vs MAXPHYS

On Sun, 05 Jul 2009 11:38:23 +0300
Alexander Motin wrote:

> Gary Jennejohn wrote:
>> I wonder whether all drivers can correctly handle larger values for
>> DFLTPHYS.
>
> There will always be drivers/devices with limitations. They should
> just be able to report those limitations to the system. This is
> possible with GEOM, but it doesn't look well tuned for all providers.
> There are many places where DFLTPHYS is used just in the hope that it
> will work. IMHO, if a driver is unable to adapt to any defined
> DFLTPHYS value, it should not use it, but should instead announce the
> specific value that it really supports.

This would be the correct way to do things. I remember back in the
good old days, circa 1985, disk drivers _always_ did their own PHYS
handling so that utilities could pass in whatever value they wanted to
use for the size. Of course, that meant that each driver reinvented
the wheel.
--- Gary Jennejohn

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 14:37:21 2009
Date: Sun, 05 Jul 2009 17:37:14 +0300
From: Alexander Motin <mav@FreeBSD.org>
To: Bruce Evans
Cc: gary.jennejohn@freenet.de, freebsd-arch@FreeBSD.org
Message-ID: <4A50BA9A.9080005@FreeBSD.org>
In-Reply-To: <20090705223126.I42918@delplex.bde.org>
Subject: Re: DFLTPHYS vs MAXPHYS

Bruce Evans wrote:
> On Sun, 5 Jul 2009, Alexander Motin wrote:
>>>> Isn't it time to review their values with an eye to increasing
>>>> them? 64KB looks funny compared to modern memory sizes and data
>>>> rates. It just increases interrupt rates, and I don't think it
>>>> really needs to be so small to improve interactivity now.
>
> 64K is large enough to bust modern L1 caches and old L2 caches. Make
> the size bigger and it busts modern L2 caches too. Interrupt rates
> don't matter when you are transferring 64K items per interrupt.

How is cache size related to this, if DMA transfers data directly to
RAM? Sure, the CPU will invalidate the related cache lines, but why
should it invalidate everything?

Small transfers give more work to all levels, from GEOM down to
CAM/ATA, the controllers and the drives. It is not just context
switching.

>>> I wonder whether all drivers can correctly handle larger values for
>>> DFLTPHYS.
>
> Most can't, since their hardware can't. They can fake it (ata used
> to), but there is negative value in this for most drivers, since geom
> already reblocks for disk devices, and reblocking would be wrong for
> devices like tapes.

I am not speaking about reblocking. I am speaking about the best
possible hardware usage. I can't speak for most hardware, but at least
AHCI and the modern SiI SATA chips I have worked with closely have
practically no limit on transaction size, except for the amount of
memory their drivers allocate for the S/G table. My new drivers are
able to self-tune for any MAXPHYS value.
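A sketch of what such self-tuning can mean in busdma terms: size the
DMA tag's segment count from MAXPHYS, so that raising MAXPHYS
automatically grows the S/G table. The alignment and boundary values
below are placeholders; a real driver uses its controller's documented
constraints:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bus.h>
    #include <machine/bus.h>

    /*
     * A MAXPHYS-sized buffer that is not page aligned needs at most
     * MAXPHYS / PAGE_SIZE + 1 S/G segments, so the tag follows
     * MAXPHYS instead of hard-coding a transfer size.
     */
    static int
    xx_create_data_tag(bus_dma_tag_t parent, bus_dma_tag_t *tagp)
    {
        return (bus_dma_tag_create(parent,
            1, 0,                       /* alignment, boundary */
            BUS_SPACE_MAXADDR,          /* lowaddr */
            BUS_SPACE_MAXADDR,          /* highaddr */
            NULL, NULL,                 /* filter, filterarg */
            MAXPHYS,                    /* maxsize */
            MAXPHYS / PAGE_SIZE + 1,    /* nsegments */
            MAXPHYS,                    /* maxsegsize */
            0,                          /* flags */
            NULL, NULL,                 /* lockfunc, lockfuncarg */
            tagp));
    }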
-- Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 16:46:40 2009
Date: Mon, 6 Jul 2009 02:46:37 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin
Cc: gary.jennejohn@freenet.de, freebsd-arch@FreeBSD.org
Message-ID: <20090706005851.L1439@besplex.bde.org>
In-Reply-To: <4A50BA9A.9080005@FreeBSD.org>
Subject: Re: DFLTPHYS vs MAXPHYS

On Sun, 5 Jul 2009, Alexander Motin wrote:

> Bruce Evans wrote:
>> On Sun, 5 Jul 2009, Alexander Motin wrote:
>>>>> Isn't it time to review their values with an eye to increasing
>>>>> them? 64KB looks funny compared to modern memory sizes and data
>>>>> rates. It just increases interrupt rates, and I don't think it
>>>>> really needs to be so small to improve interactivity now.
>>
>> 64K is large enough to bust modern L1 caches and old L2 caches. Make
>> the size bigger and it busts modern L2 caches too. Interrupt rates
>> don't matter when you are transferring 64K items per interrupt.
>
> How is cache size related to this, if DMA transfers data directly to
> RAM? Sure, the CPU will invalidate the related cache lines, but why
> should it invalidate everything?

I was thinking more of transfers to userland. Increasing user buffer
sizes above about half the L2 cache size guarantees busting the L2
cache, if the application actually looks at all of its data. If the
data is read using read(), then the L2 cache will be busted twice (or
a bit less with nontemporal copying), first by copying out the data
and then by looking at it. If the data is read using mmap(), then the
L2 cache will only be busted once. This effect has always been very
noticeable using dd. Larger buffer sizes are also bad for latency.

> Small transfers give more work to all levels, from GEOM down to
> CAM/ATA, the controllers and the drives. It is not just context
> switching.

Yes, I can't see any cache busting below the level of copyout(). Also,
after you convert all applications to use mmap() instead of read(),
the cache busting should become per-CPU.

>>>> I wonder whether all drivers can correctly handle larger values
>>>> for DFLTPHYS.
>>
>> Most can't, since their hardware can't.
>> They can fake it (ata used to), but there is negative value in this
>> for most drivers, since geom already reblocks for disk devices, and
>> reblocking would be wrong for devices like tapes.
>
> I am not speaking about reblocking. I am speaking about the best
> possible hardware usage. I can't speak for most hardware, but at
> least AHCI and the modern SiI SATA chips I have worked with closely
> have practically no limit on transaction size, except for the amount
> of memory their drivers allocate for the S/G table. My new drivers
> are able to self-tune for any MAXPHYS value.

The main limit above ata seems to be only MAXPHYS and its use in
pbufs. DFLTPHYS seems to be used only in buggy, unimportant cases.

Bruce

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 17:12:16 2009
Date: Sun, 05 Jul 2009 20:12:08 +0300
From: Alexander Motin <mav@FreeBSD.org>
To: Bruce Evans
Cc: freebsd-arch@FreeBSD.org
Message-ID: <4A50DEE8.6080406@FreeBSD.org>
In-Reply-To: <20090706005851.L1439@besplex.bde.org>
Subject: Re: DFLTPHYS vs MAXPHYS

Bruce Evans wrote:
> On Sun, 5 Jul 2009, Alexander Motin wrote:
>> Bruce Evans wrote:
>>> 64K is large enough to bust modern L1 caches and old L2 caches.
>>> Make the size bigger and it busts modern L2 caches too. Interrupt
>>> rates don't matter when you are transferring 64K items per
>>> interrupt.
>>
>> How is cache size related to this, if DMA transfers data directly to
>> RAM? Sure, the CPU will invalidate the related cache lines, but why
>> should it invalidate everything?
>
> I was thinking more of transfers to userland. Increasing user buffer
> sizes above about half the L2 cache size guarantees busting the L2
> cache, if the application actually looks at all of its data. If the
> data is read using read(), then the L2 cache will be busted twice (or
> a bit less with nontemporal copying), first by copying out the data
> and then by looking at it. If the data is read using mmap(), then the
> L2 cache will only be busted once. This effect has always been very
> noticeable using dd. Larger buffer sizes are also bad for latency.
>
>> Small transfers give more work to all levels, from GEOM down to
>> CAM/ATA, the controllers and the drives. It is not just context
>> switching.
>
> Yes, I can't see any cache busting below the level of copyout().
> Also, after you convert all applications to use mmap() instead of
> read(), the cache busting should become per-CPU.

As file data usually passes via the buffer cache, it will be read into
different memory areas and copied out from them anyway. So I don't see
much difference there between doing a single big transaction and
several small ones. Cache trashing at user level will also depend only
on the user-level application's buffer size, not on the kernel's.

How do I reproduce that dd experiment? I have my system running with a
MAXPHYS of 512K, and here is what I get:

# dd if=/dev/ada0 of=/dev/null bs=512k count=1000
1000+0 records in
1000+0 records out
524288000 bytes transferred in 2.471564 secs (212128024 bytes/sec)
# dd if=/dev/ada0 of=/dev/null bs=256k count=2000
2000+0 records in
2000+0 records out
524288000 bytes transferred in 2.666643 secs (196609752 bytes/sec)
# dd if=/dev/ada0 of=/dev/null bs=128k count=4000
4000+0 records in
4000+0 records out
524288000 bytes transferred in 2.759498 secs (189993969 bytes/sec)
# dd if=/dev/ada0 of=/dev/null bs=64k count=8000
8000+0 records in
8000+0 records out
524288000 bytes transferred in 2.718900 secs (192830927 bytes/sec)

CPU load instead grows from 10% at 512K to 15% at 64K. Maybe the
trashing effect will only be noticeable with blocks comparable to the
cache size, but modern CPUs have megabytes of cache.

-- Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 18:32:15 2009
Date: Mon, 6 Jul 2009 04:32:11 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin
Cc: freebsd-arch@freebsd.org
Message-ID: <20090706034250.C2240@besplex.bde.org>
In-Reply-To: <4A50DEE8.6080406@FreeBSD.org>
Subject: Re: DFLTPHYS vs MAXPHYS

On Sun, 5 Jul 2009, Alexander Motin wrote:

> Bruce Evans wrote:
>> I was thinking more of transfers to userland. Increasing user buffer
>> sizes above about half the L2 cache size guarantees busting the L2
>> cache, if the application actually looks at all of its data.
>> If the data is read using read(), then the L2 cache will be busted
>> twice (or a bit less with nontemporal copying), first by copying out
>> the data and then by looking at it. If the data is read using
>> mmap(), then the L2 cache will only be busted once. This effect has
>> always been very noticeable using dd. Larger buffer sizes are also
>> bad for latency.
> ...
> How do I reproduce that dd experiment? I have my system running with
> a MAXPHYS of 512K, and here is what I get:

I used a regular file with the same size as main memory (1G), and for
today's test, not quite dd, but a program that throws away the data
(so as to avoid the overhead of write syscalls) and prints status info
in a more suitable form than even dd's ^T.

Your results show that physio() behaves quite differently from reading
a regular file. I see similar behaviour for input from a disk file.

> # dd if=/dev/ada0 of=/dev/null bs=512k count=1000
> 1000+0 records in
> 1000+0 records out
> 524288000 bytes transferred in 2.471564 secs (212128024 bytes/sec)

512MB would be too small with buffering for a regular file, but should
be OK with a disk file.

> # dd if=/dev/ada0 of=/dev/null bs=256k count=2000
> 2000+0 records in
> 2000+0 records out
> 524288000 bytes transferred in 2.666643 secs (196609752 bytes/sec)
> # dd if=/dev/ada0 of=/dev/null bs=128k count=4000
> 4000+0 records in
> 4000+0 records out
> 524288000 bytes transferred in 2.759498 secs (189993969 bytes/sec)
> # dd if=/dev/ada0 of=/dev/null bs=64k count=8000
> 8000+0 records in
> 8000+0 records out
> 524288000 bytes transferred in 2.718900 secs (192830927 bytes/sec)
>
> CPU load instead grows from 10% at 512K to 15% at 64K. Maybe the
> trashing effect will only be noticeable with blocks comparable to the
> cache size, but modern CPUs have megabytes of cache.

I used systat -v to estimate the load. Its average jumps around more
than I like, but I don't have anything better. Sys time from dd and
others is even more useless than it used to be, since much of the i/o
runs in threads and the system doesn't know how to charge the
application for thread time.

My results (MAXPHYS is 64K, transfer rate 50MB/S, under FreeBSD-~5.2
de-geomed):

regular file:

block size  %idle
----------  -----
1M          87
16K         91
4K          88 (?)
512         72 (?)

disk file:

block size  %idle
----------  -----
1M          96
64K         96
32K         93
16K         87
8K          82 (firmware can't keep up and rate drops to 37MB/S)

In the case of the regular file, almost all i/o is clustered, so the
driver sees mainly the cluster size (driver max size of 64K before
geom). Upper layers then do a good job of adding only a few percent
CPU when declustering to 16K fs-blocks.

In the case of the disk file, I can't explain why the overhead is so
low (~0.5% intr, 3.5% sys) for large block sizes. Uncached copies on
the test machine go at 850MB/S, so 50MB/S should take 1/19 of the CPU,
or 5.3%. Another difference with the disk file test is that physio()
uses a single pbuf, so the test doesn't thrash the buffer cache's
memory. dd of a large regular file will thrash the L2 cache even if
the user buffer size is small, but still goes faster with a smaller
user buffer since the user buffer stays cached.

Faster disks will of course want larger block sizes. I'm still
surprised that this makes more difference to CPU than to throughput.
Maybe it doesn't really, but the measurement becomes differently
accurate when the CPU becomes more loaded. At 100% load there would be
nowhere to hide things like speculative cache fetches.
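To make the read()-versus-mmap() point concrete, a minimal userland
sketch of the mmap() variant, which looks at the file data in place
and so passes it through the cache only once (error handling kept
minimal; this is not the test program used above):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Sum a file's bytes via mmap(): the data is accessed in place, so
     * it crosses the CPU cache once, instead of once for copyout() and
     * again for the application's own access as with read().
     */
    int
    main(int argc, char **argv)
    {
        struct stat st;
        unsigned char *p;
        unsigned long sum;
        off_t i;
        int fd;

        if (argc != 2)
            errx(1, "usage: %s file", argv[0]);
        if ((fd = open(argv[1], O_RDONLY)) == -1 || fstat(fd, &st) == -1)
            err(1, "%s", argv[1]);
        p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            err(1, "mmap");
        for (sum = 0, i = 0; i < st.st_size; i++)
            sum += p[i];
        printf("%lu\n", sum);
        return (0);
    }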
Bruce

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 18:51:13 2009
Date: Sun, 05 Jul 2009 21:51:05 +0300
From: Alexander Motin <mav@FreeBSD.org>
To: Bruce Evans
Cc: freebsd-arch@freebsd.org
Message-ID: <4A50F619.4020101@FreeBSD.org>
In-Reply-To: <20090706034250.C2240@besplex.bde.org>
Subject: Re: DFLTPHYS vs MAXPHYS

Bruce Evans wrote:
> On Sun, 5 Jul 2009, Alexander Motin wrote:
>
>> Bruce Evans wrote:
>>> I was thinking more of transfers to userland. Increasing user
>>> buffer sizes above about half the L2 cache size guarantees busting
>>> the L2 cache, if the application actually looks at all of its data.
>>> If the data is read using read(), then the L2 cache will be busted
>>> twice (or a bit less with nontemporal copying), first by copying
>>> out the data and then by looking at it. If the data is read using
>>> mmap(), then the L2 cache will only be busted once. This effect has
>>> always been very noticeable using dd. Larger buffer sizes are also
>>> bad for latency.
>> ...
>> How do I reproduce that dd experiment? I have my system running with
>> a MAXPHYS of 512K, and here is what I get:
>
> I used a regular file with the same size as main memory (1G), and for
> today's test, not quite dd, but a program that throws away the data
> (so as to avoid the overhead of write syscalls) and prints status
> info in a more suitable form than even dd's ^T.
>
> Your results show that physio() behaves quite differently from
> reading a regular file. I see similar behaviour for input from a disk
> file.
>
>> # dd if=/dev/ada0 of=/dev/null bs=512k count=1000
>> 1000+0 records in
>> 1000+0 records out
>> 524288000 bytes transferred in 2.471564 secs (212128024 bytes/sec)
>
> 512MB would be too small with buffering for a regular file, but
> should be OK with a disk file.
>
>> # dd if=/dev/ada0 of=/dev/null bs=256k count=2000
>> 2000+0 records in
>> 2000+0 records out
>> 524288000 bytes transferred in 2.666643 secs (196609752 bytes/sec)
>> # dd if=/dev/ada0 of=/dev/null bs=128k count=4000
>> 4000+0 records in
>> 4000+0 records out
>> 524288000 bytes transferred in 2.759498 secs (189993969 bytes/sec)
>> # dd if=/dev/ada0 of=/dev/null bs=64k count=8000
>> 8000+0 records in
>> 8000+0 records out
>> 524288000 bytes transferred in 2.718900 secs (192830927 bytes/sec)
>>
>> CPU load instead grows from 10% at 512K to 15% at 64K. Maybe the
>> trashing effect will only be noticeable with blocks comparable to
>> the cache size, but modern CPUs have megabytes of cache.
>
> I used systat -v to estimate the load. Its average jumps around more
> than I like, but I don't have anything better. Sys time from dd and
> others is even more useless than it used to be, since much of the i/o
> runs in threads and the system doesn't know how to charge the
> application for thread time.
>
> My results (MAXPHYS is 64K, transfer rate 50MB/S, under FreeBSD-~5.2
> de-geomed):
>
> regular file:
>
> block size  %idle
> ----------  -----
> 1M          87
> 16K         91
> 4K          88 (?)
> 512         72 (?)
>
> disk file:
>
> block size  %idle
> ----------  -----
> 1M          96
> 64K         96
> 32K         93
> 16K         87
> 8K          82 (firmware can't keep up and rate drops to 37MB/S)
>
> In the case of the regular file, almost all i/o is clustered, so the
> driver sees mainly the cluster size (driver max size of 64K before
> geom). Upper layers then do a good job of adding only a few percent
> CPU when declustering to 16K fs-blocks.

In these tests you've got almost only the negative side of the effect,
as you have said, due to cache misses. Do you really have a CPU with
so small an L2 cache? Some kind of P3 or old Celeron? But with a 64K
MAXPHYS you just didn't get any benefit from using a bigger block
size.
-- Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 19:16:35 2009
Date: Sun, 05 Jul 2009 22:16:27 +0300
From: Alexander Motin <mav@FreeBSD.org>
To: Adrian Chadd
Cc: freebsd-arch@freebsd.org
Message-ID: <4A50FC0B.9090601@FreeBSD.org>
Subject: Re: DFLTPHYS vs MAXPHYS

Adrian Chadd wrote:
> 2009/7/6 Alexander Motin:
>
>> In these tests you've got almost only the negative side of the
>> effect, as you have said, due to cache misses. Do you really have a
>> CPU with so small an L2 cache? Some kind of P3 or old Celeron? But
>> with a 64K MAXPHYS you just didn't get any benefit from using a
>> bigger block size.
>
> All the world isn't your current desktop box with only SATA devices
> :)

This is a laptop, and what do you mean by "only SATA"? Do you know of
any storage whose performance degrades with big transactions?

> There have been and will be plenty of little embedded CPUs with tiny
> amounts of cache for quite some time to come.

Fine, let's set it to 8K on ARM. What do you want to say by that?

> You're also doing simple stream IO tests. Please re-think the thought
> experiment with a whole lot of parallel IO going on rather than just
> straight single-stream IO.

Please don't. Parallel access with big blocks just becomes more linear
as the block length grows. For modern drives with >100MB/s speeds and
10ms access times, it is just madness to transfer less than 1MB per
transaction with random access.

> Also, please realise that part of having your cache thrashed is what
> it does to the performance of -other- code. dd may be fast, but if
> you're constantly purging your caches by copying around all of that
> data, subsequent code has to go and freshen the cache again. On older
> and anaemic embedded/low-power boxes the cost of a cache miss vs a
> cache hit can still be quite expensive.

I think that anaemic embedded/low-power boxes will prefer to have the
operation handled by the chipset hardware as much as possible, without
interrupting the CPU.

Also, please read one of my previous posts. I don't see why, with, for
example, a 1M user-level buffer, buffer-cache-backed access split into
many small disk transactions would trash the CPU cache any less.
It just transfers the same amount of data into the same buffer cache
memory addresses. It is not the disk transaction DMA size that trashes
the cache. If you want to fight cache trashing - OK, but not there.

-- Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Sun Jul 5 19:25:39 2009
Date: Mon, 6 Jul 2009 02:58:36 +0800
From: Adrian Chadd <adrian.chadd@gmail.com>
To: Alexander Motin
Cc: freebsd-arch@freebsd.org
In-Reply-To: <4A50F619.4020101@FreeBSD.org>
Subject: Re: DFLTPHYS vs MAXPHYS

2009/7/6 Alexander Motin:
> In these tests you've got almost only the negative side of the
> effect, as you have said, due to cache misses. Do you really have a
> CPU with so small an L2 cache? Some kind of P3 or old Celeron? But
> with a 64K MAXPHYS you just didn't get any benefit from using a
> bigger block size.

All the world isn't your current desktop box with only SATA devices :)

There have been and will be plenty of little embedded CPUs with tiny
amounts of cache for quite some time to come.

You're also doing simple stream IO tests. Please re-think the thought
experiment with a whole lot of parallel IO going on rather than just
straight single-stream IO.

Also, please realise that part of having your cache thrashed is what
it does to the performance of -other- code.
dd may be fast, but if you're constantly purging your caches by
copying around all of that data, subsequent code has to go and freshen
the cache again. On older and anaemic embedded/low-power boxes the
cost of a cache miss vs a cache hit can still be quite expensive.

2c,

Adrian

From owner-freebsd-arch@FreeBSD.ORG Mon Jul 6 01:14:21 2009
Date: Sun, 5 Jul 2009 18:14:20 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: freebsd-arch@freebsd.org
Message-Id: <200907060114.n661EK68065706@apollo.backplane.com>
Subject: Re: DFLTPHYS vs MAXPHYS

I think MAXPHYS, or the equivalent, is still used somewhat in the
clustering code. The number of buffers the clustering code decides to
chain together dictates the impact on the actual device. The relevancy
here has very little to do with cache smashing and more to do with
optimizing disk seeks (or network latency). There is no best value for
this. It is only marginally more interesting for a network interface,
due to the fact that most links still run with absurdly small MTUs
(even 9000+ is absurdly small). It is entirely uninteresting for a
SATA or other modern disk link.

For linear transfers you only need a value sufficiently large to
reduce the impact of command overhead on the cpu and achieve the
device's maximum linear transfer rate -- for example, doing a dd with
bs=512 versus bs=32k. It runs on a curve, and there will generally be
very little additional bang for the buck beyond 64K for a linear
transfer (assuming read-ahead and NCQ to reduce inter-command
latency).

For random and semi-random transfers, larger buffer sizes have two
impacts. First, a negative impact on seek times. A random seek-read of
16K is faster than a random seek-read of 64K, which is faster than a
random seek-read of 512K. I did a ton of testing with HAMMER and it
just didn't make much sense to go beyond 128K, frankly, but neither
does it make sense to use something really tiny like 8K. 32K-128K
seems to be the sweet spot. Second, a positive impact on reducing the
total number of seeks *IF* you have reasonable cache locality of
reference.

There is no correct value; it depends heavily on the access pattern.
A random access pattern with very little locality of reference will
benefit from a smaller block size, while a random access pattern with
high locality of reference will benefit from a larger block size.
That's all there is to it.

I have a fairly negative opinion of trying to tune block size to cpu
caches. I don't think it matters nearly as much as tuning it to the
seek/locality-of-reference performance curve, and I don't feel that
contrived linear tests are all that interesting since they don't
really reflect real-life workloads.

On-drive caching has an impact too, but that's another conversation.
Vendors have been known to intentionally degrade drive cache
performance on consumer drives versus commercial drives. I've often
hit limitations in testing HAMMER which seem to be contrived by
vendors; without them I could have used a smaller block size and still
gotten the locality of reference, but I wind up having to use a larger
one because the drive cache doesn't behave sanely.

--

The DMA ability of modern devices and device drivers is pretty much
moot, as no self-respecting disk controller chipset is limited to a
measly 64K max transfer any more. AHCI certainly has no issue doing in
excess of a megabyte. The limit is something like 65535 chained
entries for AHCI. I forget what the spec says exactly, but it's
basically more than we'd ever really need. Nobody should really care
about the performance of a chipset that is limited to a 64K max
transfer.

As long as the cluster code knows what the device can do, and the
filesystem doesn't try to use a larger block size than the device is
capable of in a single BIO, the cluster code will make up the
difference for any device-based limitations.

-Matt
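To illustrate the curve Matt describes for linear transfers, a toy
model: with a fixed per-command overhead, throughput is
bs / (overhead + bs / media_rate), which flattens out well before very
large block sizes. Both constants below are made up:

    #include <stdio.h>

    /*
     * Print modelled throughput against block size.  The curve climbs
     * steeply at small sizes and is nearly flat past 64K, matching the
     * "little additional bang for the buck" observation.
     */
    int
    main(void)
    {
        double media_rate = 63e6;   /* drive linear rate, bytes/sec */
        double overhead = 60e-6;    /* per-command overhead, seconds */

        for (int bs = 512; bs <= 1 << 20; bs <<= 1)
            printf("bs=%-8d %6.2f MB/s\n", bs,
                bs / (overhead + bs / media_rate) / 1e6);
        return (0);
    }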
From owner-freebsd-arch@FreeBSD.ORG Mon Jul 6 15:54:18 2009
Date: Tue, 7 Jul 2009 01:54:14 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin
Cc: freebsd-arch@freebsd.org
Message-ID: <20090707011217.O43961@delplex.bde.org>
In-Reply-To: <4A50F619.4020101@FreeBSD.org>
Subject: Re: DFLTPHYS vs MAXPHYS

On Sun, 5 Jul 2009, Alexander Motin wrote:

> Bruce Evans wrote:
>> My results (MAXPHYS is 64K, transfer rate 50MB/S, under FreeBSD-~5.2
>> de-geomed):
>>
>> regular file:
>>
>> block size  %idle
>> ----------  -----
>> 1M          87
>> 16K         91
>> 4K          88 (?)
>> 512         72 (?)
>>
>> disk file:
>>
>> block size  %idle
>> ----------  -----
>> 1M          96
>> 64K         96
>> 32K         93
>> 16K         87
>> 8K          82 (firmware can't keep up and rate drops to 37MB/S)
>>
>> In the case of the regular file, almost all i/o is clustered, so the
>> driver sees mainly the cluster size (driver max size of 64K before
>> geom). Upper layers then do a good job of adding only a few percent
>> CPU when declustering to 16K fs-blocks.
>
> In these tests you've got almost only the negative side of the
> effect, as you have said, due to cache misses.

No, I got negative and positive for the regular file (due to cache
misses for large block sizes and too many transactions for very small
block sizes (< 16K)), and only positive for the disk file (due to
cache misses not being tested).

> Do you really have a CPU with so small an L2 cache? Some kind of P3
> or old Celeron?

It is 1M, on an A64 (not stated before). Since the disk file case uses
a pbuf, it only thrashes about half as much cache as the regular file,
provided the used part of the pbuf data is small compared with the
cache size. I forgot to test with a user buffer size of 2M.

> But with a 64K MAXPHYS you just didn't get any benefit from using a
> bigger block size.

MAXPHYS is 128K.
The ata driver has a limit of 64K, so anything larger than 64K
wouldn't do much except increase cache misses. In physio(), it would
just cause physio() to ask the driver to read 64K at a time. My claim
is partly that 64K is such a large size that the extra CPU caused by
splitting up into 64K blocks is insignificant.

Here are better results for the disk file test, with cache accesses
and misses counted by perfmon:

% dd if=/dev/ad2 of=/dev/null bs=16384 count=16384
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.857302 secs (55264313 bytes/sec)
% 146378905
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.782373 secs (56130180 bytes/sec)
% 946562
% dd if=/dev/ad2 of=/dev/null bs=32768 count=8192
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.715802 secs (56922546 bytes/sec)
% 79404995
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.749098 secs (56523463 bytes/sec)
% 640427
% dd if=/dev/ad2 of=/dev/null bs=65536 count=4096
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.740766 secs (56622802 bytes/sec)
% 45633277
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.882316 secs (54981173 bytes/sec)
% 424469

Cache misses are minimized here using a user buffer size of 64K.

% dd if=/dev/ad2 of=/dev/null bs=131072 count=2048
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.873972 secs (55075298 bytes/sec)
% 42296347
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.940565 secs (54332946 bytes/sec)
% 497104
% dd if=/dev/ad2 of=/dev/null bs=262144 count=1024
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.982193 secs (53878976 bytes/sec)
% 38617107
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.715697 secs (56923816 bytes/sec)
% 522888
% dd if=/dev/ad2 of=/dev/null bs=524288 count=512
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.957179 secs (54150849 bytes/sec)
% 37115853
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.923855 secs (54517338 bytes/sec)
% 521308
% dd if=/dev/ad2 of=/dev/null bs=1048576 count=256
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.707334 secs (57024946 bytes/sec)
% 36526303

Cache accesses are minimized here using a user buffer size of 1M.

% # s/kx-dc-misses
% 268435456 bytes transferred in 4.715655 secs (56924319 bytes/sec)
% 541909
% dd if=/dev/ad2 of=/dev/null bs=2097152 count=128
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.715631 secs (56924610 bytes/sec)
% 36628946
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.707306 secs (57025284 bytes/sec)
% 534541

Cache misses are only increased a little here with a user buffer size
of 2M. I can't explain this. Maybe I misremember my CPU's cache size.

% dd if=/dev/ad2 of=/dev/null bs=4194304 count=64
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.965433 secs (54060837 bytes/sec)
% 37688487
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.740570 secs (56625145 bytes/sec)
% 2443717

Cache misses increased by a factor of 5 going from user buffer size 2M
to 4M.
% dd if=/dev/ad2 of=/dev/null bs=8388608 count=32
% # s/kx-dc-accesses
% 268435456 bytes transferred in 5.056997 secs (53081988 bytes/sec)
% 39425354
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.907099 secs (54703493 bytes/sec)
% 589090
% dd if=/dev/ad2 of=/dev/null bs=16777216 count=16
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.998672 secs (53701354 bytes/sec)
% 49361807
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.732208 secs (56725202 bytes/sec)
% 603496
% dd if=/dev/ad2 of=/dev/null bs=33554432 count=8
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.965315 secs (54062119 bytes/sec)
% 61536416
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.882041 secs (54984269 bytes/sec)
% 3947985
% dd if=/dev/ad2 of=/dev/null bs=67108864 count=4
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.857003 secs (55267715 bytes/sec)
% 78234741
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.931896 secs (54428448 bytes/sec)
% 8580752
% dd if=/dev/ad2 of=/dev/null bs=134217728 count=2
% # s/kx-dc-accesses
% 268435456 bytes transferred in 4.815146 secs (55748145 bytes/sec)
% 124758517
% # s/kx-dc-misses
% 268435456 bytes transferred in 4.865137 secs (55175312 bytes/sec)
% 13808781

Cache misses increased by another factor of 5 going from user buffer
size 4M to 128M. I can't explain why there are as many as 13.8 million
-- I would have expected only 2*256M/64 = 8M, but in more cases. 8
million cache misses in only 4.8 seconds is a lot, and you would get
that many in only 1.3 seconds at 200MB/S. Of course, 128M is a silly
buffer size, but I would expect the cache effects to show up at about
half the L2 size under more realistic loads.

Cache accesses varied significantly, between 146 million (block size
16384), 37 million (block size 1M) and 138 million (block size 128M).
I can only partly explain this. I think the minimum number is
2*256M/16 = 32M (for fetching from L2 to L1 16 bytes at a time). 128M
might result from fetching 4 bytes at a time or thrashing causing the
equivalent.
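For the record, the arithmetic behind the expected 8M figure, as a
trivial sketch (the factor of two covers copyout() plus the program's
own access; the 64-byte cache line size is assumed):

    #include <stdio.h>

    /*
     * Expected compulsory cache misses for streaming reads: every byte
     * moves through the cache twice, and each 64-byte line faulted in
     * costs one miss: 2 * 256M / 64 = 8M.
     */
    int
    main(void)
    {
        long long bytes = 256LL * 1024 * 1024;  /* data transferred */
        int passes = 2;     /* copyout + application access */
        int line = 64;      /* cache line size in bytes */

        printf("%lld expected misses\n", bytes * passes / line);
        return (0);
    }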
Bruce

From owner-freebsd-arch@FreeBSD.ORG Mon Jul 6 17:00:58 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3B9EC1065677 for ; Mon, 6 Jul 2009 17:00:58 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from cmail.optima.ua (cmail.optima.ua [195.248.191.121]) by mx1.freebsd.org (Postfix) with ESMTP id 926848FC1A for ; Mon, 6 Jul 2009 17:00:57 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from [212.86.226.226] (account mav@alkar.net HELO mavbook.mavhome.dp.ua) by cmail.optima.ua (CommuniGate Pro SMTP 5.2.9) with ESMTPSA id 247817642; Mon, 06 Jul 2009 20:00:53 +0300 Message-ID: <4A522DC1.2080908@FreeBSD.org> Date: Mon, 06 Jul 2009 20:00:49 +0300 From: Alexander Motin User-Agent: Thunderbird 2.0.0.21 (X11/20090405) MIME-Version: 1.0 To: Bruce Evans References: <4A4FAA2D.3020409@FreeBSD.org> <20090705100044.4053e2f9@ernst.jennejohn.org> <4A50667F.7080608@FreeBSD.org> <20090705223126.I42918@delplex.bde.org> <4A50BA9A.9080005@FreeBSD.org> <20090706005851.L1439@besplex.bde.org> <4A50DEE8.6080406@FreeBSD.org> <20090706034250.C2240@besplex.bde.org> <4A50F619.4020101@FreeBSD.org> <20090707011217.O43961@delplex.bde.org> In-Reply-To: <20090707011217.O43961@delplex.bde.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 06 Jul 2009 17:00:58 -0000

Bruce Evans wrote:
> On Sun, 5 Jul 2009, Alexander Motin wrote:
>> In these tests you've got almost only the negative side of the effect,
>> as you have said, due to cache misses.
>
> No, I got negative and positive for the regular file (due to cache misses
> for large block sizes and too many transactions for very small block sizes
> (< 16K)), and only positive for the disk file (due to cache misses not
> being tested).

No, I mean that you didn't get any benefit from increasing the disk I/O transaction size. You were still limited to 64K.

>> But with 64K MAXPHYS you just didn't get any benefit from using a bigger
>> block size.
>
> MAXPHYS is 128K. The ata driver has a limit of 64K so anything larger
> than 64K wouldn't do much except increase cache misses. In physio(),
> it would just cause physio() to ask the driver to read 64K at a time.
> My claim is partly that 64K is such a large size that the extra CPU
> caused by splitting up into 64K-blocks is insignificant.

The ATA subsystem allows drivers to have different transaction sizes; at least the AHCI driver can do more than 64K. As for the overhead being insignificant: I have shown an example where it is not entirely so.

> Here are better results for the disk file test, with cache accesses and
> misses counted by perfmon:
>
> Cache misses are minimized here using a user buffer size of 64K.
>
> Cache accesses are minimized here using a user buffer size of 1M.
>
> Cache misses increased by a factor of 5 going from user buffer size
> 2M to 4M.
>
> Cache misses increased by another factor of 5 going from user buffer
> size 4M to 128M. I can't explain why there are as many as 13.8 million
> -- I would have expected only about 2*256M/64 = 8M. 8
> million cache misses in only 4.8 seconds is a lot, and you would get
> that many in only 1.3 seconds at 200MB/S.
> Of course, 128M is a silly buffer size, but I would expect the cache
> effects to show up at about half the L2 size under more realistic loads.
>
> Cache accesses varied significantly, between 146 million (block size
> 16384), 37 million (block size 1M) and 125 million (block size 128M).
> I can only partly explain this. I think the minimum number is
> 2*256M/16 = 32M (for fetching from L2 to L1 16 bytes at a time).
> The ~128M figure might result from fetching 4 bytes at a time or from
> thrashing causing the equivalent.

I think that with small transaction sizes the cache misses could be caused not by the transferred data itself, but by the different variables addressed by the code. The growing number of misses with bigger blocks is also predictable. Working with a regular file could give different results, since the data would not be read into the same memory each time, but spread over the whole buffer cache.

And once more I want to say that you are not testing the same thing I was speaking about. I agree that an enormous block size at user level will affect cache efficiency negatively, simply because of the large amounts of data moved by the CPU. What I wanted to say is that, IMHO, allowing the device to transfer data in bigger blocks, when needed, will benefit both the I/O hardware and CPU usage, without significantly affecting caching, as the caches are mostly trashed not there, but in completely different places.

-- Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Mon Jul 6 18:12:47 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7A45B1065713 for ; Mon, 6 Jul 2009 18:12:47 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 2D6068FC17 for ; Mon, 6 Jul 2009 18:12:46 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.2/8.14.1) with ESMTP id n66ICkg1075261 for ; Mon, 6 Jul 2009 11:12:46 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.2/8.13.4/Submit) id n66ICkTc075260; Mon, 6 Jul 2009 11:12:46 -0700 (PDT) Date: Mon, 6 Jul 2009 11:12:46 -0700 (PDT) From: Matthew Dillon Message-Id: <200907061812.n66ICkTc075260@apollo.backplane.com> To: freebsd-arch@freebsd.org References: <4A4FAA2D.3020409@FreeBSD.org> <20090705100044.4053e2f9@ernst.jennejohn.org> <4A50667F.7080608@FreeBSD.org> <20090705223126.I42918@delplex.bde.org> <4A50BA9A.9080005@FreeBSD.org> <20090706005851.L1439@besplex.bde.org> <4A50DEE8.6080406@FreeBSD.org> <20090706034250.C2240@besplex.bde.org> <4A50F619.4020101@FreeBSD.org> <20090707011217.O43961@delplex.bde.org> <4A522DC1.2080908@FreeBSD.org> Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 06 Jul 2009 18:12:48 -0000

Linear dd

 tty             da0              cpu
 tin tout   KB/t   tps   MB/s  us ni sy in id
   0   11   0.50 17511   8.55   0  0 15  0 85   bs=512
   0   11   1.00 16108  15.73   0  0 12  0 87   bs=1024
   0   11   2.00 14758  28.82   0  0 11  0 89   bs=2048
   0   11   4.00 12195  47.64   0  0  7  0 93   bs=4096
   0   11   8.00  8026  62.70   0  0  5  0 95   bs=8192    << MB/s breakpt
   0   11  16.00  4018  62.78   0  0  4  0 96   bs=16384
   0   11  32.00  2025  63.28   0  0  2  0 98   bs=32768   << id breakpt
   0   11  64.00  1004  62.75   0  0  1  0 99   bs=65536
   0   11 128.00   506  63.25   0  0  1  0 99   bs=131072
Random seek/read

 tty             da0              cpu
 tin tout   KB/t   tps   MB/s  us ni sy in id
   0   11   0.50   189   0.09   0  0  0  0 100  bs=512
   0   11   1.00   184   0.18   0  0  0  0 100  bs=1024
   0   11   2.00   177   0.35   0  0  0  0 100  bs=2048
   0   11   4.00   175   0.68   0  0  0  0 100  bs=4096
   0   11   8.00   172   1.34   0  0  0  0 100  bs=8192
   0   11  16.00   166   2.59   0  0  0  0 100  bs=16384
   0   11  32.00   159   4.97   0  0  1  0 99   bs=32768
   0   11  64.00   142   8.87   0  0  0  0 100  bs=65536
   0   11 128.00   117  14.62   0  0  0  0 100  bs=131072
                   ^^^   ^^^
                   note TPS rate and MB/s

Which is the more important tuning variable? Efficiency of linear reads or saving re-seeks by buffering more data? If you didn't choose saving re-seeks you lose.

To go from 16K to 32K requires saving 5% of future re-seeks to break even.
To go from 32K to 64K requires saving 11% of future re-seeks.
To go from 64K to 128K requires saving 18% of future re-seeks.
(at least with this particular disk)

At the point where the block size exceeds 32768, if you aren't saving re-seeks with locality of reference from the additional cached data, you lose. If you are saving re-seeks you win. cpu caches do not enter into the equation at all.

For most filesystems the re-seeks being saved depend on the access pattern. For example, if you are doing an ls -lR or a find, the re-seek pattern will be related to inode and directory lookups. The number of inodes which fit in a cluster_read(), assuming reasonable locality of reference, will wind up determining the performance.

However, as the buffer size grows, the total number of bytes you are able to cache becomes the dominant factor in calculating the re-seek efficiency. I don't have a graph for that but, ultimately, it means that reading very large blocks (i.e. 1MB) with a non-linear access pattern is bad because most of the additional data cached will never be used before the memory winds up being re-used to cache some other cluster.

Another thing to note here is that command transfer overhead also becomes mostly irrelevant once you hit 32K, even if you have a lot of discrete disks. I/O's of less than 8KB are clearly wasteful of resources (in my test even a linear transfer couldn't achieve the bandwidth ceiling of the device). I/O's greater than 32K are clearly dependent on saving re-seeks. Note in particular that the data transfer rate for random I/O doubles as the buffer size doubles when you have a random access pattern (because seek times are so long). In other words, it's a huge win if you are actually able to save future re-seeks by caching the additional data.

What this all means is that cpu caches are basically irrelevant when it comes to hard drive I/O. You are either saving enough re-seeks to make up for the greater seek latency or you aren't. One re-seek is something like 7ms. 7ms is a LONG time, which is why the cpu caches are irrelevant for choosing the block size. One can bean-count cache misses all day long but it won't make the machine perform any better in this case.
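The break-even figures can be rederived from the TPS column of the random-read table: doubling the block size is a win only if the extra cached data saves at least a fraction 1 - tps_big/tps_small of future re-seeks. A quick sketch using the measured numbers (it prints roughly the 5%/11%/18% above, modulo rounding):

	/*
	 * Rederive the break-even percentages from the measured
	 * random-read TPS numbers above.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		double tps[] = { 166, 159, 142, 117 };	/* bs = 16K..128K */
		const char *step[] = { "16K->32K", "32K->64K", "64K->128K" };
		int i;

		for (i = 0; i < 3; i++)
			printf("%s: must save %.1f%% of re-seeks to break even\n",
			    step[i], 100.0 * (1.0 - tps[i + 1] / tps[i]));
		return (0);
	}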
-Matt
From owner-freebsd-arch@FreeBSD.ORG Tue Jul 7 13:26:34 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A7D2D1065670 for ; Tue, 7 Jul 2009 13:26:34 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from cmail.optima.ua (cmail.optima.ua [195.248.191.121]) by mx1.freebsd.org (Postfix) with ESMTP id E3D6C8FC17 for ; Tue, 7 Jul 2009 13:26:33 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from orphanage.alkar.net (account mav@alkar.net [212.86.226.11] verified) by cmail.optima.ua (CommuniGate Pro SMTP 5.2.9) with ESMTPA id 247912929; Tue, 07 Jul 2009 16:26:30 +0300 Message-ID: <4A534D05.1040709@FreeBSD.org> Date: Tue, 07 Jul 2009 16:26:29 +0300 From: Alexander Motin User-Agent: Thunderbird 2.0.0.14 (X11/20080612) MIME-Version: 1.0 To: Matthew Dillon References: <1246746182.00135530.1246735202@10.7.7.3> <1246792983.00135712.1246781401@10.7.7.3> <1246796580.00135722.1246783203@10.7.7.3> <1246814582.00135806.1246803602@10.7.7.3> <1246818181.00135809.1246804804@10.7.7.3> <1246825383.00135846.1246812602@10.7.7.3> <1246825385.00135854.1246814404@10.7.7.3> <1246830930.00135868.1246819202@10.7.7.3> <1246830933.00135875.1246820402@10.7.7.3> <1246908182.00136258.1246896003@10.7.7.3> <1246911786.00136277.1246900203@10.7.7.3> <1246915383.00136290.1246904409@10.7.7.3> In-Reply-To: <1246915383.00136290.1246904409@10.7.7.3> X-Enigmail-Version: 0.95.0 Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2009 13:26:34 -0000

Matthew Dillon wrote:
>  tty             da0              cpu
>  tin tout   KB/t   tps   MB/s  us ni sy in id
>    0   11   0.50 17511   8.55   0  0 15  0 85   bs=512
>    0   11   1.00 16108  15.73   0  0 12  0 87   bs=1024
>    0   11   2.00 14758  28.82   0  0 11  0 89   bs=2048
>    0   11   4.00 12195  47.64   0  0  7  0 93   bs=4096
>    0   11   8.00  8026  62.70   0  0  5  0 95   bs=8192    << MB/s breakpt
>    0   11  16.00  4018  62.78   0  0  4  0 96   bs=16384
>    0   11  32.00  2025  63.28   0  0  2  0 98   bs=32768   << id breakpt
>    0   11  64.00  1004  62.75   0  0  1  0 99   bs=65536
>    0   11 128.00   506  63.25   0  0  1  0 99   bs=131072

As I have written before, my SSD continues to improve speed up to a 512KB transaction size, and maybe further; I haven't tested beyond that.

> Random seek/read
>
>  tty             da0              cpu
>  tin tout   KB/t   tps   MB/s  us ni sy in id
>    0   11   0.50   189   0.09   0  0  0  0 100  bs=512
>    0   11   1.00   184   0.18   0  0  0  0 100  bs=1024
>    0   11   2.00   177   0.35   0  0  0  0 100  bs=2048
>    0   11   4.00   175   0.68   0  0  0  0 100  bs=4096
>    0   11   8.00   172   1.34   0  0  0  0 100  bs=8192
>    0   11  16.00   166   2.59   0  0  0  0 100  bs=16384
>    0   11  32.00   159   4.97   0  0  1  0 99   bs=32768
>    0   11  64.00   142   8.87   0  0  0  0 100  bs=65536
>    0   11 128.00   117  14.62   0  0  0  0 100  bs=131072
>                    ^^^   ^^^
>                    note TPS rate and MB/s
>
> Which is the more important tuning variable? Efficiency of linear
> reads or saving re-seeks by buffering more data? If you didn't choose
> saving re-seeks you lose.
>
> To go from 16K to 32K requires saving 5% of future re-seeks to break even.
> To go from 32K to 64K requires saving 11% of future re-seeks.
> To go from 64K to 128K requires saving 18% of future re-seeks.
> (at least with this particular disk)
>
> At the point where the block size exceeds 32768, if you aren't saving
> re-seeks with locality of reference from the additional cached data,
> you lose. If you are saving re-seeks you win. cpu caches do not enter
> into the equation at all.
>
> For most filesystems the re-seeks being saved depend on the access
> pattern. For example, if you are doing an ls -lR or a find, the re-seek
> pattern will be related to inode and directory lookups. The number of
> inodes which fit in a cluster_read(), assuming reasonable locality of
> reference, will wind up determining the performance.
>
> However, as the buffer size grows, the total number of bytes you are
> able to cache becomes the dominant factor in calculating the re-seek
> efficiency. I don't have a graph for that but, ultimately, it means
> that reading very large blocks (i.e. 1MB) with a non-linear access
> pattern is bad because most of the additional data cached will never
> be used before the memory winds up being re-used to cache some other
> cluster.

You are mixing completely different things. I was never talking about the file system block size. I do not dispute that a 16/32K file system block size may be quite effective in most cases. I was speaking about the maximum _disk_transaction_ size. It is not the same.

When the file system needs a small amount of data, or there is just a small file, there is definitely no need to read/write more than one small FS block. But when the file system predicts an effective large read-ahead, or it has a lot of write-back data, there is no reason not to transfer more contiguous blocks in one big disk transaction. Splitting it just increases the command overhead at all layers and makes it possible for the drive to be interrupted between those operations to do some very long seek.
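To illustrate the splitting overhead: a rough sketch (illustrative only, not the actual kernel code) of what every layer effectively does when a large contiguous request has to be clamped to a per-transaction limit -- each chunk pays its own command setup and completion interrupt:

	/*
	 * Illustrative sketch: a large contiguous transfer clamped to a
	 * per-transaction limit becomes ceil(len/limit) separate commands.
	 */
	#include <stdio.h>

	#define DFLTPHYS	(64 * 1024)

	static int
	issue_chunked(long long len, long long limit)
	{
		int ncmds = 0;

		while (len > 0) {
			long long chunk = len < limit ? len : limit;
			/* ...set up DMA, send command, take completion irq... */
			len -= chunk;
			ncmds++;
		}
		return (ncmds);
	}

	int
	main(void)
	{
		/* A 1MB read costs 16 commands at 64K but only 2 at 512K. */
		printf("64K limit:  %d commands\n",
		    issue_chunked(1024 * 1024, DFLTPHYS));
		printf("512K limit: %d commands\n",
		    issue_chunked(1024 * 1024, 512 * 1024));
		return (0);
	}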
-- Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Tue Jul 7 16:36:42 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AD25C106564A; Tue, 7 Jul 2009 16:36:42 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 5DA418FC0A; Tue, 7 Jul 2009 16:36:42 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.2/8.14.1) with ESMTP id n67Gagkp087661; Tue, 7 Jul 2009 09:36:42 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.2/8.13.4/Submit) id n67GagxN087660; Tue, 7 Jul 2009 09:36:42 -0700 (PDT) Date: Tue, 7 Jul 2009 09:36:42 -0700 (PDT) From: Matthew Dillon Message-Id: <200907071636.n67GagxN087660@apollo.backplane.com> To: Alexander Motin References: <1246746182.00135530.1246735202@10.7.7.3> <1246792983.00135712.1246781401@10.7.7.3> <1246796580.00135722.1246783203@10.7.7.3> <1246814582.00135806.1246803602@10.7.7.3> <1246818181.00135809.1246804804@10.7.7.3> <1246825383.00135846.1246812602@10.7.7.3> <1246825385.00135854.1246814404@10.7.7.3> <1246830930.00135868.1246819202@10.7.7.3> <1246830933.00135875.1246820402@10.7.7.3> <1246908182.00136258.1246896003@10.7.7.3> <1246911786.00136277.1246900203@10.7.7.3> <1246915383.00136290.1246904409@10.7.7.3> <4A534D05.1040709@FreeBSD.org> Cc: freebsd-arch@FreeBSD.org Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2009 16:36:43 -0000

:You are mixing completely different things. I was never talking about
:the file system block size. I do not dispute that a 16/32K file system
:block size may be quite effective in most cases. I was speaking about
:the maximum _disk_transaction_ size. It is not the same.
:
:When the file system needs a small amount of data, or there is just a
:small file, there is definitely no need to read/write more than one
:small FS block. But when the file system predicts an effective large
:read-ahead, or it has a lot of write-back data, there is no reason not
:to transfer more contiguous blocks in one big disk transaction.
:Splitting it just increases the command overhead at all layers and
:makes it possible for the drive to be interrupted between those
:operations to do some very long seek.
:--
:Alexander Motin

That isn't correct. Locality of reference for adjacent data is very important even if the filesystem only needs a small amount of data. A good example of this would be accessing the inode area in a UFS cylinder. Issuing only a single filesystem block read in the inode area is a huge loss versus issuing a cluster read of 64K (4-8 filesystem blocks), particularly if the inode is being accessed as part of a 'find' or 'ls -lR'.

I have not argued that the maximum device block size is important, I've simply argued that it is convenient. What is important, and I stressed this in my argument several times, is the total number of bytes the cluster_read() code reads when the filesystem requests a particular filesystem block.
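To make the inode example concrete, a small sketch, assuming 128-byte UFS1-style on-disk inodes (UFS2 inodes are 256 bytes; scale accordingly). One cluster read brings in several hundred neighbouring inodes, so a directory scan can resolve many lookups per physical seek:

	/*
	 * How many inodes one read brings into the cache.
	 * Assumes 128-byte (UFS1-style) on-disk inodes.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		int isize = 128;	/* assumed on-disk inode size */

		printf("one 16K fs block: %d inodes\n", 16 * 1024 / isize);
		printf("one 64K cluster:  %d inodes\n", 64 * 1024 / isize);
		return (0);
	}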
-Matt
Matthew Dillon

From owner-freebsd-arch@FreeBSD.ORG Tue Jul 7 17:10:29 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 41874106566C for ; Tue, 7 Jul 2009 17:10:29 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id 1A6FE8FC22 for ; Tue, 7 Jul 2009 17:10:28 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.2/8.14.1) with ESMTP id n67HASDN088249 for ; Tue, 7 Jul 2009 10:10:28 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.2/8.13.4/Submit) id n67HASb7088248; Tue, 7 Jul 2009 10:10:28 -0700 (PDT) Date: Tue, 7 Jul 2009 10:10:28 -0700 (PDT) From: Matthew Dillon Message-Id: <200907071710.n67HASb7088248@apollo.backplane.com> To: freebsd-arch@FreeBSD.org References: <20090707151901.GA63927@les.ath.cx> <200907071639.n67GdBD2087690@apollo.backplane.com> Cc: Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2009 17:10:29 -0000

A more insidious problem here that I think is being missed is the fact that newer filesystems are starting to use larger filesystem block sizes. I myself hit serious issues when I tried to create a UFS filesystem with a 64K basic filesystem block size a few years ago, and I hit similar issues with HAMMER, which uses 64K buffers for bulk data, which I had to fix by reincorporating code into ATA that had existed originally to break up large single-transfer requests that exceeded the chipset's DMA capability.

In the case of ATA, numerous older chips can't even do 64K due to bugs in the DMA hardware. Their maximum is actually 65024 bytes. Traditionally the cluster code enforced such limits but assumed that the basic filesystem block size would be small enough not to hit the limits. It becomes a real problem when the filesystem itself wants to use a large basic block size. In that respect, hardware which is limited to 64K has serious consequences which cascade through to the VFS layers.
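A sketch of the kind of break-up logic described (illustrative, not the real ata(4) code): the transfer is clamped to the chipset's true maximum, which is kept a multiple of the 512-byte sector size so no command ends mid-sector. A 64K buffer then becomes a 65024-byte command plus a 512-byte tail:

	/*
	 * Illustrative break-up of a 64K buffer for a chipset whose DMA
	 * engine tops out at 65024 bytes (127 sectors, i.e. < 64K).
	 */
	#include <stdio.h>

	#define SECTOR		512
	#define CHIP_DMA_MAX	65024		/* 127 sectors */

	int
	main(void)
	{
		long resid = 65536;		/* one 64K filesystem buffer */
		long max = (CHIP_DMA_MAX / SECTOR) * SECTOR;

		while (resid > 0) {
			long xfer = resid < max ? resid : max;
			printf("transfer %ld bytes\n", xfer);	/* 65024, then 512 */
			resid -= xfer;
		}
		return (0);
	}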
-Matt

From owner-freebsd-arch@FreeBSD.ORG Tue Jul 7 18:25:49 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BAEFC1065672 for ; Tue, 7 Jul 2009 18:25:49 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from cmail.optima.ua (cmail.optima.ua [195.248.191.121]) by mx1.freebsd.org (Postfix) with ESMTP id 3B55E8FC15 for ; Tue, 7 Jul 2009 18:25:48 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from [212.86.226.226] (account mav@alkar.net HELO mavbook.mavhome.dp.ua) by cmail.optima.ua (CommuniGate Pro SMTP 5.2.9) with ESMTPSA id 247938962; Tue, 07 Jul 2009 21:25:46 +0300 Message-ID: <4A53931D.6040307@FreeBSD.org> Date: Tue, 07 Jul 2009 21:25:33 +0300 From: Alexander Motin User-Agent: Thunderbird 2.0.0.21 (X11/20090405) MIME-Version: 1.0 To: Matthew Dillon References: <1246746182.00135530.1246735202@10.7.7.3> <1246792983.00135712.1246781401@10.7.7.3> <1246796580.00135722.1246783203@10.7.7.3> <1246814582.00135806.1246803602@10.7.7.3> <1246818181.00135809.1246804804@10.7.7.3> <1246825383.00135846.1246812602@10.7.7.3> <1246825385.00135854.1246814404@10.7.7.3> <1246830930.00135868.1246819202@10.7.7.3> <1246830933.00135875.1246820402@10.7.7.3> <1246908182.00136258.1246896003@10.7.7.3> <1246911786.00136277.1246900203@10.7.7.3> <1246915383.00136290.1246904409@10.7.7.3> <4A534D05.1040709@FreeBSD.org> <200907071636.n67GagxN087660@apollo.backplane.com> In-Reply-To: <200907071636.n67GagxN087660@apollo.backplane.com> Content-Type: text/plain; charset=KOI8-R; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-arch@FreeBSD.org Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2009 18:25:50 -0000

Matthew Dillon wrote:
> That isn't correct. Locality of reference for adjacent data is very
> important even if the filesystem only needs a small amount of data.

All I wanted to say is that it is the FS's privilege to decide how much data it needs. But when it really needs a lot of data, that data would be better transferred with a smaller number of bigger transactions, without a strict MAXPHYS limitation.
-- Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Tue Jul 7 19:02:14 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 21AEE106564A; Tue, 7 Jul 2009 19:02:14 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id BAB5D8FC17; Tue, 7 Jul 2009 19:02:13 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.2/8.14.1) with ESMTP id n67J2DoG090247; Tue, 7 Jul 2009 12:02:13 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.2/8.13.4/Submit) id n67J2Dcm090246; Tue, 7 Jul 2009 12:02:13 -0700 (PDT) Date: Tue, 7 Jul 2009 12:02:13 -0700 (PDT) From: Matthew Dillon Message-Id: <200907071902.n67J2Dcm090246@apollo.backplane.com> To: Alexander Motin References: <1246746182.00135530.1246735202@10.7.7.3> <1246792983.00135712.1246781401@10.7.7.3> <1246796580.00135722.1246783203@10.7.7.3> <1246814582.00135806.1246803602@10.7.7.3> <1246818181.00135809.1246804804@10.7.7.3> <1246825383.00135846.1246812602@10.7.7.3> <1246825385.00135854.1246814404@10.7.7.3> <1246830930.00135868.1246819202@10.7.7.3> <1246830933.00135875.1246820402@10.7.7.3> <1246908182.00136258.1246896003@10.7.7.3> <1246911786.00136277.1246900203@10.7.7.3> <1246915383.00136290.1246904409@10.7.7.3> <4A534D05.1040709@FreeBSD.org> <200907071636.n67GagxN087660@apollo.backplane.com> <4A53931D.6040307@FreeBSD.org> Cc: freebsd-arch@FreeBSD.org Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2009 19:02:14 -0000

:All I wanted to say is that it is the FS's privilege to decide how much
:data it needs. But when it really needs a lot of data, that data would
:be better transferred with a smaller number of bigger transactions,
:without a strict MAXPHYS limitation.
:
:--
:Alexander Motin

We are in agreement. That's essentially what I mean by all my cluster_read() comments. What matters the most is how much read-ahead the cluster code does, and how well matched the read-ahead is to reducing future transactions, and not so much anything else (such as cpu caches).

The cluster heuristics are pretty good but they do break down under certain circumstances. For example, for UFS they break down when there is file data adjacency between different inodes. That is often why one sees the KB/t sizes go down (and the TPS rate go up) when tarring up a large number of small files. Tarring up /usr/src is a good example of this. KB/t can drop all the way down to 8K and performance is noticeably degraded.

The cluster heuristic also tends to break down on the initial read() from a newly constituted vnode, because it has no prior history to work with and so does not immediately issue a read-ahead even though the I/O may end up being linear.

--

For command latency issues Julian pointed out a very interesting contrast between a HD and a (SATA) SSD. With no seek times to speak of, command overhead becomes a bigger deal when trying to maximize the performance of an SSD.
I would guess that larger DMA transactions (from the point of view of the host cpu anyhow) would be more highly desired once we start hitting bandwidth ceilings of 300 MBytes/sec for SATA II and 600 MBytes/sec beyond that. If in my example the bandwidth ceiling for a HD capable of doing 60MB/s is hit at the 8K mark, then presumably the block size needed to hit the bandwidth ceiling for a HD or SSD capable of 200MB/s, or 300MB/s, or higher, will also have to be larger: 16K, 32K, etc. This is fast approaching the 64K mark people are arguing about.

In any case, the main reason I posted is to try to correct people's assumptions on the importance of various parameters, particularly the irrelevancy of cpu caches in the bigger picture.

-Matt
Matthew Dillon

From owner-freebsd-arch@FreeBSD.ORG Tue Jul 7 21:12:44 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 24A6F1065673; Tue, 7 Jul 2009 21:12:44 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id A191A8FC25; Tue, 7 Jul 2009 21:12:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-107-120-90.carlnfd1.nsw.optusnet.com.au [122.107.120.90]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id n67LCYbd024674 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 8 Jul 2009 07:12:36 +1000 Date: Wed, 8 Jul 2009 07:12:34 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Matthew Dillon In-Reply-To: <200907071902.n67J2Dcm090246@apollo.backplane.com> Message-ID: <20090708062346.G1555@besplex.bde.org> References: <1246746182.00135530.1246735202@10.7.7.3> <1246792983.00135712.1246781401@10.7.7.3> <1246796580.00135722.1246783203@10.7.7.3> <1246814582.00135806.1246803602@10.7.7.3> <1246818181.00135809.1246804804@10.7.7.3> <1246825383.00135846.1246812602@10.7.7.3> <1246825385.00135854.1246814404@10.7.7.3> <1246830930.00135868.1246819202@10.7.7.3> <1246830933.00135875.1246820402@10.7.7.3> <1246908182.00136258.1246896003@10.7.7.3> <1246911786.00136277.1246900203@10.7.7.3> <1246915383.00136290.1246904409@10.7.7.3> <4A534D05.1040709@FreeBSD.org> <200907071636.n67GagxN087660@apollo.backplane.com> <4A53931D.6040307@FreeBSD.org> <200907071902.n67J2Dcm090246@apollo.backplane.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Alexander Motin , freebsd-arch@freebsd.org Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2009 21:12:44 -0000

On Tue, 7 Jul 2009, Matthew Dillon wrote:

> :All I wanted to say is that it is the FS's privilege to decide how much
> :data it needs. But when it really needs a lot of data, that data would
> :be better transferred with a smaller number of bigger transactions,
> :without a strict MAXPHYS limitation.
> :
> :--
> :Alexander Motin
>
> We are in agreement. That's essentially what I mean by all my
> cluster_read() comments.

I did not disagree. One of my points is that fs's are currently limited by MAXPHYS and that simply increasing MAXPHYS isn't free.
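One concrete cost, as a rough sketch with assumed values (this is an illustration, not a measurement from the thread): each pbuf used for physical I/O reserves MAXPHYS bytes of kernel virtual address space, so the reservation grows directly with the constant. Assuming nswbuf at its usual cap of 256:

	/*
	 * Rough sketch of the pbuf KVA reservation, nswbuf * MAXPHYS.
	 * nswbuf = 256 and the MAXPHYS candidates are assumed values.
	 */
	#include <stdio.h>

	int
	main(void)
	{
		int nswbuf = 256;			/* assumed cap */
		long maxphys[] = { 128 * 1024, 512 * 1024, 1024 * 1024 };
		int i;

		for (i = 0; i < 3; i++)
			printf("MAXPHYS %4ldK -> %3ld MB of pbuf KVA\n",
			    maxphys[i] / 1024,
			    (long)nswbuf * maxphys[i] / (1024 * 1024));
		return (0);
	}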
> What matters the most is how much read-ahead
> the cluster code does, and how well matched the read-ahead is to
> reducing future transactions, and not so much anything else (such as
> cpu caches).

I will disagree with most of this:
- the amount of read-ahead/clustering is not very important. fs's already depend on the drive doing significant buffering, so that when the fs gets things and seeks around a lot, not all the seeks are physical. Locality is much more important.
- cpu caches are already of minor importance and will become more important as drives become faster.

> The cluster heuristics are pretty good but they do break down under
> certain circumstances. For example, for UFS they break down when there
> is file data adjacency between different inodes. That is often why one
> sees the KB/t sizes go down (and the TPS rate go up) when tarring up a
> large number of small files. Tarring up /usr/src is a good example of
> this. KB/t can drop all the way down to 8K and performance is noticeably
> degraded.

At least for ffs in FreeBSD, this is mostly locality, not clustering. Tarring up /usr/src to test optimizations of locality is one of my favourite benchmarks. Since ffs does no inter-file or inode clustering, the average i/o size is smaller than the average file size. Since files in /usr/src are small, you are lucky if the average i/o size is 8K (the average file size is actually between 8K and 16K). Since the ffs block size is larger than the file size, most file data fits in a single block and clustering has no effect.

(But I also like to optimize and test file systems with a small block size. Clustering makes a big difference for msdosfs with a block size of 512, and in this benchmark, after my optimizations, msdosfs with a block size of 512 is slightly faster than unoptimized ffs with a block size of 16K. The smaller block size just takes more CPU. msdosfs is fundamentally faster than ffs for small files since it has better locality (no inodes, and better locality for the FAT than for indirect blocks).)

> The cluster heuristic also tends to break down on the initial read() from
> a newly constituted vnode, because it has no prior history to work with
> and so does not immediately issue a read-ahead even though the I/O may
> end up being linear.

This is harmful for random file access, but for tarring up /usr/src there is a good chance that file locality (in directory traversal order) combined with read-ahead in the drive will compensate for this.

> For command latency issues Julian pointed out a very interesting contrast
> between a HD and a (SATA) SSD. With no seek times to speak of, command
> overhead becomes a bigger deal when trying to maximize the performance
> of an SSD. I would guess that larger DMA transactions (from the point of
> view of the host cpu anyhow) would be more highly desired once we start
> hitting bandwidth ceilings of 300 MBytes/sec for SATA II and
> 600 MBytes/sec beyond that.

It is actually already a problem (the problem of this thread). Even at 50MB/S, I see some slowness due to command latency (I see increased CPU, but that is similar to latency in the context of this thread). Alexander has 200MB/S disks so he sees larger problems.

My CPU overhead (on a ~2GHz CPU) is about 50 uS/block. With 64K-blocks at 50MB/S, this gives a CPU overhead of 40 mS/S or 4%. Not significant. With 16K-blocks at 50MB/S, this gives a CPU overhead of 16%. This is becoming significant. At 200MB/S, the overhead would be 16% even for 64K-blocks.
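Spelled out, with the same assumptions as above (50 uS of CPU per command, binary megabytes): the overhead fraction is simply (rate / blocksize) * cost-per-block.

	/*
	 * Per-command CPU overhead: blocks/sec * 50 uS each.
	 * Reproduces the 4% / 16% / 16% figures above.
	 */
	#include <stdio.h>

	static double
	overhead_pct(double mb_per_s, double bs, double us_per_block)
	{
		double blocks_per_s = mb_per_s * 1048576 / bs;

		return (blocks_per_s * us_per_block / 1e6 * 100.0);
	}

	int
	main(void)
	{
		printf("50MB/S, 64K blocks:  %.0f%%\n", overhead_pct(50, 65536, 50));
		printf("50MB/S, 16K blocks:  %.0f%%\n", overhead_pct(50, 16384, 50));
		printf("200MB/S, 64K blocks: %.0f%%\n", overhead_pct(200, 65536, 50));
		return (0);
	}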
Alexander reported savings of 10-15% using 512K-blocks. This is consistent.

> If in my example the bandwidth ceiling for a HD capable of doing 60MB/s
> is hit at the 8K mark, then presumably the block size needed to hit the
> bandwidth ceiling for a HD or SSD capable of 200MB/s, or 300MB/s, or
> higher, will also have to be larger: 16K, 32K, etc. This is fast
> approaching the 64K mark people are arguing about.

I thought we were arguing about the 512K and 1M marks :-). I haven't been worrying about command latency and didn't notice that we were discussing an SSD before. At hundreds of MB/S, or for zero-latency hardware, the command overhead becomes a limiting factor for throughput.

> In any case, the main reason I posted is to try to correct people's
> assumptions on the importance of various parameters, particularly the
> irrelevancy of cpu caches in the bigger picture.

My examples show that the CPU cache can be relevant even with a 50MB/S disk. With faster disks it becomes even more relevant. It is hard to keep up with 200MB/S, and harder if you double the number of cache misses using large buffers.

Bruce

From owner-freebsd-arch@FreeBSD.ORG Tue Jul 7 22:15:36 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DBACA1065673; Tue, 7 Jul 2009 22:15:36 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id A57E78FC12; Tue, 7 Jul 2009 22:15:36 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.2/8.14.1) with ESMTP id n67MFZPS092097; Tue, 7 Jul 2009 15:15:35 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.2/8.13.4/Submit) id n67MFZeM092096; Tue, 7 Jul 2009 15:15:35 -0700 (PDT) Date: Tue, 7 Jul 2009 15:15:35 -0700 (PDT) From: Matthew Dillon Message-Id: <200907072215.n67MFZeM092096@apollo.backplane.com> To: Bruce Evans References: <1246746182.00135530.1246735202@10.7.7.3> <1246792983.00135712.1246781401@10.7.7.3> <1246796580.00135722.1246783203@10.7.7.3> <1246814582.00135806.1246803602@10.7.7.3> <1246818181.00135809.1246804804@10.7.7.3> <1246825383.00135846.1246812602@10.7.7.3> <1246825385.00135854.1246814404@10.7.7.3> <1246830930.00135868.1246819202@10.7.7.3> <1246830933.00135875.1246820402@10.7.7.3> <1246908182.00136258.1246896003@10.7.7.3> <1246911786.00136277.1246900203@10.7.7.3> <1246915383.00136290.1246904409@10.7.7.3> <4A534D05.1040709@FreeBSD.org> <200907071636.n67GagxN087660@apollo.backplane.com> <4A53931D.6040307@FreeBSD.org> <200907071902.n67J2Dcm090246@apollo.backplane.com> <20090708062346.G1555@besplex.bde.org> Cc: Alexander Motin , freebsd-arch@freebsd.org Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jul 2009 22:15:37 -0000

:I will disagree with most of this:
:- the amount of read-ahead/clustering is not very important. fs's already
:  depend on the drive doing significant buffering, so that when the fs gets
:  things and seeks around a lot, not all the seeks are physical. Locality
:  is much more important.
Yes, I agree with you there to a point, but drive cache performance tails off very quickly if things are not exactly sequential in each zone being read, and it is fairly difficult to achieve exact sequentiality in the filesystem layout. Also, command latency really starts to interfere if you have to go to the drive every few name lookups / stats / whatever, since those operations only take a few microseconds if the data is sitting in the buffer cache, even if it's just going to the HD's on-drive cache.

The cluster code fixes both the command latency issue and the problem of slight non-sequentialities in the access pattern (in each zone being seek-read). Without it, performance numbers will wind up being all over the board. That makes it fairly important.

I got a nifty program to test that.

    fetch http://apollo.backplane.com/DFlyMisc/zoneread.c
    cc ...
    (^C to stop test, use iostat to see the results)
    ./zr /dev/da0 16 16 1024 1
    ./zr /dev/da0 16 16 1024 2
    ./zr /dev/da0 16 16 1024 3
    ./zr /dev/da0 16 16 1024 4

If you play with it you will find that most drives can track around 16 zones with 100% sequential forward reads in each zone. Any other access pattern severely degrades performance. For example, if you read the data in reverse you can kiss goodbye to performance. If you introduce slight non-linearities in the access pattern, even though the seeks are within 16-32K of each other, performance degrades very rapidly.

This is what I mean by drives not doing sane caching. It was ok with smaller drives, where the non-linearities were hitting up against the need to do an actual head seek, but the drive caches in today's huge drives are just not tuned very well. UFS does have a bit of an advantage here, but HAMMER does a fairly good job too. The problem HAMMER has is with its initial layout due to B-Tree node splits (which mess up linearity in the B-Tree). Once the reblocker cleans up the B-Tree, performance is recovered. The B-Tree is the biggest problem, but I can't fix the initial layout without making incompatible media changes, so I'm holding off on doing it for now.
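The access pattern zr generates looks roughly like this (a simplified sketch, not the actual zoneread.c; the zone count, block size and zone spacing are placeholder values): N zones spread across the device, each read strictly sequentially forward, round-robin one block per zone per pass. Run it until interrupted, like zr, and watch iostat.

	/*
	 * Simplified sketch of the zr access pattern: 16 zones, each
	 * read 100% sequentially forward, one block per zone per pass.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int
	main(int argc, char **argv)
	{
		int nzones = 16, bs = 16384, i, fd;
		off_t zonelen = 1024LL * 1024 * 1024;	/* zone spacing */
		off_t done = 0;
		char *buf = malloc(bs);

		if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
			fprintf(stderr, "usage: zrsketch <device>\n");
			return (1);
		}
		for (;;) {
			for (i = 0; i < nzones; i++)	/* one block per zone */
				if (pread(fd, buf, bs, i * zonelen + done) != bs)
					return (1);
			done += bs;		/* strictly forward in each zone */
		}
	}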
-Matt
From owner-freebsd-arch@FreeBSD.ORG Thu Jul 9 21:54:13 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A7C81106566C for ; Thu, 9 Jul 2009 21:54:13 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.freebsd.org (Postfix) with ESMTP id 2E92A8FC1B for ; Thu, 9 Jul 2009 21:54:12 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from list by ciao.gmane.org with local (Exim 4.43) id 1MP1ZD-0005IA-OG for freebsd-arch@freebsd.org; Thu, 09 Jul 2009 21:54:11 +0000 Received: from 93-138-117-98.adsl.net.t-com.hr ([93.138.117.98]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 09 Jul 2009 21:54:11 +0000 Received: from ivoras by 93-138-117-98.adsl.net.t-com.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 09 Jul 2009 21:54:11 +0000 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: Ivan Voras Date: Thu, 09 Jul 2009 23:53:55 +0200 Lines: 59 Message-ID: References: <4A4FAA2D.3020409@FreeBSD.org> <20090705100044.4053e2f9@ernst.jennejohn.org> <4A50667F.7080608@FreeBSD.org> <20090705223126.I42918@delplex.bde.org> <4A50BA9A.9080005@FreeBSD.org> <20090706005851.L1439@besplex.bde.org> <4A50DEE8.6080406@FreeBSD.org> <20090706034250.C2240@besplex.bde.org> <4A50F619.4020101@FreeBSD.org> Mime-Version: 1.0 X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: 93-138-117-98.adsl.net.t-com.hr User-Agent: Thunderbird 2.0.0.22 (Windows/20090605) In-Reply-To: X-Enigmail-Version: 0.95.7 Sender: news Subject: Re: DFLTPHYS vs MAXPHYS X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 09 Jul 2009 21:54:14 -0000
Adrian Chadd wrote:
> 2009/7/6 Alexander Motin :
>
>> In these tests you've got almost only the negative side of the effect,
>> as you have said, due to cache misses. Do you really have a CPU with
>> such a small L2 cache? Some kind of P3 or old Celeron? But with 64K
>> MAXPHYS you just didn't get any benefit from using a bigger block size.
>
> All the world isn't your current desktop box with only SATA devices :)
>
> There have been and will be plenty of little embedded CPUs with tiny
> amounts of cache for quite some time to come.

Yes, and no embedded developer will use the GENERIC kernel on his device, so we can, for this purpose, ignore them :)

> You're also doing simple stream IO tests. Please re-think the thought
> experiment with a whole lot of parallel IO going on rather than just
> straight single stream IO.

Also, one thing to remember is RAID, both hardware and software. For example, with a gstripe of two drives it's very visible how sharply the performance falls if you go from 32 kB stripes to 64 kB stripes, since the upper layer passes 64 kB requests to GEOM. GEOM will pass the request to gstripe, which will in the first case request 32 kB from each drive (faster) and in the second case only 64 kB from one of the drives (no performance gain from striping). (Please adjust for 32/64 -> 64/128 if appropriate; I don't have the raw numbers now.)

Of course it's not a reason as-is, but both Windows and Linux have 1 MB BIO buffers, so it's reasonable to assume that vendors will optimize for that size if they can.
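The stripe arithmetic is easy to sketch (illustrative only, not the actual gstripe code): with 32 kB stripes an aligned 64 kB request spans both disks and they work in parallel, while with 64 kB stripes the whole request lands on one disk.

	/*
	 * Illustrative stripe arithmetic: how many disks a contiguous,
	 * stripe-aligned request touches for a given stripe size.
	 */
	#include <stdio.h>

	static int
	disks_touched(long long off, long long len, long long stripe, int ndisks)
	{
		long long nstripes = (off + len - 1) / stripe - off / stripe + 1;

		/* consecutive stripes map to consecutive disks, wrapping */
		return (nstripes < ndisks ? (int)nstripes : ndisks);
	}

	int
	main(void)
	{
		printf("32 kB stripes: 64 kB request hits %d disk(s)\n",
		    disks_touched(0, 65536, 32768, 2));
		printf("64 kB stripes: 64 kB request hits %d disk(s)\n",
		    disks_touched(0, 65536, 65536, 2));
		return (0);
	}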