Date: Tue, 4 Oct 2005 10:48:48 +1000 (EST)
From: Bruce Evans
To: Tulio Guimarães da Silva
Cc: freebsd-performance@FreeBSD.org
Subject: Re: dd(1) performance when copiing a disk to another
Message-ID: <20051004075806.F45947@delplex.bde.org>
In-Reply-To: <434146CA.8010803@pgt.mpt.gov.br>
References: <20051002170446.78674.qmail@web30303.mail.mud.yahoo.com>
    <004701c5c77e$a8ab4310$b3db87d4@multiplay.co.uk>
    <434146CA.8010803@pgt.mpt.gov.br>

On Mon, 3 Oct 2005, Tulio Guimarães da Silva wrote:

> But just to clear out some questions...
> 1) Maxtor´s full specifications for Diamond Max+ 9 Series refers to maximum
> *sustained* transfer rates of 37MB/s and 67MB/s for "ID" and "OD",
> respectively (though I couldn´d find exactly what it means, I deduced that
> represents the rates for center- and border-parts of the disk - please
> correct me if I´m wrong), then your tests show you´re getting the best out of
> it ;) ;
> much slower.

Another interesting point is that you can often get closer to the
maximum rate than the average of the maximum and minimum rate. The
outer tracks contain more sectors (about 67/37 times as many with the
above spec), so the average rate over all sectors is larger than the
average of the max and min, significantly so since 67/37 is a fairly
large fraction. Also, you can often partition disks to put less-often
accessed stuff in the slow parts.

> One last thought, though, for the specialists: iostat showed maximum of
> 128KB/transfer, even though dd should be using 1MB blocks... is that an
> expected behaviour? Shouldn´t iostat show 1024Kb/t, then?

The expected size is 64K. 128KB is due to a bug in GEOM, one that was
fixed a couple of days ago by tegge@. iostat shows the size that reaches
the disk driver. The best size to show is the size that reaches the
disk hardware, but several layers of abstraction, some excessive, make
it impossible to show that size:

First there is the disk firmware layer above the disk hardware layer.
There is no way for the driver to know exactly what the firmware layer
is doing.
Good firmware will cluster i/o's and otherwise cache things to minimize
seeks and other disk accesses, in much the same way that a good OS will
do, but hopefully better because it can understand the hardware better
and use more specialized algorithms.

Next there is the driver layer. Drivers shouldn't split up i/o, but
some at least used to, and they now cannot report such splitting to
devstat. I can't see any splitting in the ad driver now -- I can only
see reduction of the max size from 255 to 128 sectors in the non-DMA
case, and the misnamed struct member atadev->max_iosize in this case
(this actually gives the max transfer size; in the DMA case, the max
transfer size is the same as the max i/o size, but in the non-DMA case
it is the number of sectors transferred per interrupt which is usually
much smaller than the max i/o size of DFLTPHYS = 64K).

The fd driver at least used to split up i/o into single sectors. 20-25
years ago when CPUs were slow even compared with floppies, this used to
be a good way to pessimize i/o. A few years later, starting with about
386's, CPUs became fast enough to easily generate new requests in the
sector gap time so even poorly written fd drivers could keep floppies
streaming except across seeks to another track. The fd driver never
reported this internal splitting to devstat, and maybe never should
have since it is close enough to the hardware to know that this
splitting is normal and/or doesn't affect efficiency.

Next there is the GEOM layer. It splits up i/o's requested by the next
layer up according to the max size advertised by the driver. The latter
is typically DFLTPHYS = 64K and often unrelated to the hardware;
MAXPHYS = 128K would be better if the hardware can handle it. Until a
couple of days ago, reporting of this splitting was broken. GEOM
reported to devstat the size passed to it and not the size that it
passed to drivers. tegge@ fixed this.

For writes to raw disks, the next layer up is physread().
(Other cases are even more complicated :-).) physread() splits up i/o's
into blocks of max size dev->si_iosize_max. This splitting is wrong for
tape-like devices but is almost harmless for disk-like devices. Another
bug in GEOM is bitrot in the setting of dev->si_iosize_max. This should
normally be the same as the driver max size, and used to be set to the
same in individual drivers in many cases including the ad driver, but
now most drivers don't set it and GEOM normally defaults it to the
bogus value MAXPHYS = 128K. physread() also defaults it, but to the
different, safer, value DFLTPHYS = 64K. The different max sizes cause
excessive splitting. See below for examples.

For writes by dd, there are a few more layers (driver read, devfs read,
and write(2) at least).

So for writes of 1M from dd to an ad device with DMA enabled and the
normal DMA size of 64K, the following reblocking occurs:

    1M is split into 8*128K by physio() since dev->si_iosize_max is 128K
    8*128K is split into 16*64K by GEOM since dp->d_maxsize is mismatched (64K)

dp->max_size is 63K for a couple of controllers in the DMA case and
possibly always for the acd driver (see the magic 65534 in atapi-cd.c).
Then the bogus splitting is more harmful:

    1M is split into 8*128K by physio() (no difference)
    8*128K is split into 8 * (2*63K + 1*2K) by GEOM

The 1*2K splitting is especially pessimal. The afd driver used to have
this bug internally, and still has it in RELENG_4. Its max i/o (DMA)
size was 32K for ZIP disks that seem to be IOMEGA ones and 126K for
other drives. dd'ing to ZIP drives was fast enough if you used a size
smaller than the max i/o size (but not very small), or with nice power
of 2 sizes for disks that seem to be IOMEGA ones, but a nice size of
128K caused the following bad splitting for non-IOMEGA ones:
128K = 1*126K + 1*2K. Since accesses to ZIP disks take about 20 msec
per access, the 2K-block almost halved the transfer speed.
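
The reblocking arithmetic in the examples above can be checked with a
few lines of Python. This is only a sketch, assuming each layer simply
chops a request into pieces no larger than its own limit; the function
name split() and the use of KB units are made up for illustration and
are not kernel code:

```python
def split(sizes_kb, max_kb):
    """Chop each request (in KB) into pieces of at most max_kb."""
    out = []
    for size in sizes_kb:
        while size > 0:
            piece = min(size, max_kb)
            out.append(piece)
            size -= piece
    return out

# dd bs=1m: first the 128K limit (si_iosize_max), then the driver limit.
after_physio = split([1024], 128)
for driver_max in (64, 63):
    pieces = split(after_physio, driver_max)
    print(driver_max, len(pieces), sorted(set(pieces)))

# The ZIP case: one 128K request against a 126K limit.
print(split([128], 126))
```

With a 64K driver limit every piece comes out the same size; with 63K
or 126K each 128K request leaves a 2K tail.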
The normal ata DMA size of 64*1024 is also too magic -- it just happens
to equal DFLTPHYS so it only causes 1 bogus splitting in combination
with the other bugs.

For writes by dd, these bugs are easy to avoid if you know about them
or if you just fear them and test all reasonable block sizes to find
the best one. Just use a block size large enough to be efficient but
small enough to not cause splitting, or in cases where the mismatches
are only off by a factor of 2^n, large enough to cause even splitting.

For cases other than writes by dd, the bugs cause pessimal splitting.
E.g., file system clustering uses yet another bogusly initialized max
i/o size, vp->v_mount->mnt_iosize_max. This defaults to DFLTPHYS = 64K
in the top vfs layer, but many file systems, including ffs, set it to
devvp->v_rdev->si_iosize_max, so it is normally set to the wrong
default set for the latter by GEOM, MAXPHYS = 128K. This normally
causes excessive splitting which is especially harmful if the driver's
max is not a divisor of MAXPHYS. E.g., when the driver's max is 63K,
writing a 256KB file to an ffs file system with the default fs-block
size of 16K causes the following bogus splitting even if ffs allocates
all the blocks optimally (contiguously):

    At ffs level:
        12*16K (direct data blocks)
        1*16K (indirect block; but ffs usually gets this wrong and doesn't
               allocate it contiguously)
        4*16K (data blocks indirected through the indirect block)
    At clustering level:
        17*16K reblocked to 2*128K + 1*16K
    At device driver level:
        2*128K + 1*16K split into 63K, 63K, 2K, 63K, 63K, 2K, 16K

So the splitting undoes almost half of the gathering done by the
clustering level (we start with 17 blocks and end with 7). Ideally we
would end with 5 (4*63K + 1*20K).

Caching in not-very-old drives (but not ZIP or CD/DVD ones) makes
stupid blocking not very harmful for reads, but doesn't help so much
for writes.

Bruce
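
The block counts in the ffs example can be reproduced the same way.
This is a sketch only: cluster() and split() are hypothetical helpers,
sizes are in KB, and all 17 blocks are assumed to be contiguous (which,
as noted above, ffs usually does not manage for the indirect block):

```python
def cluster(blocks_kb, max_kb):
    """Greedily merge contiguous blocks into requests of at most max_kb."""
    out, cur = [], 0
    for b in blocks_kb:
        if cur and cur + b > max_kb:
            out.append(cur)
            cur = 0
        cur += b
    if cur:
        out.append(cur)
    return out

def split(sizes_kb, max_kb):
    """Chop each request (in KB) into pieces of at most max_kb."""
    out = []
    for size in sizes_kb:
        while size > 0:
            piece = min(size, max_kb)
            out.append(piece)
            size -= piece
    return out

blocks = [16] * 17                 # 12 direct + 1 indirect + 4 indirected
clustered = cluster(blocks, 128)   # clustering limit assumed to be MAXPHYS
actual = split(clustered, 63)      # driver limit of 63K
ideal = split([sum(blocks)], 63)   # if nothing above the driver had split
print(clustered, actual, ideal)
```

Comparing len(actual) with len(ideal) shows the cost of the mismatched
limits: 7 requests instead of 5 for the same 272K.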