From owner-freebsd-performance@FreeBSD.ORG  Tue Oct  4 00:48:55 2005
Return-Path: <owner-freebsd-performance@FreeBSD.ORG>
X-Original-To: freebsd-performance@FreeBSD.org
Delivered-To: freebsd-performance@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D390C16A420
	for <freebsd-performance@FreeBSD.org>;
	Tue,  4 Oct 2005 00:48:55 +0000 (GMT) (envelope-from bde@zeta.org.au)
Received: from mailout2.pacific.net.au (mailout2.pacific.net.au [61.8.0.115])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 2E46C43D45
	for <freebsd-performance@FreeBSD.org>;
	Tue,  4 Oct 2005 00:48:55 +0000 (GMT) (envelope-from bde@zeta.org.au)
Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au
	[61.8.0.86])
	by mailout2.pacific.net.au (8.13.4/8.13.4/Debian-3) with ESMTP id
	j940mqGb006413; Tue, 4 Oct 2005 10:48:52 +1000
Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246])
	by mailproxy1.pacific.net.au (8.13.4/8.13.4/Debian-3) with ESMTP id
	j940mmPD019518; Tue, 4 Oct 2005 10:48:50 +1000
Date: Tue, 4 Oct 2005 10:48:48 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@delplex.bde.org
To: =?ISO-8859-1?Q?Tulio_Guimar=E3es_da_Silva?= <tuliogs@pgt.mpt.gov.br>
In-Reply-To: <434146CA.8010803@pgt.mpt.gov.br>
Message-ID: <20051004075806.F45947@delplex.bde.org>
References: <20051002170446.78674.qmail@web30303.mail.mud.yahoo.com>
	<004701c5c77e$a8ab4310$b3db87d4@multiplay.co.uk>
	<434146CA.8010803@pgt.mpt.gov.br>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="0-580949596-1128386928=:45947"
Cc: freebsd-performance@FreeBSD.org
Subject: Re: dd(1) performance when copiing a disk to another
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning <freebsd-performance.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-performance>
List-Post: <mailto:freebsd-performance@freebsd.org>
List-Help: <mailto:freebsd-performance-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-performance>,
	<mailto:freebsd-performance-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 04 Oct 2005 00:48:56 -0000

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--0-580949596-1128386928=:45947
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Mon, 3 Oct 2005, Tulio Guimarães da Silva wrote:

> But just to clear out some questions...
> 1) Maxtor's full specifications for Diamond Max+ 9 Series refers to maximum
> *sustained* transfer rates of 37MB/s and 67MB/s for "ID" and "OD",
> respectively (though I couldn't find exactly what it means, I deduced that
> represents the rates for center- and border-parts of the disk - please
> correct me if I'm wrong), then your tests show you're getting the best out of
> it ;) ;
> much slower.

Another interesting point is that you can often get closer to the maximum
rate than the average of the maximum and minimum rates.  The outer tracks
contain more sectors (about 67/37 times as many with the above spec), so
the average rate over all sectors is larger than the average of the max and
min, significantly so since 67/37 is a fairly large ratio.  Also, you can
often partition disks to put less-often accessed stuff in the slow parts.

> One last thought, though, for the specialists: iostat showed maximum of
> 128KB/transfer, even though dd should be using 1MB blocks... is that an
> expected behaviour? Shouldn't iostat show 1024KB/t, then?

The expected size is 64K.  128KB is due to a bug in GEOM, one that was
fixed a couple of days ago by tegge@.

iostat shows the size that reaches the disk driver.  The best size to
show is the size that reaches the disk hardware, but several layers
of abstraction, some excessive, make it impossible to show that size:

First there is the disk firmware layer above the disk hardware layer.
There is no way for the driver to know exactly what the firmware layer
is doing.  Good firmware will cluster i/o's and otherwise cache things
to minimize seeks and other disk accesses, in much the same way that
a good OS will do, but hopefully better because it can understand the
hardware better and use more specialized algorithms.

Next there is the driver layer.  Drivers shouldn't split up i/o, but
some at least used to, and they now cannot report such splitting to
devstat.  I can't see any splitting in the ad driver now -- I can only
see reduction of the max size from 255 to 128 sectors in the non-DMA
case, and the misnamed struct member atadev->max_iosize in this case
(this actually gives the max transfer size; in the DMA case, the max
transfer size is the same as the max i/o size, but in the non-DMA case
it is the number of sectors transferred per interrupt which is usually
much smaller than the max i/o size of DFLTPHYS = 64K).  The fd driver
at least used to split up i/o into single sectors.  20-25 years ago
when CPUs were slow even compared with floppies, this used to be a
good way to pessimize i/o.  A few years later, starting with about
386's, CPUs became fast enough to easily generate new requests in the
sector gap time so even poorly written fd drivers could keep floppies
streaming except across seeks to another track.  The fd driver never
reported this internal splitting to devstat, and maybe never should
have since it is close enough to the hardware to know that this splitting
is normal and/or doesn't affect efficiency.

Next there is the GEOM layer.  It splits up i/o's requested by the
next layer up according to the max size advertised by the driver.  The
latter is typically DFLTPHYS = 64K and often unrelated to the hardware;
MAXPHYS = 128K would be better if the hardware can handle it.  Until
a couple of days ago, reporting of this splitting was broken.  GEOM
reported to devstat the size passed to it and not the size that it
passed to drivers.  tegge@ fixed this.

For writes to raw disks, the next layer up is physio().  (Other cases
are even more complicated :-).)  physio() splits up i/o's into blocks
of max size dev->si_iosize_max.  This splitting is wrong for tape-like
devices but is almost harmless for disk-like devices.  Another bug in
GEOM is bitrot in the setting of dev->si_iosize_max.  This should
normally be the same as the driver max size, and used to be set to the
same value in individual drivers in many cases including the ad driver,
but now most drivers don't set it and GEOM normally defaults it to
the bogus value MAXPHYS = 128K.  physio() also defaults it, but to
the different, safer, value DFLTPHYS = 64K.  The different max sizes
cause excessive splitting.  See below for examples.

For writes by dd, there are a few more layers (driver write, devfs write,
and write(2) at least).

So for writes of 1M from dd to an ad device with DMA enabled and the
normal DMA size of 64K, the following reblocking occurs:

     1M is split into 8*128K by physio() since dev->si_iosize_max is 128K
     8*128K is split into 16*64K by GEOM since dp->d_maxsize is mismatched (64K)

dp->d_maxsize is 63K for a couple of controllers in the DMA case and
possibly always for the acd driver (see the magic 65534 in atapi-cd.c).
Then the bogus splitting is more harmful:

     1M is split into 8*128K by physio() (no difference)
     8*128K is split into 8 * (2*63K + 1*2K) by GEOM
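
The two reblockings above can be reproduced with a small model.  This is
only a sketch: the split() and reblock() helpers are hypothetical names, not
kernel functions, and each layer is assumed to do nothing except cut a
request into pieces no larger than its own max size.

```python
KB = 1024

def split(size, max_size):
    """Split one request into chunks of at most max_size bytes."""
    chunks = [max_size] * (size // max_size)
    if size % max_size:
        chunks.append(size % max_size)
    return chunks

def reblock(size, layer_maxes):
    """Pass a request through successive layers, each with its own max size."""
    requests = [size]
    for max_size in layer_maxes:
        requests = [c for r in requests for c in split(r, max_size)]
    return requests

def summarize(chunks):
    """Return a {chunk size in KB: count} summary, in order of first use."""
    counts = {}
    for c in chunks:
        counts[c // KB] = counts.get(c // KB, 0) + 1
    return counts

# 1M write: physio() max of 128K, then GEOM with a driver max of 64K or 63K.
print(summarize(reblock(1024 * KB, [128 * KB, 64 * KB])))  # {64: 16}
print(summarize(reblock(1024 * KB, [128 * KB, 63 * KB])))  # {63: 16, 2: 8}
```

With a 63K driver max, a third of the 24 requests that reach the driver are
2K leftovers.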

The 1*2K splitting is especially pessimal.  The afd driver used to have
this bug internally, and still has it in RELENG_4.  Its max i/o (DMA)
size was 32K for ZIP disks that seem to be IOMEGA ones and 126K for
other drives.  dd'ing to ZIP drives was fast enough if you used a size
smaller than the max i/o size (but not very small), or with nice power
of 2 sizes for disks that seem to be IOMEGA ones, but a nice size of
128K caused the following bad splitting for non-IOMEGA ones:
128K = 1*126K + 1*2K.  Since accesses to ZIP disks take about 20 msec
per access, the 2K-block almost halved the transfer speed.
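
A rough check of the "almost halved" figure, as a sketch only.  It assumes
every access costs the same fixed 20 msec regardless of its size, which is
a simplification of the above and not a measurement; throughput() is a
hypothetical helper.

```python
ACCESS_TIME = 0.020  # seconds per access, the rough figure quoted above

def throughput(chunks_kb):
    """KB/s when every chunk costs one fixed-time access (an assumption)."""
    return sum(chunks_kb) / (len(chunks_kb) * ACCESS_TIME)

unsplit = throughput([126])       # one 126K access per request
split = throughput([126, 2])      # a 128K request split into 126K + 2K

print(f"126K requests: {unsplit:.0f} KB/s")
print(f"128K requests: {split:.0f} KB/s ({split / unsplit:.0%} of the above)")
```

Under that assumption the extra 2K access doubles the time per request
while adding under 2% more data, so the rate drops to about half.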

The normal ata DMA size of 64*1024 is also too magic -- it just happens
to equal DFLTPHYS so it only causes 1 bogus splitting in combination
with the other bugs.

For writes by dd, these bugs are easy to avoid if you know about them or
if you just fear them and test all reasonable block sizes to find the best
one.  Just use a block size large enough to be efficient but small enough
to not cause splitting, or in cases where the mismatches are only off by a
factor of 2^n, large enough to cause even splitting.
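
One way to screen candidate block sizes before timing them with dd is
sketched below.  The layer maxes (128K for physio(), then 63K or 64K for
the driver) are taken from the examples in this message; piece_sizes() and
classify() are hypothetical helpers, and the model ignores everything
except size-based splitting.

```python
KB = 1024

def piece_sizes(size, layer_maxes):
    """Set of distinct piece sizes left after every layer has split."""
    pieces = {size}
    for max_size in layer_maxes:
        after = set()
        for p in pieces:
            if p >= max_size:
                after.add(max_size)       # one or more full-size pieces
            if p % max_size:
                after.add(p % max_size)   # a leftover (or an unsplit piece)
        pieces = after
    return pieces

def classify(bs, layer_maxes):
    """Label a block size as unsplit, evenly split, or unevenly split."""
    pieces = piece_sizes(bs, layer_maxes)
    if pieces == {bs}:
        return "no splitting"
    if len(pieces) == 1:
        return "even splitting"
    return f"uneven (smallest piece {min(pieces) // KB}K)"

for driver_max in (64 * KB, 63 * KB):
    print(f"driver max {driver_max // KB}K:")
    for bs in (32 * KB, 63 * KB, 64 * KB, 126 * KB, 128 * KB, 1024 * KB):
        label = classify(bs, [128 * KB, driver_max])
        print(f"  bs={bs // KB}K: {label}")
```

This only narrows the list; the sizes that survive still need to be timed
on the real device.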

For cases other than writes by dd, the bugs cause pessimal splitting.
E.g., file system clustering uses yet another bogusly initialized max
i/o size, vp->v_mount->mnt_iosize_max.  This defaults to DFLTPHYS =
64K in the top vfs layer, but many file systems, including ffs, set
it to devvp->v_rdev->si_iosize_max, so it is normally set to the wrong
default set for the latter by GEOM, MAXPHYS = 128K.  This normally
causes excessive splitting which is especially harmful if the driver's
max is not a divisor of MAXPHYS.  E.g., when the driver's max is 63K,
writing a 256KB file to an ffs file system with the default fs-block
size of 16K causes the following bogus splitting even if ffs allocates
all the blocks optimally (contiguously):

At ffs level:
        12*16K (direct data blocks)
        1*16K (indirect block; but ffs usually gets this wrong and doesn't
               allocate it contiguously)
        4*16K (data blocks indirected through the indirect block)

At clustering level:
        17*16K reblocked to 2*128K + 1*16K

At device driver level:
        2*128K + 1*16K split into 63K, 63K, 2K, 63K, 63K, 2K, 16K

So splitting undoes almost half of the gathering done by the clustering
level (we start with 17 blocks and end with 7).  Ideally we would end
with 5 (4*63K + 1*20K).
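
The block counts in this example can be checked with a few lines.  This is
a sketch under the same assumptions as the example (all 17 blocks
contiguous, clustering max of 128K, driver max of 63K); split() is a
hypothetical helper, not a kernel routine.

```python
KB = 1024

def split(size, max_size):
    """Split one request into chunks of at most max_size bytes."""
    chunks = [max_size] * (size // max_size)
    if size % max_size:
        chunks.append(size % max_size)
    return chunks

# 12 direct blocks + 1 indirect block + 4 indirected data blocks, 16K each.
blocks = [16 * KB] * 17
total = sum(blocks)                    # contiguous, so one 272K run

clusters = split(total, 128 * KB)      # clustering with mnt_iosize_max = 128K
driver = [c for cl in clusters for c in split(cl, 63 * KB)]
ideal = split(total, 63 * KB)          # if clustering used the driver's max

print([c // KB for c in clusters])     # [128, 128, 16]
print([c // KB for c in driver])       # [63, 63, 2, 63, 63, 2, 16]
print([c // KB for c in ideal])        # [63, 63, 63, 63, 20]
```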

Caching in not-very-old drives (but not ZIP or CD/DVD ones) makes
stupid blocking not very harmful for reads, but doesn't help so much
for writes.

Bruce
--0-580949596-1128386928=:45947--