Date: Sat, 27 Oct 2007 19:24:20 +0200
From: Peter Schuller
To: freebsd-fs@freebsd.org
Message-ID: <20071027172420.GA64599@hyperion.scode.org>
Subject: zfs/arc tuning to keep disks busy

Hello,

I am not sure whether this is FreeBSD specific or applies to ZFS in
general, as I no longer have an OpenSolaris machine to try this on.

One particularly common case whose performance is not entirely optimal
is simply copying[1] a file from one filesystem to another, with the
filesystems being on different physical drives. For example, in this
particular case[2] I am rsync -avWP:ing data from my current /usr (on
ZFS, single disk) onto another ZFS pool on different drives.

The behavior is roughly this, in chronological order:

(1) Read data from the source at the expected rate, saturating the disk.
(2) After some seconds, switch to writing to the destination at the
    expected rate, saturating the disk.
(3) Stop writing to the destination, and go to (1).

Optimally it should of course be reading and writing concurrently, so
that both the source and the destination can be saturated.

Without knowing the implementation details, my interpretation is that
there are two symptoms here:

(1) Flushing of data to the destination occurs too late, so that writes
    block processes for extended periods of time in bursts (seconds),
    rather than data being flushed pre-emptively to prevent writes from
    blocking except when the destination device(s) are truly saturated.

(2) Even when data is being written out (on the order of several tens of
    megabytes over 1-2 seconds), the userspace write does not seem to
    unblock until all pending writes are complete.

The timing of the writes seems to coincide with the 5 second commit
interval, which is expected if the amount of data written in 5 seconds
fits in the cache. Reads seem to stop slightly after that, which would
be consistent with a decision not to push more data into the cache,
instead waiting for the commit to finish.

Based on the above observations, my guess is that (1) all dirty data
that is in the cache at the start of the checkpoint process is written
out in a single transaction group, and (2) data in the cache is never
evicted until the entire transaction group is fully committed to disk.
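To make that guess a bit more concrete, here is a toy model of the copy
(plain Python; this is not ZFS code, and the cache size and per-tick
disk rates are made-up numbers purely for illustration). It simulates a
reader filling a fixed-size write cache whose space is only freed once
the entire transaction group has been committed, which reproduces the
alternating read/write bursts I am seeing:

CACHE_MB = 500          # dirty-data limit before the writer blocks (made up)
DISK_MB_PER_TICK = 100  # per-disk throughput per tick (made up)
TOTAL_MB = 3000         # amount of data to copy (made up)

def copy_burst_commit():
    """The reader fills the cache; nothing is evicted until the whole
    transaction group has been committed, so reads stall in bursts."""
    dirty = copied = ticks = 0
    committing = False
    while copied < TOTAL_MB or dirty > 0:
        ticks += 1
        if committing or copied >= TOTAL_MB:
            # Commit phase: destination disk busy, reader blocked until
            # the entire group is on disk.
            dirty = max(0, dirty - DISK_MB_PER_TICK)
            if dirty == 0:
                committing = False
        else:
            # Read phase: source disk busy, destination idle.
            chunk = min(DISK_MB_PER_TICK, CACHE_MB - dirty, TOTAL_MB - copied)
            dirty += chunk
            copied += chunk
            if dirty >= CACHE_MB:
                committing = True
    return ticks

print("burst-commit model:", copy_burst_commit(), "ticks")

In this model the two disks never work at the same time, so the copy
takes roughly twice as long as the combined bandwidth would allow, which
matches the behavior described above.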
This would explain the behavior, since it would have exactly the effect
that writes start to block once there is no more room for cached data,
and the room then becomes available in a burst at commit time rather
than incrementally as data is written out.

Is the above an accurate description of what is going on? If so, I
wonder if there is a way to force ZFS to pre-emptively start flushing
dirty data out to disk earlier, presumably once the percentage of the
cache in use (relative to the amount allowed to be used for writes)
reaches <= 50%. If I had to guess, that percentage is more like 80-90%
right now. Of course, perhaps the cache does not work even remotely like
this, but the behavior seems consistent with what you would get if this
were the case.

Alternatively, can one get ZFS to commit smaller transaction groups,
thus allowing data to be evicted more quickly, rather than committing
*everything* as a single transaction? Though this would go against the
goal of minimizing the number of commits.

[1] No concurrent I/O; just a plain rsync -avWP on an otherwise idle system.
[2] I have observed this overall, not just in this case.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
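PS: To illustrate the pre-emptive flushing suggested above, the same toy
model (same made-up numbers, still not ZFS code) can be changed so that
dirty data is flushed as soon as the cache crosses a 50% low-water mark
and space is freed incrementally; the reader and writer then overlap and
the simulated copy finishes in roughly half the time:

CACHE_MB = 500          # dirty-data limit before the writer blocks (made up)
DISK_MB_PER_TICK = 100  # per-disk throughput per tick (made up)
TOTAL_MB = 3000         # amount of data to copy (made up)

def copy_incremental_flush(low_water=0.5):
    """Dirty data starts flushing once the cache passes the low-water
    mark, and space is freed as each chunk lands on disk, so the reader
    and the writer overlap and both disks stay busy."""
    dirty = copied = ticks = 0
    while copied < TOTAL_MB or dirty > 0:
        ticks += 1
        # Writer: flush once the low-water mark is crossed (or at the end).
        if dirty >= CACHE_MB * low_water or copied >= TOTAL_MB:
            dirty = max(0, dirty - DISK_MB_PER_TICK)
        # Reader: keeps going as long as there is room in the cache.
        if copied < TOTAL_MB and dirty < CACHE_MB:
            chunk = min(DISK_MB_PER_TICK, CACHE_MB - dirty, TOTAL_MB - copied)
            dirty += chunk
            copied += chunk
    return ticks

print("incremental-flush model:", copy_incremental_flush(), "ticks")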