From owner-freebsd-current@FreeBSD.ORG Mon Dec 19 14:22:11 2011
Message-ID: <4EEF488E.1030904@freebsd.org>
Date: Mon, 19 Dec 2011 15:22:06 +0100
From: Stefan Esser <se@freebsd.org>
To: FreeBSD Current <freebsd-current@freebsd.org>
Subject: Uneven load on drives in ZFS RAIDZ1

Hi ZFS users,

for quite some time I have observed an uneven distribution of load
between the drives in a 4 * 2TB RAIDZ1 pool. The following is an
excerpt of a longer log of 10 second averages logged with gstat:

dT: 10.001s  w: 10.000s  filter: ^a?da?.$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    130    106   4134    4.5     23   1033    5.2   48.8| ada0
    0    131    111   3784    4.2     19   1007    4.0   47.6| ada1
    0     90     66   2219    4.5     24   1031    5.1   31.7| ada2
    1     81     58   2007    4.6     22   1023    2.3   28.1| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    132    104   4036    4.2     27   1129    5.3   45.2| ada0
    0    129    103   3679    4.5     26   1115    6.8   47.6| ada1
    1     91     61   2133    4.6     30   1129    1.9   29.6| ada2
    0     81     56   1985    4.8     24   1102    6.0   29.4| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    148    108   4084    5.3     39   2511    7.2   55.5| ada0
    1    141    104   3693    5.1     36   2505   10.4   54.4| ada1
    1    102     62   2112    5.6     39   2508    5.5   35.4| ada2
    0     99     60   2064    6.0     39   2483    3.7   36.1| ada3

This goes on for minutes without a change of roles. (I had assumed
that other 10 minute samples might show relatively higher load on
another subset of the drives, but it is always the first two, which
receive some 50% more read requests than the other two.) The test
consisted of minidlna rebuilding its content database for a media
collection held on that pool.
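To put a number on the imbalance, something like the following quick
Python sketch can be used to sum up the r/s column per drive from a
captured gstat log (the file name "gstat.log" is just an example, and
the parsing assumes the column layout shown in the excerpt above):

    from collections import defaultdict

    reads = defaultdict(float)

    with open("gstat.log") as f:
        for line in f:
            fields = line.split()
            # data lines have 10 columns and end in the device name;
            # header and dT lines are skipped by this check
            if len(fields) == 10 and fields[-1].startswith("ada"):
                reads[fields[-1]] += float(fields[2])   # r/s column

    total = sum(reads.values()) or 1.0
    for dev in sorted(reads):
        print("%s: %5.1f%% of all read requests"
              % (dev, 100.0 * reads[dev] / total))

For the samples above this consistently puts roughly half of all read
requests on ada0 and ada1 together being well above their quarter share.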
The unbalanced distribution of requests does not depend on the
particular application, and it does not change when the most heavily
loaded drives approach 100% busy.

This is a -CURRENT built from yesterday's sources, but the problem has
existed for quite some time (and should definitely be reproducible on
-STABLE, too). The pool consists of a 4-drive raidz1 on an ICH10 (H67)
without cache or log devices and without much ZFS tuning (only the
max. ARC size is set, which should not be relevant in this context):

zpool status -v
  pool: raid1
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid1       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0

errors: No known data errors

Cached configuration:
        version: 28
        name: 'raid1'
        state: 0
        txg: 153899
        pool_guid: 10507751750437208608
        hostid: 3558706393
        hostname: 'se.local'
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 10507751750437208608
            children[0]:
                type: 'raidz'
                id: 0
                guid: 7821125965293497372
                nparity: 1
                metaslab_array: 30
                metaslab_shift: 36
                ashift: 12
                asize: 7301425528832
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 7487684108701568404
                    path: '/dev/ada0p2'
                    phys_path: '/dev/ada0p2'
                    whole_disk: 1
                    create_txg: 4
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 12000329414109214882
                    path: '/dev/ada1p2'
                    phys_path: '/dev/ada1p2'
                    whole_disk: 1
                    create_txg: 4
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 2926246868795008014
                    path: '/dev/ada2p2'
                    phys_path: '/dev/ada2p2'
                    whole_disk: 1
                    create_txg: 4
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 5226543136138409733
                    path: '/dev/ada3p2'
                    phys_path: '/dev/ada3p2'
                    whole_disk: 1
                    create_txg: 4

I'd be interested to know whether this behavior can be reproduced on
other systems with raidz1 pools consisting of 4 or more drives. All it
takes is generating some disk load and running the command

    gstat -I 10000000 -f '^a?da?.$'

to obtain 10 second averages.

I have not even tried to look at the scheduling of requests in ZFS,
but I'm surprised to see higher-than-average load on just 2 of the 4
drives, since RAID parity should be evenly spread over all drives, and
for each file system block a different subset of 3 out of 4 drives
should be able to deliver the data without the need to reconstruct it
from parity (which would lead to an even distribution of load).

I've got two theories about what might cause the observed behavior:

1) There is some metadata that is only kept on the first two drives.
   Data is evenly spread, but metadata accesses lead to additional
   reads.

2) The read requests are distributed in such a way that 1/3 goes to
   ada0, another 1/3 to ada1, while the remaining 1/3 is evenly
   distributed over ada2 and ada3 (see the plausibility check below).

So: Can anybody reproduce this distribution of requests? Any idea why
this is happening, and whether something should be changed in ZFS to
better distribute the load (leading to higher file system performance)?

Best regards,
STefan
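P.S.: A quick plausibility check of theory (2) against the r/s values
from the first 10 second sample above (purely illustrative arithmetic,
nothing ZFS specific):

    # observed r/s from the first gstat sample above
    observed = {"ada0": 106.0, "ada1": 111.0, "ada2": 66.0, "ada3": 58.0}
    total = sum(observed.values())

    # theory (2): 1/3 of reads to ada0, 1/3 to ada1, 1/6 each to ada2/ada3
    share = {"ada0": 1 / 3.0, "ada1": 1 / 3.0, "ada2": 1 / 6.0, "ada3": 1 / 6.0}

    for dev in sorted(observed):
        print("%s: observed %5.1f r/s, even split %5.1f, theory (2) %5.1f"
              % (dev, observed[dev], total / 4.0, total * share[dev]))

The observed numbers are noticeably closer to the 1/3, 1/3, 1/6, 1/6
split than to an even distribution across all four drives.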