Date:      Mon, 19 Dec 2011 15:36:29 +0100
From:      Olivier Smedts <olivier@gid0.org>
To:        Stefan Esser <se@freebsd.org>
Cc:        FreeBSD Current <freebsd-current@freebsd.org>
Subject:   Re: Uneven load on drives in ZFS RAIDZ1
Message-ID:  <CABzXLYNgFh6T2oRNosdh9mb8Bu7D2xcT3styK9sFdywhHv6D2w@mail.gmail.com>
In-Reply-To: <4EEF488E.1030904@freebsd.org>
References:  <4EEF488E.1030904@freebsd.org>

2011/12/19 Stefan Esser <se@freebsd.org>:
> Hi ZFS users,
>
> for quite some time I have observed an uneven distribution of load
> between the drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt
> of a longer log of 10-second averages recorded with gstat:
>
> dT: 10.001s  w: 10.000s  filter: ^a?da?.$
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0    130    106   4134    4.5     23   1033    5.2   48.8| ada0
>     0    131    111   3784    4.2     19   1007    4.0   47.6| ada1
>     0     90     66   2219    4.5     24   1031    5.1   31.7| ada2
>     1     81     58   2007    4.6     22   1023    2.3   28.1| ada3
>
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     1    132    104   4036    4.2     27   1129    5.3   45.2| ada0
>     0    129    103   3679    4.5     26   1115    6.8   47.6| ada1
>     1     91     61   2133    4.6     30   1129    1.9   29.6| ada2
>     0     81     56   1985    4.8     24   1102    6.0   29.4| ada3
>
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     1    148    108   4084    5.3     39   2511    7.2   55.5| ada0
>     1    141    104   3693    5.1     36   2505   10.4   54.4| ada1
>     1    102     62   2112    5.6     39   2508    5.5   35.4| ada2
>     0     99     60   2064    6.0     39   2483    3.7   36.1| ada3
>
> This goes on for minutes without a change of roles. (I had assumed that
> other 10-minute samples might show relatively higher load on another
> subset of the drives, but it is always the first two, which receive some
> 50% more read requests than the other two.)
>
> The test consisted of minidlna rebuilding its content database for a
> media collection held on that pool. The unbalanced distribution of
> requests does not depend on the particular application, and it does not
> change when the drives with the highest load approach 100% busy.
>
> This is a -CURRENT built from yesterday's sources, but the problem has
> existed for quite some time (and should definitely be reproducible on
> -STABLE, too).
>
> The pool consists of a 4-drive raidz1 on an ICH10 (H67) without cache or
> log devices and without much ZFS tuning (only a maximum ARC size, which
> should not be relevant in this context):
>
> zpool status -v
>   pool: raid1
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         raid1       ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             ada0p2  ONLINE       0     0     0
>             ada1p2  ONLINE       0     0     0
>             ada2p2  ONLINE       0     0     0
>             ada3p2  ONLINE       0     0     0
>
> errors: No known data errors
>
> Cached configuration:
>         version: 28
>         name: 'raid1'
>         state: 0
>         txg: 153899
>         pool_guid: 10507751750437208608
>         hostid: 3558706393
>         hostname: 'se.local'
>         vdev_children: 1
>         vdev_tree:
>             type: 'root'
>             id: 0
>             guid: 10507751750437208608
>             children[0]:
>                 type: 'raidz'
>                 id: 0
>                 guid: 7821125965293497372
>                 nparity: 1
>                 metaslab_array: 30
>                 metaslab_shift: 36
>                 ashift: 12
>                 asize: 7301425528832
>                 is_log: 0
>                 create_txg: 4
>                 children[0]:
>                     type: 'disk'
>                     id: 0
>                     guid: 7487684108701568404
>                     path: '/dev/ada0p2'
>                     phys_path: '/dev/ada0p2'
>                     whole_disk: 1
>                     create_txg: 4
>                 children[1]:
>                     type: 'disk'
>                     id: 1
>                     guid: 12000329414109214882
>                     path: '/dev/ada1p2'
>                     phys_path: '/dev/ada1p2'
>                     whole_disk: 1
>                     create_txg: 4
>                 children[2]:
>                     type: 'disk'
>                     id: 2
>                     guid: 2926246868795008014
>                     path: '/dev/ada2p2'
>                     phys_path: '/dev/ada2p2'
>                     whole_disk: 1
>                     create_txg: 4
>                 children[3]:
>                     type: 'disk'
>                     id: 3
>                     guid: 5226543136138409733
>                     path: '/dev/ada3p2'
>                     phys_path: '/dev/ada3p2'
>                     whole_disk: 1
>                     create_txg: 4
>
> I'd be interested to know whether this behavior can be reproduced on
> other systems with raidz1 pools consisting of 4 or more drives. All it
> takes is generating some disk load and running the command:
>
>         gstat -I 10000000 -f '^a?da?.$'
>
> to obtain 10-second averages.
>
> I have not even tried to look at the scheduling of requests in ZFS, but
> I'm surprised to see higher than average load on just 2 of the 4 drives:
> RAIDZ parity should be spread evenly over all drives, and for each file
> system block a different subset of 3 out of 4 drives should be able to
> deliver the data without reconstructing it from parity, which would lead
> to an even distribution of load.
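
To make that expectation concrete, here is a toy model (my own sketch, not
ZFS's actual allocator; real RAIDZ maps variable-sized blocks onto the vdev,
so this only illustrates the parity-rotation idea) of reads against a 4-disk
raidz1:

    # Toy model of a 4-disk raidz1: each stripe holds 3 data columns
    # and 1 parity column, and the parity column rotates per stripe.
    # Reads touch only the data columns, so over many stripes every
    # disk should end up with ~25% of the total read I/Os.
    from collections import Counter

    DISKS = ["ada0", "ada1", "ada2", "ada3"]
    reads = Counter()

    for stripe in range(100000):
        parity_disk = stripe % len(DISKS)      # rotating parity column
        for i, disk in enumerate(DISKS):
            if i != parity_disk:               # only data columns are read
                reads[disk] += 1

    total = sum(reads.values())
    for disk in DISKS:
        print("%s: %.1f%% of reads" % (disk, 100.0 * reads[disk] / total))
    # prints 25.0% for every disk, i.e. the even distribution expected above

In that model nothing singles out ada0/ada1, which is why the observed skew
is surprising.
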
>
> I've got two theories about what might cause the observed behavior:
>
> 1) There is some metadata that is only kept on the first two drives.
> Data is evenly spread, but metadata accesses lead to additional reads.
>
> 2) The read requests are distributed in such a way that 1/3 goes to
> ada0, another 1/3 to ada1, while the remaining 1/3 is evenly distributed
> to ada2 and ada3.
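
As a rough sanity check (my own arithmetic, using only the r/s column of the
first gstat sample quoted above), theory 2 fits those numbers reasonably well:

    # Compare theory 2 (1/3, 1/3, 1/6, 1/6 of reads) against the r/s
    # values of the first gstat sample quoted above.
    observed = {"ada0": 106, "ada1": 111, "ada2": 66, "ada3": 58}
    shares   = {"ada0": 1/3.0, "ada1": 1/3.0, "ada2": 1/6.0, "ada3": 1/6.0}

    total = sum(observed.values())             # 341 reads/s in total
    for disk in ("ada0", "ada1", "ada2", "ada3"):
        predicted = shares[disk] * total
        print("%s: observed %3d r/s, theory 2 predicts %5.1f r/s"
              % (disk, observed[disk], predicted))

The predicted split (about 114/114/57/57 r/s) is much closer to the measured
106/111/66/58 than an even 25% split (about 85 r/s per disk) would be.
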
>
>
> So: Can anybody reproduce this distribution of requests?

Hello,

Stupid question, but are your drives all exactly the same? I noticed
"ashift: 12", so I think you have at least one 4k-sector drive. Are you
sure they're not mixed with 512-byte-per-sector drives?
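
In case it helps, something along these lines should show whether the drives
report different sector sizes (a sketch assuming FreeBSD's diskinfo(8) -v
output contains a "sectorsize" line and, on recent versions, a "stripesize"
line for the physical sector size):

    # Print the sector-size related lines of `diskinfo -v` for each
    # pool member, to spot a mix of 512B and 4K drives.
    # Assumes diskinfo(8) labels these fields "sectorsize"/"stripesize".
    import subprocess

    for dev in ("/dev/ada0", "/dev/ada1", "/dev/ada2", "/dev/ada3"):
        out = subprocess.check_output(["diskinfo", "-v", dev])
        for line in out.decode().splitlines():
            if "sectorsize" in line or "stripesize" in line:
                print(dev, line.strip())
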

>
> Any idea why this is happening, and whether something should be changed
> in ZFS to better distribute the load (leading to higher file system
> performance)?
>
> Best regards, STefan
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"



-- 
Olivier Smedts                                                 _
                                        ASCII ribbon campaign ( )
e-mail: olivier@gid0.org        - against HTML email & vCards  X
www: http://www.gid0.org   - against proprietary attachments  / \

=A0 "Il y a seulement 10 sortes de gens dans le monde :
=A0 ceux qui comprennent le binaire,
=A0 et ceux qui ne le comprennent pas."


