Date: Mon, 19 Dec 2011 15:53:17 -0600
From: Dan Nelson
To: Stefan Esser
Cc: FreeBSD Current
Subject: Re: Uneven load on drives in ZFS RAIDZ1
Message-ID: <20111219215317.GL53453@dan.emsphone.com>
In-Reply-To: <4EEFA05E.7090507@freebsd.org>

In the last episode (Dec 19), Stefan Esser said:
> On 19.12.2011 17:22, Dan Nelson wrote:
> > In the last episode (Dec 19), Stefan Esser said:
> >> For quite some time I have observed an uneven distribution of load
> >> between drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt
> >> of a longer log of 10-second averages logged with gstat:
> >>
> >> dT: 10.001s  w: 10.000s  filter: ^a?da?.$
> >>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
> >>     0    130    106   4134    4.5     23   1033    5.2   48.8| ada0
> >>     0    131    111   3784    4.2     19   1007    4.0   47.6| ada1
> >>     0     90     66   2219    4.5     24   1031    5.1   31.7| ada2
> >>     1     81     58   2007    4.6     22   1023    2.3   28.1| ada3
> [...]
>
> This is a ZFS-only system. The first partition on each drive holds just
> the gptzfsloader.
>
> pool        alloc   free   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> raid1       4.41T  2.21T    139     72  12.3M   818K
>   raidz1    4.41T  2.21T    139     72  12.3M   818K
>     ada0p2      -      -    114     17  4.24M   332K
>     ada1p2      -      -    106     15  3.82M   305K
>     ada2p2      -      -     65     20  2.09M   337K
>     ada3p2      -      -     58     18  2.18M   329K
>
> The same difference in read operations per second as shown by gstat ...

I was under the impression that the parity blocks were scattered evenly
across all disks, but from reading vdev_raidz.c, it looks like that isn't
always the case.
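Here's a toy model of my reading of the allocation code (just arithmetic;
the real function shuffles raidz_col_t structures around, and I'm assuming
512-byte sectors and back-to-back, nicely aligned allocations, so treat the
details as guesses, not as the actual ZFS source):

#include <stdint.h>
#include <stdio.h>

#define ASHIFT	9	/* assuming 512-byte sectors */

/*
 * Toy model: the first column of a stripe starts at disk
 * (offset >> ASHIFT) % ndisks and holds the parity block; the bit-20
 * test then swaps parity into the second column for every other
 * megabyte of offset.
 */
static int
parity_disk(uint64_t offset, int ndisks)
{
	int col = (int)((offset >> ASHIFT) % ndisks);

	if (offset & (1ULL << 20))	/* the 1MB parity toggle */
		col = (col + 1) % ndisks;
	/*
	 * Rotating across all columns, which is what I would have
	 * expected, might instead look like:
	 *	col = (int)((col + (offset >> 20)) % ndisks);
	 */
	return (col);
}

int
main(void)
{
	int hits[4] = { 0 };

	/* 1GB of sequential, full-width 128K stripes on 4 disks */
	for (uint64_t off = 0; off < 1ULL << 30; off += 128 << 10)
		hits[parity_disk(off, 4)]++;

	for (int d = 0; d < 4; d++)
		printf("disk %d: %d parity blocks\n", d, hits[d]);
	return (0);
}

With those assumptions, every parity block lands on the first two columns
(4096 each here, zero elsewhere), and since parity is never read back,
those two columns turn into the half-cold disks.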
See the comment at the bottom of the vdev_raidz_map_alloc() function; it
looks like it toggles parity between the first two disks in a stripe every
1MB. It's not necessarily the first two disks assigned to the zvol, since
stripes don't have to span all disks as long as there's one parity block
(a small sync write may just hit two disks, essentially being written
mirrored). The imbalance is only visible if you're writing full-width
stripes in sequence, so if you write a 1TB file in one long stream,
chances are that file's parity blocks will be concentrated on just two
disks, and those two disks will then see less I/O on later reads. I don't
know why the code toggles parity between just the first two columns;
rotating it across all of them (the commented-out variant in my sketch
above) would give you an even balance.

Is it always the last two disks that have less load, or does the load
slowly rotate to different disks depending on the data that you are
reading? An interesting test would be to idle the system, run a
"tar cvf /dev/null /raidz1" in one window, and watch iostat output in
another. If the load moves from disk to disk as tar reads different
files, then my parity guess is probably right. If ada0 and ada1 are
always busier, then you can ignore me :)

Since it looks like the algorithm ends up creating two half-cold parity
disks instead of one cold disk, I bet a 3-disk RAIDZ would exhibit even
worse balancing, and a 5-disk set would be more even.
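To put rough numbers on that (this is just the arithmetic falling out of
the two-half-cold-disks model above, not anything measured): the n-2
all-data disks serve a block in every stripe, while the two
parity-carrying disks only serve one in every other stripe.

#include <stdio.h>

int
main(void)
{
	/*
	 * Expected share of read ops per disk if sequential full-width
	 * stripes keep all parity on two columns.  Toy arithmetic only.
	 */
	for (int n = 3; n <= 5; n++) {
		double hot = 2.0 / (2 * n - 2);	 /* the n-2 all-data disks */
		double cold = 1.0 / (2 * n - 2); /* the 2 parity disks */

		printf("%d-disk raidz1: %d disks at %4.1f%%, "
		    "2 disks at %4.1f%% of reads\n",
		    n, n - 2, hot * 100, cold * 100);
	}
	return (0);
}

For 4 disks that works out to about 33/33/17/17, which is suspiciously
close to the read split in your gstat sample (106/111/66/58 r/s is
roughly 31/33/19/17).

-- 
Dan Nelson
dnelson@allantgroup.com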