From: Stefan Esser <se@freebsd.org>
Date: Tue, 20 Dec 2011 12:45:48 +0100
To: Dan Nelson
Cc: FreeBSD Current
Subject: Re: Uneven load on drives in ZFS RAIDZ1

On 19.12.2011 22:53, Dan Nelson wrote:
> In the last episode (Dec 19), Stefan Esser said:
>> pool        alloc   free   read  write   read  write
>> ----------  -----  -----  -----  -----  -----  -----
>> raid1       4.41T  2.21T    139     72  12.3M   818K
>>   raidz1    4.41T  2.21T    139     72  12.3M   818K
>>     ada0p2      -      -    114     17  4.24M   332K
>>     ada1p2      -      -    106     15  3.82M   305K
>>     ada2p2      -      -     65     20  2.09M   337K
>>     ada3p2      -      -     58     18  2.18M   329K
>>
>> The same difference of read operations per second as shown by gstat ...
>
> I was under the impression that the parity blocks were scattered evenly
> across all disks, but from reading vdev_raidz.c, it looks like that
> isn't always the case.  See the comment at the bottom of the
> vdev_raidz_map_alloc() function; it looks like it will toggle parity
> between the first two disks in a stripe every 1MB.

Thanks, this is very interesting information, indeed. I observed the
problem when minidlna rebuilt its index database, which scans all media
files, many of them gigabytes in length and written sequentially. This
is a typical scenario that should trigger the code you point at.
The comment explains that an attempt was made to spread the (read) load
more evenly when large files are written sequentially:

 * If all data stored spans all columns, there's a danger that parity
 * will always be on the same device and, since parity isn't read
 * during normal operation, that that device's I/O bandwidth won't be
 * used effectively. We therefore switch the parity every 1MB.

But further down they admit that this did not turn out to be a good
solution:

 * ... at least that was, ostensibly, the theory. As a practical
 * matter unless we juggle the parity between all devices evenly, we
 * won't see any benefit. Further, occasional writes that aren't a
 * multiple of the LCM of the number of children and the minimum
 * stripe width are sufficient to avoid pessimal behavior.

But I do not understand the reasoning behind:

 * Unfortunately, this decision created an implicit on-disk format
 * requirement that we need to support for all eternity, but only
 * for single-parity RAID-Z.

I see how the devidx and offset are swapped between col[0] and col[1],
and this swapping is not explicitly reflected in the metadata. But
there is no reason the algorithm could not be modified to rotate the
parity over all drives if some flag is set (which would effectively
create a second-generation raidz1 with an incompatible block layout).
Anyway, I do not think the current behavior is so bad that it needs
immediate fixing.

> It's not necessarily the first two disks assigned to the zvol, since
> stripes don't have to span all disks as long as there's one parity
> block (a small sync write may just hit two disks, essentially being
> written mirrored).  The imbalance is only visible if you're writing
> full-width stripes in sequence, so if you write a 1TB file in one
> long stream, chances are that that file's parity blocks will be
> concentrated on just two disks, so those two disks will get less I/O
> on later reads.  I don't know why the code toggles parity between
> just the first two columns; rotating it between all columns would
> give you an even balance.

Yes, but as the comment indicates, this would require introducing a
different raidz1 layout (a higher ZFS version or a flag could trigger
that).
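To see how lopsided the current toggle is compared to a full rotation,
here is a toy simulation. It is NOT the real vdev_raidz code: it only
models the "swap the parity column when bit 20 of the offset is set"
toggle for full-width, back-to-back 128KB records (ignoring the
offset-dependent column rotation raidz also does), and compares it with
a hypothetical rotation of the parity over all children:

/* parity_balance.c -- toy model, not the real vdev_raidz code. */
#include <stdio.h>
#include <stdint.h>

#define NCHILDREN 4		/* disks in the raidz1 vdev */
#define RECSIZE   (128 << 10)	/* record size in bytes */
#define NRECORDS  100000	/* records to simulate */

int
main(void)
{
	long toggle[NCHILDREN] = { 0 };	/* current 1MB toggle */
	long rotate[NCHILDREN] = { 0 };	/* hypothetical full rotation */
	uint64_t off = 0;

	for (long i = 0; i < NRECORDS; i++, off += RECSIZE) {
		/* parity nominally on child 0, moved to child 1 in
		 * every other 1MB region of the offset space */
		toggle[(off & (1ULL << 20)) ? 1 : 0]++;
		/* hypothetical: rotate parity over all children per 1MB */
		rotate[(off >> 20) % NCHILDREN]++;
	}

	printf("%-6s %12s %12s\n", "child", "1MB-toggle", "full-rotate");
	for (int c = 0; c < NCHILDREN; c++)
		printf("ada%-3d %12ld %12ld\n", c, toggle[c], rotate[c]);
	return (0);
}

Under these assumptions the toggle variant puts all parity on two of
the four children (half each) and none on the other two, so on later
sequential reads the parity-heavy pair serves roughly half the read
ops of the other pair -- the same 2:1 ratio seen in the zpool iostat
numbers above (which pair of disks ends up cold depends on the actual
allocation offsets). The rotation variant spreads the parity 25% per
disk.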
> Is it always the last two disks that have less load, or does it slowly
> rotate to different disks depending on the data that you are reading?
> An interesting test would be to idle the system, run a "tar cvf
> /dev/null /raidz1" in one window, and watch iostat output in another
> window.  If the load moves from disk to disk as tar reads different
> files, then my parity guess is probably right.  If ada0 and ada1 are
> always busier, then you can ignore me :)

Yes, you are perfectly right! I tested the tar on a spool directory
holding DVB-C recordings (typical file lengths 2GB to 8GB). The gstat
output:

dT: 10.001s  w: 10.000s  filter: ^a?da?.$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    935    921  40216    0.4     13    139    0.5   32.8| ada0
    0    927    913  36530    0.3     13    108    1.5   31.8| ada1
    0    474    460  20110    0.7     14    141    0.9   32.4| ada2
    0    474    461  20102    0.7     13    141    0.7   31.6| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0   1046   1041  45503    0.3      5     35    0.9   31.5| ada0
    0   1039   1035  41353    0.3      4     23    0.4   31.6| ada1
    0    531    526  22827    0.6      5     38    0.4   33.4| ada2
    1    523    518  22772    0.6      5     38    0.6   30.8| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    384    377  16414    0.8      7     46    3.3   30.2| ada0
    0    380    373  15857    0.8      6     42    0.4   30.5| ada1
    0    553    547  23937    0.5      6     47    1.7   28.0| ada2
    1    551    545  22004    0.6      6     38    0.7   32.2| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    667    656  28633    0.4     11    123    0.6   29.6| ada0
    1    660    650  26010    0.5     10    109    0.6   33.4| ada1
    0    338    327  14328    0.8     11    126    0.9   25.7| ada2
    0    339    328  14303    1.0     11    120    1.0   32.7| ada3

$ iostat -d -n4 3
            ada0             ada1             ada2             ada3
  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s
  44.0 860 36.94   40.0 860 33.60   44.0 429 18.44   44.0 431 18.50
  43.9 814 34.86   39.9 813 31.67   43.8 408 17.45   43.7 408 17.44
  43.4 900 38.10   39.4 899 34.64   42.5 463 19.18   42.7 459 19.14
  44.0 904 38.86   40.0 904 35.33   44.0 453 19.44   44.0 452 19.42
! 43.1 571 24.01   41.5 571 23.17   43.4 799 33.85   40.0 801 31.27
! 44.0 461 19.79   44.0 460 19.74   44.0 920 39.52   40.0 920 35.93
! 43.9 435 18.65   43.9 435 18.68   44.0 868 37.29   40.0 868 33.91
! 42.8 390 16.29   42.8 390 16.28   43.4 765 32.42   39.4 767 29.48
! 44.0 331 14.22   44.0 329 14.12   44.0 659 28.32   40.0 659 25.75
! 41.8 332 13.55   42.1 326 13.38   42.9 640 26.84   39.0 640 24.38
  44.0 452 19.40   42.2 451 18.58   44.0 597 25.66   40.7 595 23.65
= 42.3 589 24.33   39.8 585 22.75   42.1 562 23.14   39.7 561 21.77
= 43.0 569 23.93   40.8 570 22.72   43.0 641 26.95   40.1 642 25.14
  44.0 709 30.48   40.9 710 28.41   44.0 607 26.10   41.8 606 24.73
  44.0 785 33.73   40.6 784 31.07   44.0 567 24.36   42.4 568 23.50
  44.0 899 38.62   40.0 899 35.11   44.0 449 19.30   44.0 450 19.32
  44.0 881 37.87   40.0 881 34.43   44.0 441 18.94   44.0 441 18.93
  43.4 841 35.61   39.4 841 32.37   42.7 428 17.87   42.7 428 17.84

Hmmm, looking back through hundreds of lines of iostat output I see
that ada0 and ada1 see similar request rates, as do ada2 and ada3. But
I know that I also observed other combinations in earlier tests (with
different data?).

> Since it looks like the algorithm ends up creating two half-cold
> parity disks instead of one cold disk, I bet a 3-disk RAIDZ would
> exhibit even worse balancing, and a 5-disk set would be more even.

Yes, this sounds very reasonable. Some iostat results were posted for
a 6-disk raidz1, but they were for writes, not reads. I have kept the
3*1TB drives that formed the pool before I replaced them with 4*2TB
drives; I can create a 3-drive raidz1 on them and perform some
tests ...

BTW: Read throughput in the tar test was far lower than I had expected.
The CPU load was 3% user and some 0.2% system time (on an i7-2600K),
yet the effective transfer rate of the pool was only some 115MB/s. The
pool has 1/3 empty space, and the test files were written in one go,
so they should have been laid out in an optimal way. A dd of a large
file (~10GB) gives similar results, independent of the block size
(128k vs. 1m).

Transfer sizes were only 43KB on average, which matches MAXPHYS=128KB
distributed over 3 data drives (plus parity in case of writes). This
indicates that, in order to read MAXPHYS bytes from each drive, the
original request would have had to cover 3*MAXPHYS.
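For reference, here is that arithmetic as a small sketch. It follows my
simplified reading of the vdev_raidz_map_alloc() column split and
assumes 512-byte sectors (ashift=9); it is not the real code:

/* colsize.c -- sketch: split of a 128KB record on a 4-disk raidz1. */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const uint64_t psize = 128 << 10;	/* logical record size */
	const int unit_shift = 9;		/* assumed 512-byte sectors */
	const uint64_t dcols = 4;		/* children in the raidz1 */
	const uint64_t nparity = 1;		/* raidz1 */

	uint64_t s = psize >> unit_shift;	  /* data sectors in record */
	uint64_t q = s / (dcols - nparity);	  /* full rows */
	uint64_t r = s - q * (dcols - nparity);	  /* sectors in last row */
	uint64_t bc = (r == 0 ? 0 : r + nparity); /* cols one sector larger */

	for (uint64_t c = 0; c < dcols; c++) {
		uint64_t sects = q + (c < bc ? 1 : 0);
		printf("%s col %llu: %3llu sectors = %5.1f KB\n",
		    c < nparity ? "parity" : "data  ",
		    (unsigned long long)c, (unsigned long long)sects,
		    (double)(sects << unit_shift) / 1024.0);
	}
	return (0);
}

That works out to 42.5-43KB per column, matching the 43-44 KB/t that
iostat reports; reading a full MAXPHYS worth from each disk would
indeed require a ~384KB (3*MAXPHYS) logical request.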
But the small transfer size does not seem to be the cause of the low
transfer rate:

# dd if=/dev/ada2p2 of=/dev/null bs=10k count=10000
10000+0 records in
10000+0 records out
102400000 bytes transferred in 0.853374 secs (119994281 bytes/sec)

# dd if=/dev/ada1p2 of=/dev/null bs=2k count=50000
50000+0 records in
50000+0 records out
102400000 bytes transferred in 2.668089 secs (38379531 bytes/sec)

Even a block size of 2KB still results in 35-40MB/s read throughput ...

Any idea why the read performance is so much lower than the hardware
should allow?

Regards, STefan