From owner-freebsd-fs@FreeBSD.ORG  Tue Aug  9 13:52:14 2011
Date: Tue, 9 Aug 2011 06:52:12 -0700
From: Jeremy Chadwick
To: Jeremie Le Hen
Cc: freebsd-fs@FreeBSD.org
Subject: Re: zfs mirror reads only on one disk
Message-ID: <20110809135212.GA13334@icarus.home.lan>
In-Reply-To: <20110809131057.GA53580@felucia.tataz.chchile.org>

On Tue, Aug 09, 2011 at 03:10:57PM +0200, Jeremie Le Hen wrote:
> Please Cc: me when replying, as I've not subscribed.  Thanks.
> 
> I'm using FreeBSD 8.2-STABLE, with a mirrored ZFS pool v15:
> 
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             ad10s1  ONLINE       0     0     0
>             ad6s1   ONLINE       0     0     0
> 
> ad6: 1907729MB at ata3-master UDMA100 SATA 3Gb/s
> ad10: 1907729MB at ata5-master UDMA100 SATA 3Gb/s
> 
> (For those who wonder why I use a sliced disk: the disks are not the
> same, and slicing allows me to get the same size on each side.
> Besides, ZFS v15 doesn't have the autoexpand property, so this is a
> workaround.)
> 
> The mirror is correctly synchronized, and when I write to it, I get
> the following iostat(8) output (3 second interval):
> 
>                         extended device statistics
> device     r/s   w/s    kr/s     kw/s wait svc_t  %b
> ad6        0.0 682.8     0.0  41593.3   16  18.7  77
> ad10       0.3 686.8    21.3  41465.4   19  19.4  80
>                         extended device statistics
> device     r/s   w/s    kr/s     kw/s wait svc_t  %b
> ad6        0.0 680.9     0.0  41910.7   16  17.3  78
> ad10       0.0 671.2     0.0  41228.1   16  19.6  80
> 
> However, when I read from the mirror, only ad10 is being used:
> 
>                         extended device statistics
> device     r/s   w/s    kr/s     kw/s wait svc_t  %b
> ad6        0.0   0.0     0.0      0.0    0   0.0   0
> ad10     762.7   0.0 48796.7      0.0    2   1.8  82
>                         extended device statistics
> device     r/s   w/s    kr/s     kw/s wait svc_t  %b
> ad6        0.0   0.0     0.0      0.0    0   0.0   0
> ad10     740.2   0.0 47373.1      0.0    1   1.9  81
>                         extended device statistics
> device     r/s   w/s    kr/s     kw/s wait svc_t  %b
> ad6        0.0   0.3     0.0      1.3    0   0.2   0
> ad10     716.2   0.3 45836.0      1.3    2   1.9  82
> 
> One of my colleagues told me this might be an optimization of ZFS for
> sequential reads, so I tried to run two reading processes in parallel,
> with the same unfortunate outcome.
> 
> I also tried to run "cat *" in a highly populated Maildir, so I'm sure
> the reads are not sequential; same outcome.
> 
> Do you have any idea why this happens?

Since I have a ZFS mirror setup, I can test this.  Let's take a look:

ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: ATA-8 SATA 3.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: ATA-8 SATA 2.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)

Each of these disks can push about 140MByte/sec sequentially, but I
don't expect to see that kind of I/O here.  I do expect to see around
100MByte/sec per disk (you'll just have to trust me on that; I know my
disks! :-) ).

zpool-wise there is absolutely nothing special (note that I am using
ZFSv28 on RELENG_8, however), and *VERY* little tuning is done in
loader.conf:

icarus# zpool status data
  pool: data
 state: ONLINE
 scan: scrub repaired 0 in 0h54m with 0 errors on Tue Jun 14 10:24:49 2011
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

errors: No known data errors

icarus# egrep ^vfs.zfs /boot/loader.conf
vfs.zfs.arc_max="5120M"

So let's test.  I have some pretty big files on the data/storage
filesystem, so let's try dd'ing one of those while simultaneously
running "gstat -I500ms -f 'ada1|ada3'" to watch disk I/O.  It's
*extremely* important that I dd a file which isn't already in the ARC
(the ARC currently takes up about 6GB of RAM on this box, so I'll pick
a CD image I haven't accessed since the machine was last rebooted).

icarus# cd /storage/CD_Images/FreeBSD/7.4-STABLE/
icarus# ls -l *disc1*
-rwxr--r--  1 storage  storage  663519232 Mar  4 06:54 FreeBSD-7.4-RELEASE-amd64-disc1.iso
icarus# dd if=FreeBSD-7.4-RELEASE-amd64-disc1.iso of=/dev/null bs=64k
10124+1 records in
10124+1 records out
663519232 bytes transferred in 3.965980 secs (167302715 bytes/sec)

And in another window:

dT: 0.504s  w: 0.500s  filter: ada1|ada3
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   10    750    750  94557   13.4      0      0    0.0  100.4| ada1
   10    631    631  80771   15.9      0      0    0.0  100.2| ada3

Looks to me like both disks were getting utilised.
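If you want to repeat the same kind of check on your end, something
along these lines should do it.  This is only a rough sketch: the file
path below is a placeholder for whatever large file you have handy that
hasn't been read since boot (so it isn't already sitting in your ARC),
and the gstat filter uses your device names instead of mine:

  # read a big, not-recently-touched file in the background...
  dd if=/data/path/to/some-large-file.iso of=/dev/null bs=64k &
  # ...and watch both mirror members in another terminal
  gstat -I500ms -f 'ad6|ad10'

If reads are being balanced, both ad6 and ad10 should show non-zero
r/s and kBps while the dd runs.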
Let's double check with "zpool iostat -v data 1" and use another file
which isn't in the ARC:

icarus# cd ../8.2-STABLE/
icarus# ls -l *memstick*
-rwxr--r--  1 storage  storage  1087774720 Mar  4 06:17 FreeBSD-8.2-RELEASE-amd64-memstick.img
icarus# dd if=FreeBSD-8.2-RELEASE-amd64-memstick.img of=/dev/null bs=64k
16598+1 records in
16598+1 records out
1087774720 bytes transferred in 6.802677 secs (159903917 bytes/sec)

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G      0      0      0      0
  mirror     278G   650G      0      0      0      0
    ada1        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.09K      0   138M      0
  mirror     278G   650G  1.09K      0   138M      0
    ada1        -      -    595      0  74.2M      0
    ada3        -      -    519      0  63.7M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.10K      0   140M      0
  mirror     278G   650G  1.10K      0   140M      0
    ada1        -      -    542      0  66.8M      0
    ada3        -      -    584      0  73.1M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.32K      0   168M      0
  mirror     278G   650G  1.32K      0   168M      0
    ada1        -      -    724      0  89.3M      0
    ada3        -      -    626      0  78.3M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.26K      0   161M      0
  mirror     278G   650G  1.26K      0   161M      0
    ada1        -      -    655      0  80.7M      0
    ada3        -      -    637      0  79.7M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.23K      0   156M      0
  mirror     278G   650G  1.23K      0   156M      0
    ada1        -      -    635      0  78.2M      0
    ada3        -      -    625      0  78.2M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.17K      0   148M      0
  mirror     278G   650G  1.17K      0   148M      0
    ada1        -      -    600      0  73.8M      0
    ada3        -      -    595      0  74.4M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G    955      0   119M      0
  mirror     278G   650G    955      0   119M      0
    ada1        -      -    411      0  50.8M      0
    ada3        -      -    544      0  68.1M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G      0      0      0      0
  mirror     278G   650G      0      0      0      0
    ada1        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

Performance was a little less than I estimated (I really don't care, to
be honest), but this double-confirms that yes, reads do get split
across mirror members.  Therefore I cannot explain what you're seeing.

Maybe consider upgrading to a newer RELENG_8 and ZFSv28 and see if
things improve?  I wish I had a way to confirm this would fix your
problem, but I do not.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, US  |
| Making life hard for others since 1977.              PGP 4BD6C0CB  |
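P.S. If you do end up moving to a ZFSv28-capable RELENG_8, the rough
sequence for bumping the on-disk versions afterwards looks like the
sketch below.  Treat it as illustrative only, and keep in mind that
version upgrades are one-way (older kernels won't be able to import
the pool once it's done):

  zpool upgrade        # list pools still on an older on-disk version
  zpool upgrade data   # upgrade the 'data' pool itself (irreversible)
  zfs upgrade -a       # bump the filesystem versions as well

None of that is required just to run the newer code; a v28-capable
kernel will happily use your v15 pool, so you can test the read
behaviour first and only upgrade the pool once you're satisfied.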