Date:      Tue, 9 Aug 2011 06:52:12 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Jeremie Le Hen <jeremie@le-hen.org>
Cc:        freebsd-fs@FreeBSD.org
Subject:   Re: zfs mirror reads only on one disk
Message-ID:  <20110809135212.GA13334@icarus.home.lan>
In-Reply-To: <20110809131057.GA53580@felucia.tataz.chchile.org>
References:  <20110809131057.GA53580@felucia.tataz.chchile.org>

On Tue, Aug 09, 2011 at 03:10:57PM +0200, Jeremie Le Hen wrote:
> Please Cc: me when replying, as I've not subscribed.  Thanks.
> 
> I'm using FreeBSD 8.2-STABLE, with a mirrored ZFS pool v15:
> 
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             ad10s1  ONLINE       0     0     0
>             ad6s1   ONLINE       0     0     0
> 
>     ad6: 1907729MB <Hitachi HDS723020BLA642 MN6OA180> at ata3-master UDMA100 SATA 3Gb/s
>     ad10: 1907729MB <WDC WD2002FAEX-007BA0 05.01D05> at ata5-master UDMA100 SATA 3Gb/s
> 
> (For those who wonder why I use a sliced disk: the disks are not the
> same, and slicing lets me get the same size on each side.  Besides,
> ZFS v15 doesn't have the autoexpand property, so this is a
> workaround.)
> 
> The mirror is correctly synchronized, and when I write to it, I get the
> following iostat(8) output (3-second interval):
> 
>                             extended device statistics
>     device     r/s   w/s    kr/s    kw/s wait svc_t  %b
>     ad6        0.0 682.8     0.0 41593.3   16  18.7  77
>     ad10       0.3 686.8    21.3 41465.4   19  19.4  80
>                             extended device statistics
>     device     r/s   w/s    kr/s    kw/s wait svc_t  %b
>     ad6        0.0 680.9     0.0 41910.7   16  17.3  78
>     ad10       0.0 671.2     0.0 41228.1   16  19.6  80
> 
> 
> However, when I read from the mirror, only ad10 is being used:
> 
>                             extended device statistics
>     device     r/s   w/s    kr/s    kw/s wait svc_t  %b
>     ad6        0.0   0.0     0.0     0.0    0   0.0   0
>     ad10     762.7   0.0 48796.7     0.0    2   1.8  82
>                             extended device statistics
>     device     r/s   w/s    kr/s    kw/s wait svc_t  %b
>     ad6        0.0   0.0     0.0     0.0    0   0.0   0
>     ad10     740.2   0.0 47373.1     0.0    1   1.9  81
>                             extended device statistics
>     device     r/s   w/s    kr/s    kw/s wait svc_t  %b
>     ad6        0.0   0.3     0.0     1.3    0   0.2   0
>     ad10     716.2   0.3 45836.0     1.3    2   1.9  82
> 
> 
> One of my colleagues told me this might be a ZFS optimization for
> sequential reads, so I tried running two reading processes in parallel,
> with the same unfortunate outcome.
> 
> I also tried running "cat *" in a heavily populated Maildir, so I'm sure
> the reads are not sequential, with the same outcome.
> 
> Do you have any idea why this happens?

Since I have a ZFS mirror setup I can test this.  Let's take a look:

ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD1002FAEX-00Z3A0 05.01D05> ATA-8 SATA 3.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <WDC WD1001FALS-00J7B1 05.00K05> ATA-8 SATA 2.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)

Each of these disks can push about 140MByte/sec (sequential), but I don't
expect to see that kind of I/O.  I do expect to see around 100MByte/sec
per disk (you'll just have to trust me; I'm used to my disks!  :-) ).

zpool-wise, there's absolutely nothing special (note I am using ZFSv28 on
RELENG_8, however), and *VERY* little tuning is done in loader.conf:

icarus# zpool status data
  pool: data
 state: ONLINE
 scan: scrub repaired 0 in 0h54m with 0 errors on Tue Jun 14 10:24:49 2011
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

errors: No known data errors

icarus# egrep ^vfs.zfs /boot/loader.conf
vfs.zfs.arc_max="5120M"

So let's test.  I have some pretty big files on the data/storage
filesystem, so let's try dd'ing one of those while simultaneously using
"gstat -I500ms -f 'ada1|ada3'" to watch disk I/O.  It's *extremely*
important that I dd a file which isn't already in ARC; otherwise the
reads are served from RAM and never touch the disks.  (ARC right now for
me takes up about 6GB of RAM, so I'll pick a CD image I haven't accessed
since the machine was last rebooted.)

icarus# cd /storage/CD_Images/FreeBSD/7.4-STABLE/
icarus# ls -l *disc1*
-rwxr--r--  1 storage  storage  663519232 Mar  4 06:54 FreeBSD-7.4-RELEASE-amd64-disc1.iso

icarus# dd if=FreeBSD-7.4-RELEASE-amd64-disc1.iso of=/dev/null bs=64k
10124+1 records in
10124+1 records out
663519232 bytes transferred in 3.965980 secs (167302715 bytes/sec)

And in another window:

dT: 0.504s  w: 0.500s  filter: ada1|ada3
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   10    750    750  94557   13.4      0      0    0.0  100.4| ada1
   10    631    631  80771   15.9      0      0    0.0  100.2| ada3
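For anyone who wants to sanity-check these figures, here is a rough Python sketch. It only redoes the arithmetic on the numbers printed above: the average rate dd reports, and how the reads in that one gstat sample were divided between the two disks. The function and variable names are made up for illustration and are not part of any FreeBSD tool. A single 0.5 second gstat sample is a snapshot, so it need not match dd's average over the whole run.

```python
# Figures copied from the dd and gstat output above.
DD_BYTES = 663519232
DD_SECONDS = 3.965980
GSTAT_READ_KBPS = {"ada1": 94557, "ada3": 80771}  # one 0.5 s gstat sample


def read_shares(kbps_by_disk):
    """Return each disk's fraction of the total read bandwidth."""
    total = sum(kbps_by_disk.values())
    return {disk: kbps / total for disk, kbps in kbps_by_disk.items()}


dd_rate = DD_BYTES / DD_SECONDS  # bytes/sec, the same quantity dd prints
shares = read_shares(GSTAT_READ_KBPS)

print(f"dd average: {dd_rate / 1e6:.1f} million bytes/sec")
for disk, share in sorted(shares.items()):
    print(f"{disk}: {share:.0%} of reads in this sample")
```

A split close to 50/50 is what you would hope for from a healthy two-way mirror; the original poster's iostat output would put ad10 at essentially 100%.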

Looks to me like both disks were getting utilised.  Let's double check
with "zpool iostat -v data 1" and use another file which isn't in the
ARC:

icarus# cd ../8.2-STABLE/
icarus# ls -l *memstick*
-rwxr--r--  1 storage  storage  1087774720 Mar  4 06:17 FreeBSD-8.2-RELEASE-amd64-memstick.img

icarus# dd if=FreeBSD-8.2-RELEASE-amd64-memstick.img of=/dev/null bs=64k
16598+1 records in
16598+1 records out
1087774720 bytes transferred in 6.802677 secs (159903917 bytes/sec)

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G      0      0      0      0
  mirror     278G   650G      0      0      0      0
    ada1        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.09K      0   138M      0
  mirror     278G   650G  1.09K      0   138M      0
    ada1        -      -    595      0  74.2M      0
    ada3        -      -    519      0  63.7M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.10K      0   140M      0
  mirror     278G   650G  1.10K      0   140M      0
    ada1        -      -    542      0  66.8M      0
    ada3        -      -    584      0  73.1M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.32K      0   168M      0
  mirror     278G   650G  1.32K      0   168M      0
    ada1        -      -    724      0  89.3M      0
    ada3        -      -    626      0  78.3M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.26K      0   161M      0
  mirror     278G   650G  1.26K      0   161M      0
    ada1        -      -    655      0  80.7M      0
    ada3        -      -    637      0  79.7M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.23K      0   156M      0
  mirror     278G   650G  1.23K      0   156M      0
    ada1        -      -    635      0  78.2M      0
    ada3        -      -    625      0  78.2M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G  1.17K      0   148M      0
  mirror     278G   650G  1.17K      0   148M      0
    ada1        -      -    600      0  73.8M      0
    ada3        -      -    595      0  74.4M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G    955      0   119M      0
  mirror     278G   650G    955      0   119M      0
    ada1        -      -    411      0  50.8M      0
    ada3        -      -    544      0  68.1M      0
----------  -----  -----  -----  -----  -----  -----

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         278G   650G      0      0      0      0
  mirror     278G   650G      0      0      0      0
    ada1        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

Performance was a little less than I estimated (I really don't care, to
be honest), but this double-confirms that yes, reads do get split across
mirror members.
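If you'd rather not eyeball the per-second samples, something like the following Python sketch could total the read bandwidth per mirror member from captured "zpool iostat -v" text. Treat it as an illustration only: the helper names are hypothetical, the sample is just two of the intervals shown above, and treating the K/M/G suffixes as powers of 1024 is an assumption (only the ratio between disks matters here).

```python
from collections import defaultdict

# Assumption: suffixes are binary multiples. Only the ratio between disks matters.
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def to_number(field):
    """Convert a zpool iostat field such as '74.2M', '595' or '-' to a float."""
    if field == "-":
        return 0.0
    if field[-1] in UNITS:
        return float(field[:-1]) * UNITS[field[-1]]
    return float(field)


def read_bandwidth_by_disk(text, disks):
    """Sum the 'bandwidth read' column for each named disk across all samples."""
    totals = defaultdict(float)
    for line in text.splitlines():
        fields = line.split()
        # Device rows: name, alloc, free, read ops, write ops, read bw, write bw
        if len(fields) == 7 and fields[0] in disks:
            totals[fields[0]] += to_number(fields[5])
    return dict(totals)


# Two of the one-second samples from the output above.
SAMPLE = """\
    ada1        -      -    595      0  74.2M      0
    ada3        -      -    519      0  63.7M      0
    ada1        -      -    542      0  66.8M      0
    ada3        -      -    584      0  73.1M      0
"""

totals = read_bandwidth_by_disk(SAMPLE, {"ada1", "ada3"})
grand_total = sum(totals.values())
for disk in sorted(totals):
    print(f"{disk}: {totals[disk] / grand_total:.1%} of read bandwidth")
```

Feeding it the full captured output instead of SAMPLE works the same way, since rows for the pool and mirror vdev are skipped by name.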

Therefore I cannot explain what you're seeing.  Maybe consider upgrading
to a newer RELENG_8 and ZFSv28 and see if things improve?  I wish I had
a way to confirm this would fix your problem but I do not.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |



