Date:      Tue, 16 Jul 2013 12:58:43 +0100
From:      "Steven Hartland" <killing@multiplay.co.uk>
To:        "Daniel Kalchev" <daniel@digsys.bg>, "freebsd-fs" <freebsd-fs@freebsd.org>
Subject:   Re: ZFS vdev I/O questions
Message-ID:  <3472068604314C9887FE3BD4CD42B7C8@multiplay.co.uk>
References:  <51E5316B.9070201@digsys.bg>

One thing to check with this is to add -d to your gstat to see
if you're waiting on delete IO. I doubt it, but it's worth checking.
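For example, something like:

    gstat -d -f 'da[0-9]*$'

should add d/s, kBps and ms/d columns for BIO_DELETE requests alongside the
normal read/write ones (the filter is just the one you use below), so any
deletes queued up behind the destroy would show there.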

    Regards
    Steve
----- Original Message ----- 
From: "Daniel Kalchev" <daniel@digsys.bg>
To: "freebsd-fs" <freebsd-fs@freebsd.org>
Sent: Tuesday, July 16, 2013 12:41 PM
Subject: ZFS vdev I/O questions


> I am observing some "strange" behaviour with how I/O is spread across 
> ZFS vdevs and thought I might ask if someone has observed it too.
> 
> The system hardware is a Supermicro X8DTH-6F board with an integrated 
> LSI2008 controller, two Xeon E5620 CPUs and 72GB of RAM (6x4 + 6x8 GB 
> modules).
> It runs 9-stable r252690.
> 
> It currently has an 18-drive zpool, split into three 6-drive raidz2 
> vdevs, plus ZIL and L2ARC on separate SSDs (the L2ARC is a 240GB Intel 
> 520). The ZIL consists of two partitions on the boot SSDs (Intel 320), 
> not mirrored. The zpool layout is
> 
>   pool: storage
>  state: ONLINE
>   scan: scrub canceled on Thu Jul 11 17:14:50 2013
> config:
> 
>         NAME            STATE     READ WRITE CKSUM
>         storage         ONLINE       0     0     0
>           raidz2-0      ONLINE       0     0     0
>             gpt/disk00  ONLINE       0     0     0
>             gpt/disk01  ONLINE       0     0     0
>             gpt/disk02  ONLINE       0     0     0
>             gpt/disk03  ONLINE       0     0     0
>             gpt/disk04  ONLINE       0     0     0
>             gpt/disk05  ONLINE       0     0     0
>           raidz2-1      ONLINE       0     0     0
>             gpt/disk06  ONLINE       0     0     0
>             gpt/disk07  ONLINE       0     0     0
>             gpt/disk08  ONLINE       0     0     0
>             gpt/disk09  ONLINE       0     0     0
>             gpt/disk10  ONLINE       0     0     0
>             gpt/disk11  ONLINE       0     0     0
>           raidz2-2      ONLINE       0     0     0
>             gpt/disk12  ONLINE       0     0     0
>             gpt/disk13  ONLINE       0     0     0
>             gpt/disk14  ONLINE       0     0     0
>             gpt/disk15  ONLINE       0     0     0
>             gpt/disk16  ONLINE       0     0     0
>             gpt/disk17  ONLINE       0     0     0
>         logs
>           ada0p2        ONLINE       0     0     0
>           ada1p2        ONLINE       0     0     0
>         cache
>           da20p2        ONLINE       0     0     0
> 
> 
> zdb output
> 
> storage:
>     version: 5000
>     name: 'storage'
>     state: 0
>     txg: 5258772
>     pool_guid: 17094379857311239400
>     hostid: 3505628652
>     hostname: 'a1.register.bg'
>     vdev_children: 5
>     vdev_tree:
>         type: 'root'
>         id: 0
>         guid: 17094379857311239400
>         children[0]:
>             type: 'raidz'
>             id: 0
>             guid: 2748500753748741494
>             nparity: 2
>             metaslab_array: 33
>             metaslab_shift: 37
>             ashift: 12
>             asize: 18003521961984
>             is_log: 0
>             create_txg: 4
>             children[0]:
>                 type: 'disk'
>                 id: 0
>                 guid: 5074824874132816460
>                 path: '/dev/gpt/disk00'
>                 phys_path: '/dev/gpt/disk00'
>                 whole_disk: 1
>                 DTL: 378
>                 create_txg: 4
>             children[1]:
>                 type: 'disk'
>                 id: 1
>                 guid: 14410366944090513563
>                 path: '/dev/gpt/disk01'
>                 phys_path: '/dev/gpt/disk01'
>                 whole_disk: 1
>                 DTL: 53
>                 create_txg: 4
>             children[2]:
>                 type: 'disk'
>                 id: 2
>                 guid: 3526681390841761237
>                 path: '/dev/gpt/disk02'
>                 phys_path: '/dev/gpt/disk02'
>                 whole_disk: 1
>                 DTL: 52
>                 create_txg: 4
>             children[3]:
>                 type: 'disk'
>                 id: 3
>                 guid: 3773850995072323004
>                 path: '/dev/gpt/disk03'
>                 phys_path: '/dev/gpt/disk03'
>                 whole_disk: 1
>                 DTL: 51
>                 create_txg: 4
>             children[4]:
>                 type: 'disk'
>                 id: 4
>                 guid: 16528489666301728411
>                 path: '/dev/gpt/disk04'
>                 phys_path: '/dev/gpt/disk04'
>                 whole_disk: 1
>                 DTL: 50
>                 create_txg: 4
>             children[5]:
>                 type: 'disk'
>                 id: 5
>                 guid: 11222774817699257051
>                 path: '/dev/gpt/disk05'
>                 phys_path: '/dev/gpt/disk05'
>                 whole_disk: 1
>                 DTL: 44147
>                 create_txg: 4
>         children[1]:
>             type: 'raidz'
>             id: 1
>             guid: 614220834244218709
>             nparity: 2
>             metaslab_array: 39
>             metaslab_shift: 37
>             ashift: 12
>             asize: 18003521961984
>             is_log: 0
>             create_txg: 40
>             children[0]:
>                 type: 'disk'
>                 id: 0
>                 guid: 8076478524731550200
>                 path: '/dev/gpt/disk06'
>                 phys_path: '/dev/gpt/disk06'
>                 whole_disk: 1
>                 DTL: 2914
>                 create_txg: 40
>             children[1]:
>                 type: 'disk'
>                 id: 1
>                 guid: 1689851194543981566
>                 path: '/dev/gpt/disk07'
>                 phys_path: '/dev/gpt/disk07'
>                 whole_disk: 1
>                 DTL: 48
>                 create_txg: 40
>             children[2]:
>                 type: 'disk'
>                 id: 2
>                 guid: 9743236178648200269
>                 path: '/dev/gpt/disk08'
>                 phys_path: '/dev/gpt/disk08'
>                 whole_disk: 1
>                 DTL: 47
>                 create_txg: 40
>             children[3]:
>                 type: 'disk'
>                 id: 3
>                 guid: 10157617457760516410
>                 path: '/dev/gpt/disk09'
>                 phys_path: '/dev/gpt/disk09'
>                 whole_disk: 1
>                 DTL: 46
>                 create_txg: 40
>             children[4]:
>                 type: 'disk'
>                 id: 4
>                 guid: 5035981195206926078
>                 path: '/dev/gpt/disk10'
>                 phys_path: '/dev/gpt/disk10'
>                 whole_disk: 1
>                 DTL: 45
>                 create_txg: 40
>             children[5]:
>                 type: 'disk'
>                 id: 5
>                 guid: 4975835521778875251
>                 path: '/dev/gpt/disk11'
>                 phys_path: '/dev/gpt/disk11'
>                 whole_disk: 1
>                 DTL: 44149
>                 create_txg: 40
>         children[2]:
>             type: 'raidz'
>             id: 2
>             guid: 7453512836015019221
>             nparity: 2
>             metaslab_array: 38974
>             metaslab_shift: 37
>             ashift: 12
>             asize: 18003521961984
>             is_log: 0
>             create_txg: 4455560
>             children[0]:
>                 type: 'disk'
>                 id: 0
>                 guid: 11182458869377968267
>                 path: '/dev/gpt/disk12'
>                 phys_path: '/dev/gpt/disk12'
>                 whole_disk: 1
>                 DTL: 45059
>                 create_txg: 4455560
>             children[1]:
>                 type: 'disk'
>                 id: 1
>                 guid: 5844283175515272344
>                 path: '/dev/gpt/disk13'
>                 phys_path: '/dev/gpt/disk13'
>                 whole_disk: 1
>                 DTL: 44145
>                 create_txg: 4455560
>             children[2]:
>                 type: 'disk'
>                 id: 2
>                 guid: 13095364699938843583
>                 path: '/dev/gpt/disk14'
>                 phys_path: '/dev/gpt/disk14'
>                 whole_disk: 1
>                 DTL: 44144
>                 create_txg: 4455560
>             children[3]:
>                 type: 'disk'
>                 id: 3
>                 guid: 5196507898996589388
>                 path: '/dev/gpt/disk15'
>                 phys_path: '/dev/gpt/disk15'
>                 whole_disk: 1
>                 DTL: 44143
>                 create_txg: 4455560
>             children[4]:
>                 type: 'disk'
>                 id: 4
>                 guid: 12809770017318709512
>                 path: '/dev/gpt/disk16'
>                 phys_path: '/dev/gpt/disk16'
>                 whole_disk: 1
>                 DTL: 44142
>                 create_txg: 4455560
>             children[5]:
>                 type: 'disk'
>                 id: 5
>                 guid: 7339883019925920701
>                 path: '/dev/gpt/disk17'
>                 phys_path: '/dev/gpt/disk17'
>                 whole_disk: 1
>                 DTL: 44141
>                 create_txg: 4455560
>         children[3]:
>             type: 'disk'
>             id: 3
>             guid: 18011869864924559827
>             path: '/dev/ada0p2'
>             phys_path: '/dev/ada0p2'
>             whole_disk: 1
>             metaslab_array: 16675
>             metaslab_shift: 26
>             ashift: 12
>             asize: 8585216000
>             is_log: 1
>             DTL: 86787
>             create_txg: 5182360
>         children[4]:
>             type: 'disk'
>             id: 4
>             guid: 1338775535758010670
>             path: '/dev/ada1p2'
>             phys_path: '/dev/ada1p2'
>             whole_disk: 1
>             metaslab_array: 16693
>             metaslab_shift: 26
>             ashift: 12
>             asize: 8585216000
>             is_log: 1
>             DTL: 86788
>             create_txg: 5182377
>     features_for_read:
> 
> Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640, 
> firmware MKAOA3B0) -- these are 512-byte sector drives, but da0 has been 
> replaced by a Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, 
> firmware CC24) -- a 4k sector drive of a newer generation (note the 
> relatively 'old' firmware, which can't be upgraded).
> Drives da6-da17 are also Seagate Barracuda 7200.14 (AF), but 
> ST3000DM001-9YN166 with firmware CC4H -- the more "normal" part number. 
> Some have firmware CC4C, which I am replacing drive by drive (but other 
> than the excessive load-cycle counts, no other issues so far).
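> 
> (The model, firmware and sector-size details can be read straight from 
> the drives through the HBA with something like
> 
>    # camcontrol identify da0
> 
> assuming the mps driver passes the ATA commands through; it reports the 
> firmware revision and the logical/physical sector sizes.)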
> 
> The only ZFS related tuning is in /etc/sysctl.conf
> # improve ZFS resilver
> vfs.zfs.resilver_delay=0
> vfs.zfs.scrub_delay=0
> vfs.zfs.top_maxinflight=128
> vfs.zfs.resilver_min_time_ms=5000
> vfs.zfs.vdev.max_pending=24
> # L2ARC:
> vfs.zfs.l2arc_norw=0
> vfs.zfs.l2arc_write_max=83886080
> vfs.zfs.l2arc_write_boost=83886080
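> 
> (These all appear to be plain runtime sysctls here, so I can change them 
> on the fly and read them back to confirm, e.g.
> 
>    # sysctl vfs.zfs.vdev.max_pending=24
>    # sysctl vfs.zfs.vdev.max_pending
> 
> /etc/sysctl.conf just makes them persistent across reboots.)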
> 
> 
> The pool of course had dedup enabled, with serious dedup ratios -- over 
> 10x. In general, with the ZIL and L2ARC, the only trouble I have seen 
> with dedup is when deleting lots of data... which this server has seen 
> plenty of. During this experiment, I have moved most of the data to 
> another server and am un-deduping the last remaining TBs.
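> 
> (By "un-dedup" I mean rewriting the remaining data with dedup switched 
> off so the new copies no longer reference the DDT, roughly along these 
> lines -- dataset names made up for illustration:
> 
>    # zfs set dedup=off storage
>    # zfs create storage/newcopy
>    # rsync -a /storage/olddata/ /storage/newcopy/
>    # zfs destroy -r storage/olddata
> 
> It is the destroy of the old, deduped copies that triggers the behaviour 
> described below.)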
> 
> While doing zfs destroy on an 2-3TB dataset, I observe very annoying 
> behaviour. The pool would stay mostly idle, accepting almost no I/O and 
> doing small random reads, like this
> 
> $ zpool iostat storage 10
> storage     45.3T  3.45T    466      0  1.82M      0
> storage     45.3T  3.45T     50      0   203K      0
> storage     45.3T  3.45T     45     25   183K  1.70M
> storage     45.3T  3.45T     49      0   199K      0
> storage     45.3T  3.45T     50      0   202K      0
> storage     45.3T  3.45T     51      0   204K      0
> storage     45.3T  3.45T     57      0   230K      0
> storage     45.3T  3.45T     65      0   260K      0
> storage     45.3T  3.45T     68     25   274K  1.70M
> storage     45.3T  3.45T     65      0   260K      0
> storage     45.3T  3.45T     64      0   260K      0
> storage     45.3T  3.45T     67      0   272K      0
> storage     45.3T  3.45T     66      0   266K      0
> storage     45.3T  3.45T     64      0   258K      0
> storage     45.3T  3.45T     62     25   250K  1.70M
> storage     45.3T  3.45T     57      0   231K      0
> storage     45.3T  3.45T     58      0   235K      0
> storage     45.3T  3.45T     66      0   267K      0
> storage     45.3T  3.45T     64      0   257K      0
> storage     45.3T  3.45T     60      0   241K      0
> storage     45.3T  3.45T     50      0   203K      0
> storage     45.3T  3.45T     52     25   209K  1.70M
> storage     45.3T  3.45T     54      0   217K      0
> storage     45.3T  3.45T     51      0   205K      0
> storage     45.3T  3.45T     54      0   216K      0
> storage     45.3T  3.45T     55      0   222K      0
> storage     45.3T  3.45T     56      0   226K      0
> storage     45.3T  3.45T     65      0   264K      0
> storage     45.3T  3.45T     71      0   286K      0
> 
> The write peaks are from processes syncing data to the pool -- in this 
> state they do no reads (the data the sync processes deal with is 
> already in ARC).
> Then the pool goes into writing back (perhaps DDT metadata):
> 
> storage     45.3T  3.45T     17  24.4K  69.6K  97.5M
> storage     45.3T  3.45T      0  19.6K      0  78.5M
> storage     45.3T  3.45T      0  14.2K      0  56.8M
> storage     45.3T  3.45T      0  7.90K      0  31.6M
> storage     45.3T  3.45T      0  7.81K      0  32.8M
> storage     45.3T  3.45T      0  9.54K      0  38.2M
> storage     45.3T  3.45T      0  7.07K      0  28.3M
> storage     45.3T  3.45T      0  7.70K      0  30.8M
> storage     45.3T  3.45T      0  6.19K      0  24.8M
> storage     45.3T  3.45T      0  5.45K      0  21.8M
> storage     45.3T  3.45T      0  5.78K      0  24.7M
> storage     45.3T  3.45T      0  5.29K      0  21.2M
> storage     45.3T  3.45T      0  5.69K      0  22.8M
> storage     45.3T  3.45T      0  5.52K      0  22.1M
> storage     45.3T  3.45T      0  3.26K      0  13.1M
> storage     45.3T  3.45T      0  1.77K      0  7.10M
> storage     45.3T  3.45T      0  1.63K      0  8.14M
> storage     45.3T  3.45T      0  1.41K      0  5.64M
> storage     45.3T  3.45T      0  1.22K      0  4.88M
> storage     45.3T  3.45T      0  1.27K      0  5.09M
> storage     45.3T  3.45T      0  1.06K      0  4.26M
> storage     45.3T  3.45T      0  1.07K      0  4.30M
> storage     45.3T  3.45T      0    979      0  3.83M
> storage     45.3T  3.45T      0   1002      0  3.91M
> storage     45.3T  3.45T      0   1010      0  3.95M
> storage     45.3T  3.45T      0    948  2.40K  3.71M
> storage     45.3T  3.45T      0    939      0  3.67M
> storage     45.3T  3.45T      0   1023      0  7.10M
> storage     45.3T  3.45T      0  1.01K  4.80K  4.04M
> storage     45.3T  3.45T      0    822      0  3.22M
> storage     45.3T  3.45T      0    434      0  1.70M
> storage     45.3T  3.45T      0    398  2.40K  1.56M
> 
> For quite some time, there are no reads from the pool. When that 
> happens, gstat (gstat -f 'da[0-9]*$') displays something like this:
> 
> 
> dT: 1.001s  w: 1.000s  filter: da[0-9]*$
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>    24   1338      0      0    0.0   1338  12224   17.8  100.9| da0
>    24   6888      0      0    0.0   6888  60720    3.5  100.0| da1
>    24   6464      0      0    0.0   6464  71997    3.7  100.0| da2
>    24   6117      0      0    0.0   6117  82386    3.9   99.9| da3
>    24   6455      0      0    0.0   6455  66822    3.7  100.0| da4
>    24   6782      0      0    0.0   6782  69207    3.5  100.0| da5
>    24    698      0      0    0.0    698  27533   34.1   99.6| da6
>    24    590      0      0    0.0    590  21627   40.9   99.7| da7
>    24    561      0      0    0.0    561  21031   42.8  100.2| da8
>    24    724      0      0    0.0    724  25583   33.1   99.9| da9
>    24    567      0      0    0.0    567  22965   41.4   98.0| da10
>    24    566      0      0    0.0    566  21834   42.4   99.9| da11
>    24    586      0      0    0.0    586   4899   43.5  100.2| da12
>    24    487      0      0    0.0    487   4008   49.3  100.9| da13
>    24    628      0      0    0.0    628   5007   38.9  100.2| da14
>    24    714      0      0    0.0    714   5706   33.8   99.9| da15
>    24    595      0      0    0.0    595   4831   39.8   99.8| da16
>    24    485      0      0    0.0    485   3932   49.2  100.1| da17
>     0      0      0      0    0.0      0      0    0.0    0.0| da18
>     0      0      0      0    0.0      0      0    0.0    0.0| da19
>     0      0      0      0    0.0      0      0    0.0    0.0| da20
>     0      0      0      0    0.0      0      0    0.0    0.0| ada0
>     0      0      0      0    0.0      0      0    0.0    0.0| ada1
> 
> 
> (Drives da18 and da19 are spares, da20 is the L2ARC SSD, and ada0 and 
> ada1 are the boot SSDs in a separate zpool.)
> Now, here comes the weird part. The gstat display shows intensive writes 
> to all vdevs (da0-da5, da6-da11, da12-da17); then one of the vdevs 
> completes its writes and goes quiet while the others continue, and at 
> the end only one vdev is still writing until, as it seems, the data is 
> completely written to all vdevs (this can be seen in the zpool iostat 
> output above as the write IOPS decreasing every 10 seconds). Then there 
> is a few seconds' "do nothing" period, and then we are back to the 
> small reads.
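> 
> (The same thing can be watched broken down per vdev rather than as pool 
> totals with, e.g.
> 
>    # zpool iostat -v storage 10
> 
> which lists each raidz2 vdev and each member disk separately, so the 
> vdev that finishes early and the one still writing stand out.)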
> 
> The other observation I have is with the first vdev: the 512b drives do 
> a lot of I/O quickly, complete first and then sit idle, while da0 
> continues to write for many more seconds. They consistently show many 
> more IOPS than the other drives for this type of activity -- on 
> streaming writes all drives behave more or less the same. It is only in 
> this un-dedup scenario that the difference is so pronounced.
> 
> All the vdevs in the pool have ashift=12, so the theory that ZFS 
> actually issues 512b writes to these drives can't be true, can it?
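> 
> (As a crude sanity check of the write sizes actually reaching the disks, 
> the average transfer size per drive can be watched with plain iostat, 
> e.g.
> 
>    # iostat -w 10 da0 da1 da2 da3 da4 da5
> 
> If ZFS were really issuing 512-byte writes to da0, its KB/t column would 
> sit near 0.5 rather than at several KB like its neighbours.)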
> 
> Another worry is this Seagate Barracuda 7200.14 (AF) 
> (ST3000DM001-1CH166, firmware CC24) drive. It seems to be constantly 
> under-performing. Does anyone know if it really is that different from 
> the ST3000DM001-9YN166 drives? Maybe I should just replace it?
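> 
> (Before replacing it I could simply benchmark it against one of the 
> 9YN166 drives during a quiet period; diskinfo's naive read benchmark 
> should be enough to show a gross difference, e.g.
> 
>    # diskinfo -tv /dev/da0
>    # diskinfo -tv /dev/da6
> 
> It only reads, but it does add load, so not while the destroy is 
> running.)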
> 
> My concern is the bursty and irregular nature of writing to the vdevs. 
> As it is now, a write operation to the pool needs to wait for all of the 
> vdev writes to complete, which in this case takes tens of seconds. A 
> single underperforming drive in a vdev will slow down the entire pool.
> Perhaps ZFS could prioritize vdev usage based on vdev throughput, 
> similar to how it already biases allocations based on how full each 
> vdev is.
> 
> Also, what is ZFS doing during the idle periods? Are there some timeouts 
> involved? It is certainly not using any CPU... The small random I/O is 
> certainly not loading the disks.
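> 
> (If it helps, I can capture what the ZFS kernel threads are blocked on 
> while this is happening, e.g. by dumping their kernel stacks with
> 
>    # procstat -kk -a | grep -i zfs
> 
> and watching top -SH to confirm nothing is spinning -- suggestions on 
> what else to look at are welcome.)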
> 
> Then, I have a 240GB L2ARC and secondarycache=metadata set on the pool. 
> Yet the DDT apparently does not want to go there... Is there a way to 
> "force" it to be loaded into L2ARC? Before the last big delete, I had:
> 
> zdb -D storage
> DDT-sha256-zap-duplicate: 19907778 entries, size 1603 on disk, 259 in core
> DDT-sha256-zap-unique: 30101659 entries, size 1428 on disk, 230 in core
> 
> dedup = 1.98, compress = 1.00, copies = 1.03, dedup * compress / copies 
> = 1.92
> 
> With time, the in-core values stay more or less the same.
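> 
> (Doing the arithmetic on those per-entry sizes, the in-core footprint is 
> roughly
> 
>    19907778 * 259  + 30101659 * 230  =~ 12.1 GB
> 
> and the on-disk footprint roughly
> 
>    19907778 * 1603 + 30101659 * 1428 =~ 74.9 GB
> 
> so even the on-disk form of the DDT would fit in the 240GB L2ARC several 
> times over, which is why I expected secondarycache=metadata to keep it 
> there.)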
> 
> I also discovered that the L2ARC drive apparently is not subject to 
> TRIM for some reason. TRIM does work on the boot drives (but those are 
> connected to the motherboard SATA ports).
> 
> # sysctl kern.cam.da.20
> kern.cam.da.20.delete_method: ATA_TRIM
> kern.cam.da.20.minimum_cmd_size: 6
> kern.cam.da.20.sort_io_queue: 0
> kern.cam.da.20.error_inject: 0
> 
> # sysctl -a | grep trim
> vfs.zfs.vdev.trim_on_init: 1
> vfs.zfs.vdev.trim_max_pending: 64
> vfs.zfs.vdev.trim_max_bytes: 2147483648
> vfs.zfs.trim.enabled: 1
> vfs.zfs.trim.max_interval: 1
> vfs.zfs.trim.timeout: 30
> vfs.zfs.trim.txg_delay: 32
> kstat.zfs.misc.zio_trim.bytes: 139489971200
> kstat.zfs.misc.zio_trim.success: 628351
> kstat.zfs.misc.zio_trim.unsupported: 622819
> kstat.zfs.misc.zio_trim.failed: 0
> 
> Yet, I don't observe any BIO_DELETE activity on this drive with gstat -d.
> 
> Wasn't TRIM supposed to work on drives attached to LSI2008 in 9-stable?
> 
> Daniel
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>
