From: Daniel Kalchev <daniel@digsys.bg>
To: freebsd-fs <freebsd-fs@freebsd.org>
Date: Tue, 16 Jul 2013 14:41:31 +0300
Subject: ZFS vdev I/O questions

I am observing some "strange" behaviour with how I/O is spread across ZFS vdevs and thought I might ask whether someone has observed it too.

The system hardware is a Supermicro X8DTH-6F board with an integrated LSI 2008 controller, two Xeon E5620 CPUs and 72GB of RAM (6x4 GB + 6x8 GB modules). It runs 9-stable r252690. It currently has an 18-drive zpool, split into three 6-drive raidz2 vdevs, plus ZIL and L2ARC on separate SSDs (the L2ARC is a 240GB Intel 520). The ZIL consists of two partitions on the boot SSDs (Intel 320), not mirrored.
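Roughly speaking, that corresponds to a pool built along the following lines (just a sketch reconstructed from the zpool/zdb output below; the second and third raidz2 vdevs and the log/cache devices were in fact added at later txgs, and these are not the literal commands or partition layout):

# first raidz2 vdev at pool creation
zpool create storage raidz2 gpt/disk00 gpt/disk01 gpt/disk02 gpt/disk03 gpt/disk04 gpt/disk05
# second and third raidz2 vdevs added later
zpool add storage raidz2 gpt/disk06 gpt/disk07 gpt/disk08 gpt/disk09 gpt/disk10 gpt/disk11
zpool add storage raidz2 gpt/disk12 gpt/disk13 gpt/disk14 gpt/disk15 gpt/disk16 gpt/disk17
# two separate (unmirrored) log partitions on the boot SSDs, plus an L2ARC partition
zpool add storage log ada0p2 ada1p2
zpool add storage cache da20p2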
The zpool layout is:

  pool: storage
 state: ONLINE
  scan: scrub canceled on Thu Jul 11 17:14:50 2013
config:

        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          raidz2-0      ONLINE       0     0     0
            gpt/disk00  ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0
            gpt/disk03  ONLINE       0     0     0
            gpt/disk04  ONLINE       0     0     0
            gpt/disk05  ONLINE       0     0     0
          raidz2-1      ONLINE       0     0     0
            gpt/disk06  ONLINE       0     0     0
            gpt/disk07  ONLINE       0     0     0
            gpt/disk08  ONLINE       0     0     0
            gpt/disk09  ONLINE       0     0     0
            gpt/disk10  ONLINE       0     0     0
            gpt/disk11  ONLINE       0     0     0
          raidz2-2      ONLINE       0     0     0
            gpt/disk12  ONLINE       0     0     0
            gpt/disk13  ONLINE       0     0     0
            gpt/disk14  ONLINE       0     0     0
            gpt/disk15  ONLINE       0     0     0
            gpt/disk16  ONLINE       0     0     0
            gpt/disk17  ONLINE       0     0     0
        logs
          ada0p2        ONLINE       0     0     0
          ada1p2        ONLINE       0     0     0
        cache
          da20p2        ONLINE       0     0     0

zdb output:

storage:
    version: 5000
    name: 'storage'
    state: 0
    txg: 5258772
    pool_guid: 17094379857311239400
    hostid: 3505628652
    hostname: 'a1.register.bg'
    vdev_children: 5
    vdev_tree:
        type: 'root'
        id: 0
        guid: 17094379857311239400
        children[0]:
            type: 'raidz'
            id: 0
            guid: 2748500753748741494
            nparity: 2
            metaslab_array: 33
            metaslab_shift: 37
            ashift: 12
            asize: 18003521961984
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 5074824874132816460
                path: '/dev/gpt/disk00'
                phys_path: '/dev/gpt/disk00'
                whole_disk: 1
                DTL: 378
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 14410366944090513563
                path: '/dev/gpt/disk01'
                phys_path: '/dev/gpt/disk01'
                whole_disk: 1
                DTL: 53
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 3526681390841761237
                path: '/dev/gpt/disk02'
                phys_path: '/dev/gpt/disk02'
                whole_disk: 1
                DTL: 52
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 3773850995072323004
                path: '/dev/gpt/disk03'
                phys_path: '/dev/gpt/disk03'
                whole_disk: 1
                DTL: 51
                create_txg: 4
            children[4]:
                type: 'disk'
                id: 4
                guid: 16528489666301728411
                path: '/dev/gpt/disk04'
                phys_path: '/dev/gpt/disk04'
                whole_disk: 1
                DTL: 50
                create_txg: 4
            children[5]:
                type: 'disk'
                id: 5
                guid: 11222774817699257051
                path: '/dev/gpt/disk05'
                phys_path: '/dev/gpt/disk05'
                whole_disk: 1
                DTL: 44147
                create_txg: 4
        children[1]:
            type: 'raidz'
            id: 1
            guid: 614220834244218709
            nparity: 2
            metaslab_array: 39
            metaslab_shift: 37
            ashift: 12
            asize: 18003521961984
            is_log: 0
            create_txg: 40
            children[0]:
                type: 'disk'
                id: 0
                guid: 8076478524731550200
                path: '/dev/gpt/disk06'
                phys_path: '/dev/gpt/disk06'
                whole_disk: 1
                DTL: 2914
                create_txg: 40
            children[1]:
                type: 'disk'
                id: 1
                guid: 1689851194543981566
                path: '/dev/gpt/disk07'
                phys_path: '/dev/gpt/disk07'
                whole_disk: 1
                DTL: 48
                create_txg: 40
            children[2]:
                type: 'disk'
                id: 2
                guid: 9743236178648200269
                path: '/dev/gpt/disk08'
                phys_path: '/dev/gpt/disk08'
                whole_disk: 1
                DTL: 47
                create_txg: 40
            children[3]:
                type: 'disk'
                id: 3
                guid: 10157617457760516410
                path: '/dev/gpt/disk09'
                phys_path: '/dev/gpt/disk09'
                whole_disk: 1
                DTL: 46
                create_txg: 40
            children[4]:
                type: 'disk'
                id: 4
                guid: 5035981195206926078
                path: '/dev/gpt/disk10'
                phys_path: '/dev/gpt/disk10'
                whole_disk: 1
                DTL: 45
                create_txg: 40
            children[5]:
                type: 'disk'
                id: 5
                guid: 4975835521778875251
                path: '/dev/gpt/disk11'
                phys_path: '/dev/gpt/disk11'
                whole_disk: 1
                DTL: 44149
                create_txg: 40
        children[2]:
            type: 'raidz'
            id: 2
            guid: 7453512836015019221
            nparity: 2
            metaslab_array: 38974
            metaslab_shift: 37
            ashift: 12
            asize: 18003521961984
            is_log: 0
            create_txg: 4455560
            children[0]:
                type: 'disk'
                id: 0
                guid: 11182458869377968267
                path: '/dev/gpt/disk12'
                phys_path: '/dev/gpt/disk12'
                whole_disk: 1
                DTL: 45059
                create_txg: 4455560
            children[1]:
                type: 'disk'
                id: 1
                guid: 5844283175515272344
                path: '/dev/gpt/disk13'
                phys_path: '/dev/gpt/disk13'
                whole_disk: 1
                DTL: 44145
                create_txg: 4455560
            children[2]:
                type: 'disk'
                id: 2
                guid: 13095364699938843583
                path: '/dev/gpt/disk14'
                phys_path: '/dev/gpt/disk14'
                whole_disk: 1
                DTL: 44144
                create_txg: 4455560
            children[3]:
                type: 'disk'
                id: 3
                guid: 5196507898996589388
                path: '/dev/gpt/disk15'
                phys_path: '/dev/gpt/disk15'
                whole_disk: 1
                DTL: 44143
                create_txg: 4455560
            children[4]:
                type: 'disk'
                id: 4
                guid: 12809770017318709512
                path: '/dev/gpt/disk16'
                phys_path: '/dev/gpt/disk16'
                whole_disk: 1
                DTL: 44142
                create_txg: 4455560
            children[5]:
                type: 'disk'
                id: 5
                guid: 7339883019925920701
                path: '/dev/gpt/disk17'
                phys_path: '/dev/gpt/disk17'
                whole_disk: 1
                DTL: 44141
                create_txg: 4455560
        children[3]:
            type: 'disk'
            id: 3
            guid: 18011869864924559827
            path: '/dev/ada0p2'
            phys_path: '/dev/ada0p2'
            whole_disk: 1
            metaslab_array: 16675
            metaslab_shift: 26
            ashift: 12
            asize: 8585216000
            is_log: 1
            DTL: 86787
            create_txg: 5182360
        children[4]:
            type: 'disk'
            id: 4
            guid: 1338775535758010670
            path: '/dev/ada1p2'
            phys_path: '/dev/ada1p2'
            whole_disk: 1
            metaslab_array: 16693
            metaslab_shift: 26
            ashift: 12
            asize: 8585216000
            is_log: 1
            DTL: 86788
            create_txg: 5182377
    features_for_read:

Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640, firmware MKAOA3B0) -- these are 512-byte-sector drives -- but da0 has been replaced by a Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, firmware CC24), a 4k-sector drive of a newer generation (notice the relatively 'old' firmware, which can't be upgraded). Drives da6-da17 are also Seagate Barracuda 7200.14 (AF), but ST3000DM001-9YN166 with firmware CC4H -- the more "normal" part number. Some still have firmware CC4C, which I am replacing drive by drive (other than the excessive load counts, no other issues so far).

The only ZFS-related tuning is in /etc/sysctl.conf:

# improve ZFS resilver
vfs.zfs.resilver_delay=0
vfs.zfs.scrub_delay=0
vfs.zfs.top_maxinflight=128
vfs.zfs.resilver_min_time_ms=5000
vfs.zfs.vdev.max_pending=24
# L2ARC:
vfs.zfs.l2arc_norw=0
vfs.zfs.l2arc_write_max=83886080
vfs.zfs.l2arc_write_boost=83886080

The pool of course had dedup enabled and had serious dedup ratios, like over 10x. In general, with the ZIL and L2ARC, the only trouble I have seen with dedup is when deleting lots of data... which this server has seen plenty of. During this experiment, I have moved most of the data to another server and am un-dedup'ing the last remaining TBs. While doing zfs destroy on a 2-3TB dataset, I observe very annoying behaviour.
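For reference, the numbers below were captured while such a destroy was running, roughly like this (a sketch; the dataset name is only a placeholder):

# destroy one of the old deduplicated datasets (name made up)
zfs destroy storage/some-old-dataset
# meanwhile, in other terminals, watch pool-level and per-disk activity
zpool iostat storage 10
gstat -f 'da[0-9]*$'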
The pool would stay mostly idle, accepting almost no I/O and doing small random reads, like this:

$ zpool iostat storage 10
storage     45.3T  3.45T    466      0  1.82M      0
storage     45.3T  3.45T     50      0   203K      0
storage     45.3T  3.45T     45     25   183K  1.70M
storage     45.3T  3.45T     49      0   199K      0
storage     45.3T  3.45T     50      0   202K      0
storage     45.3T  3.45T     51      0   204K      0
storage     45.3T  3.45T     57      0   230K      0
storage     45.3T  3.45T     65      0   260K      0
storage     45.3T  3.45T     68     25   274K  1.70M
storage     45.3T  3.45T     65      0   260K      0
storage     45.3T  3.45T     64      0   260K      0
storage     45.3T  3.45T     67      0   272K      0
storage     45.3T  3.45T     66      0   266K      0
storage     45.3T  3.45T     64      0   258K      0
storage     45.3T  3.45T     62     25   250K  1.70M
storage     45.3T  3.45T     57      0   231K      0
storage     45.3T  3.45T     58      0   235K      0
storage     45.3T  3.45T     66      0   267K      0
storage     45.3T  3.45T     64      0   257K      0
storage     45.3T  3.45T     60      0   241K      0
storage     45.3T  3.45T     50      0   203K      0
storage     45.3T  3.45T     52     25   209K  1.70M
storage     45.3T  3.45T     54      0   217K      0
storage     45.3T  3.45T     51      0   205K      0
storage     45.3T  3.45T     54      0   216K      0
storage     45.3T  3.45T     55      0   222K      0
storage     45.3T  3.45T     56      0   226K      0
storage     45.3T  3.45T     65      0   264K      0
storage     45.3T  3.45T     71      0   286K      0

The write peaks are from processes syncing data to the pool -- in this state it does not do reads (the data the sync process deals with is already in ARC). Then it goes into writing back to the pool (perhaps DDT metadata):

storage     45.3T  3.45T     17  24.4K  69.6K  97.5M
storage     45.3T  3.45T      0  19.6K      0  78.5M
storage     45.3T  3.45T      0  14.2K      0  56.8M
storage     45.3T  3.45T      0  7.90K      0  31.6M
storage     45.3T  3.45T      0  7.81K      0  32.8M
storage     45.3T  3.45T      0  9.54K      0  38.2M
storage     45.3T  3.45T      0  7.07K      0  28.3M
storage     45.3T  3.45T      0  7.70K      0  30.8M
storage     45.3T  3.45T      0  6.19K      0  24.8M
storage     45.3T  3.45T      0  5.45K      0  21.8M
storage     45.3T  3.45T      0  5.78K      0  24.7M
storage     45.3T  3.45T      0  5.29K      0  21.2M
storage     45.3T  3.45T      0  5.69K      0  22.8M
storage     45.3T  3.45T      0  5.52K      0  22.1M
storage     45.3T  3.45T      0  3.26K      0  13.1M
storage     45.3T  3.45T      0  1.77K      0  7.10M
storage     45.3T  3.45T      0  1.63K      0  8.14M
storage     45.3T  3.45T      0  1.41K      0  5.64M
storage     45.3T  3.45T      0  1.22K      0  4.88M
storage     45.3T  3.45T      0  1.27K      0  5.09M
storage     45.3T  3.45T      0  1.06K      0  4.26M
storage     45.3T  3.45T      0  1.07K      0  4.30M
storage     45.3T  3.45T      0    979      0  3.83M
storage     45.3T  3.45T      0   1002      0  3.91M
storage     45.3T  3.45T      0   1010      0  3.95M
storage     45.3T  3.45T      0    948  2.40K  3.71M
storage     45.3T  3.45T      0    939      0  3.67M
storage     45.3T  3.45T      0   1023      0  7.10M
storage     45.3T  3.45T      0  1.01K  4.80K  4.04M
storage     45.3T  3.45T      0    822      0  3.22M
storage     45.3T  3.45T      0    434      0  1.70M
storage     45.3T  3.45T      0    398  2.40K  1.56M

For quite some time, there are no reads from the pool.
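(The same write phase can also be broken down per raidz2 vdev and per disk from the ZFS side with the verbose form of the same command -- nothing beyond the standard -v flag:

zpool iostat -v storage 10

The gstat output below shows the per-disk picture in more detail.)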
When that happens, gstat (gstat -f 'da[0-9]*$') displays something like this:

dT: 1.001s  w: 1.000s  filter: da[0-9]*$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   24   1338      0      0    0.0   1338  12224   17.8  100.9| da0
   24   6888      0      0    0.0   6888  60720    3.5  100.0| da1
   24   6464      0      0    0.0   6464  71997    3.7  100.0| da2
   24   6117      0      0    0.0   6117  82386    3.9   99.9| da3
   24   6455      0      0    0.0   6455  66822    3.7  100.0| da4
   24   6782      0      0    0.0   6782  69207    3.5  100.0| da5
   24    698      0      0    0.0    698  27533   34.1   99.6| da6
   24    590      0      0    0.0    590  21627   40.9   99.7| da7
   24    561      0      0    0.0    561  21031   42.8  100.2| da8
   24    724      0      0    0.0    724  25583   33.1   99.9| da9
   24    567      0      0    0.0    567  22965   41.4   98.0| da10
   24    566      0      0    0.0    566  21834   42.4   99.9| da11
   24    586      0      0    0.0    586   4899   43.5  100.2| da12
   24    487      0      0    0.0    487   4008   49.3  100.9| da13
   24    628      0      0    0.0    628   5007   38.9  100.2| da14
   24    714      0      0    0.0    714   5706   33.8   99.9| da15
   24    595      0      0    0.0    595   4831   39.8   99.8| da16
   24    485      0      0    0.0    485   3932   49.2  100.1| da17
    0      0      0      0    0.0      0      0    0.0    0.0| da18
    0      0      0      0    0.0      0      0    0.0    0.0| da19
    0      0      0      0    0.0      0      0    0.0    0.0| da20
    0      0      0      0    0.0      0      0    0.0    0.0| ada0
    0      0      0      0    0.0      0      0    0.0    0.0| ada1

(Drives da18 and da19 are spares, da20 is the L2ARC SSD, and ada0 and ada1 are the boot SSDs in a separate zpool.)

Now, here comes the weird part. The gstat display shows intensive writes to all vdevs (da0-da5, da6-da11, da12-da17); then one of the vdevs completes its writes and stops, while the other vdevs continue, and at the end a single vdev keeps writing until, as it seems, the data has been completely written to all vdevs (this can be observed in the zpool iostat output above as the write IOPS decreasing every 10 seconds). Then there is a few-second "do nothing" period and we are back to small reads.

The other observation concerns the first vdev: the 512-byte-sector drives do a lot of I/O quickly, complete first and then sit idle, while da0 continues to write for many more seconds. They consistently show many more IOPS than the other drives for this type of activity -- on streaming writes, all drives behave more or less the same. It is only in this un-dedup scenario that the difference is so pronounced. All the vdevs in the pool have ashift=12, so the theory that ZFS actually issues 512-byte writes to these drives can't be true, can it?

Another worry is the Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, firmware CC24) drive. It seems to be constantly under-performing. Does anyone know whether it really is that different from the ST3000DM001-9YN166 drives? Maybe I should just replace it?

My concern is the bursty and irregular nature of writing to the vdevs. As it is now, a write operation to the pool has to wait for all of the vdev writes to complete, which in this case takes tens of seconds, so a single under-performing drive in a vdev slows down the entire pool. Perhaps ZFS could prioritize vdev usage based on vdev throughput, similar to how it prioritizes writes based on how full each vdev is. Also, what is ZFS doing during the idle periods? Are there some timeouts involved? It is certainly not using any CPU, and the small random I/O is certainly not loading the disks.

Then, I have a 240GB L2ARC and secondarycache=metadata set on the pool, yet the DDT apparently does not want to go there... Is there a way to "force" it to be loaded into L2ARC? Before the last big delete, I had:

# zdb -D storage
DDT-sha256-zap-duplicate: 19907778 entries, size 1603 on disk, 259 in core
DDT-sha256-zap-unique: 30101659 entries, size 1428 on disk, 230 in core

dedup = 1.98, compress = 1.00, copies = 1.03, dedup * compress / copies = 1.92

With time, the in-core values stay more or less the same.
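As a rough back-of-the-envelope, simply multiplying the entry counts by the per-entry sizes zdb reports (only an estimate, and it says nothing about how the DDT is actually cached):

# total DDT footprint at the reported on-disk entry sizes
echo '(19907778*1603 + 30101659*1428) / 2^30' | bc -l
# and at the reported in-core entry sizes
echo '(19907778*259 + 30101659*230) / 2^30' | bc -l

gives roughly 70 GiB at the on-disk size and about 11-12 GiB at the in-core size -- either way comfortably below the 240GB cache device.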
I also discovered that the L2ARC drive apparently is not subject to TRIM for some reason. (TRIM works on the boot drives, but those are connected to the motherboard SATA ports.)

# sysctl kern.cam.da.20
kern.cam.da.20.delete_method: ATA_TRIM
kern.cam.da.20.minimum_cmd_size: 6
kern.cam.da.20.sort_io_queue: 0
kern.cam.da.20.error_inject: 0

# sysctl -a | grep trim
vfs.zfs.vdev.trim_on_init: 1
vfs.zfs.vdev.trim_max_pending: 64
vfs.zfs.vdev.trim_max_bytes: 2147483648
vfs.zfs.trim.enabled: 1
vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
kstat.zfs.misc.zio_trim.bytes: 139489971200
kstat.zfs.misc.zio_trim.success: 628351
kstat.zfs.misc.zio_trim.unsupported: 622819
kstat.zfs.misc.zio_trim.failed: 0

Yet I don't observe any BIO_DELETE activity on this drive with gstat -d. Wasn't TRIM supposed to work on drives attached to an LSI 2008 in 9-stable?

Daniel