From: "Steven Hartland" <killing@multiplay.co.uk>
To: "Daniel Kalchev", "freebsd-fs" <freebsd-fs@freebsd.org>
Subject: Re: ZFS vdev I/O questions
Date: Tue, 16 Jul 2013 12:58:43 +0100

One thing to check with this: add -d to your gstat to see if you're
waiting on delete I/O. I doubt it, but it's worth checking.

Regards

Steve
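P.S. With the filter you use below, that would be something like:

    gstat -d -f 'da[0-9]*$'

which should add d/s, kBps and ms/d columns for BIO_DELETE requests
alongside the usual read and write ones.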
----- Original Message -----
From: "Daniel Kalchev"
To: "freebsd-fs"
Sent: Tuesday, July 16, 2013 12:41 PM
Subject: ZFS vdev I/O questions


> I am observing some "strange" behaviour with I/O spread on ZFS vdevs and
> thought I might ask if someone has observed it too.
>
> The system hardware is a Supermicro X8DTH-6F board with an integrated
> LSI2008 controller, two Xeon E5620 CPUs and 72GB of RAM (6x4 + 6x8 GB
> modules). It runs 9-stable r252690.
>
> It currently has an 18-drive zpool, split into three 6-drive raidz2
> vdevs, plus ZIL and L2ARC on separate SSDs (240GB Intel 520). The ZIL
> consists of two partitions of the boot SSDs (Intel 320), not mirrored.
> The zpool layout is:
>
>   pool: storage
>  state: ONLINE
>   scan: scrub canceled on Thu Jul 11 17:14:50 2013
> config:
>
>         NAME            STATE     READ WRITE CKSUM
>         storage         ONLINE       0     0     0
>           raidz2-0      ONLINE       0     0     0
>             gpt/disk00  ONLINE       0     0     0
>             gpt/disk01  ONLINE       0     0     0
>             gpt/disk02  ONLINE       0     0     0
>             gpt/disk03  ONLINE       0     0     0
>             gpt/disk04  ONLINE       0     0     0
>             gpt/disk05  ONLINE       0     0     0
>           raidz2-1      ONLINE       0     0     0
>             gpt/disk06  ONLINE       0     0     0
>             gpt/disk07  ONLINE       0     0     0
>             gpt/disk08  ONLINE       0     0     0
>             gpt/disk09  ONLINE       0     0     0
>             gpt/disk10  ONLINE       0     0     0
>             gpt/disk11  ONLINE       0     0     0
>           raidz2-2      ONLINE       0     0     0
>             gpt/disk12  ONLINE       0     0     0
>             gpt/disk13  ONLINE       0     0     0
>             gpt/disk14  ONLINE       0     0     0
>             gpt/disk15  ONLINE       0     0     0
>             gpt/disk16  ONLINE       0     0     0
>             gpt/disk17  ONLINE       0     0     0
>         logs
>           ada0p2        ONLINE       0     0     0
>           ada1p2        ONLINE       0     0     0
>         cache
>           da20p2        ONLINE       0     0     0
>
> zdb output:
>
> storage:
>     version: 5000
>     name: 'storage'
>     state: 0
>     txg: 5258772
>     pool_guid: 17094379857311239400
>     hostid: 3505628652
>     hostname: 'a1.register.bg'
>     vdev_children: 5
>     vdev_tree:
>         type: 'root'
>         id: 0
>         guid: 17094379857311239400
>         children[0]:
>             type: 'raidz'
>             id: 0
>             guid: 2748500753748741494
>             nparity: 2
>             metaslab_array: 33
>             metaslab_shift: 37
>             ashift: 12
>             asize: 18003521961984
>             is_log: 0
>             create_txg: 4
>             children[0]:
>                 type: 'disk'
>                 id: 0
>                 guid: 5074824874132816460
>                 path: '/dev/gpt/disk00'
>                 phys_path: '/dev/gpt/disk00'
>                 whole_disk: 1
>                 DTL: 378
>                 create_txg: 4
>             children[1]:
>                 type: 'disk'
>                 id: 1
>                 guid: 14410366944090513563
>                 path: '/dev/gpt/disk01'
>                 phys_path: '/dev/gpt/disk01'
>                 whole_disk: 1
>                 DTL: 53
>                 create_txg: 4
>             children[2]:
>                 type: 'disk'
>                 id: 2
>                 guid: 3526681390841761237
>                 path: '/dev/gpt/disk02'
>                 phys_path: '/dev/gpt/disk02'
>                 whole_disk: 1
>                 DTL: 52
>                 create_txg: 4
>             children[3]:
>                 type: 'disk'
>                 id: 3
>                 guid: 3773850995072323004
>                 path: '/dev/gpt/disk03'
>                 phys_path: '/dev/gpt/disk03'
>                 whole_disk: 1
>                 DTL: 51
>                 create_txg: 4
>             children[4]:
>                 type: 'disk'
>                 id: 4
>                 guid: 16528489666301728411
>                 path: '/dev/gpt/disk04'
>                 phys_path: '/dev/gpt/disk04'
>                 whole_disk: 1
>                 DTL: 50
>                 create_txg: 4
>             children[5]:
>                 type: 'disk'
>                 id: 5
>                 guid: 11222774817699257051
>                 path: '/dev/gpt/disk05'
>                 phys_path: '/dev/gpt/disk05'
>                 whole_disk: 1
>                 DTL: 44147
>                 create_txg: 4
>         children[1]:
>             type: 'raidz'
>             id: 1
>             guid: 614220834244218709
>             nparity: 2
>             metaslab_array: 39
>             metaslab_shift: 37
>             ashift: 12
>             asize: 18003521961984
>             is_log: 0
>             create_txg: 40
>             children[0]:
>                 type: 'disk'
>                 id: 0
>                 guid: 8076478524731550200
>                 path: '/dev/gpt/disk06'
>                 phys_path: '/dev/gpt/disk06'
>                 whole_disk: 1
>                 DTL: 2914
>                 create_txg: 40
>             children[1]:
>                 type: 'disk'
>                 id: 1
>                 guid: 1689851194543981566
>                 path: '/dev/gpt/disk07'
>                 phys_path: '/dev/gpt/disk07'
>                 whole_disk: 1
>                 DTL: 48
>                 create_txg: 40
>             children[2]:
>                 type: 'disk'
>                 id: 2
>                 guid: 9743236178648200269
>                 path: '/dev/gpt/disk08'
>                 phys_path: '/dev/gpt/disk08'
>                 whole_disk: 1
>                 DTL: 47
>                 create_txg: 40
>             children[3]:
>                 type: 'disk'
>                 id: 3
>                 guid: 10157617457760516410
>                 path: '/dev/gpt/disk09'
>                 phys_path: '/dev/gpt/disk09'
>                 whole_disk: 1
>                 DTL: 46
>                 create_txg: 40
>             children[4]:
>                 type: 'disk'
>                 id: 4
>                 guid: 5035981195206926078
>                 path: '/dev/gpt/disk10'
>                 phys_path: '/dev/gpt/disk10'
>                 whole_disk: 1
>                 DTL: 45
>                 create_txg: 40
>             children[5]:
>                 type: 'disk'
>                 id: 5
>                 guid: 4975835521778875251
>                 path: '/dev/gpt/disk11'
>                 phys_path: '/dev/gpt/disk11'
>                 whole_disk: 1
>                 DTL: 44149
>                 create_txg: 40
>         children[2]:
>             type: 'raidz'
>             id: 2
>             guid: 7453512836015019221
>             nparity: 2
>             metaslab_array: 38974
>             metaslab_shift: 37
>             ashift: 12
>             asize: 18003521961984
>             is_log: 0
>             create_txg: 4455560
>             children[0]:
>                 type: 'disk'
>                 id: 0
>                 guid: 11182458869377968267
>                 path: '/dev/gpt/disk12'
>                 phys_path: '/dev/gpt/disk12'
>                 whole_disk: 1
>                 DTL: 45059
>                 create_txg: 4455560
>             children[1]:
>                 type: 'disk'
>                 id: 1
>                 guid: 5844283175515272344
>                 path: '/dev/gpt/disk13'
>                 phys_path: '/dev/gpt/disk13'
>                 whole_disk: 1
>                 DTL: 44145
>                 create_txg: 4455560
>             children[2]:
>                 type: 'disk'
>                 id: 2
>                 guid: 13095364699938843583
>                 path: '/dev/gpt/disk14'
>                 phys_path: '/dev/gpt/disk14'
>                 whole_disk: 1
>                 DTL: 44144
>                 create_txg: 4455560
>             children[3]:
>                 type: 'disk'
>                 id: 3
>                 guid: 5196507898996589388
>                 path: '/dev/gpt/disk15'
>                 phys_path: '/dev/gpt/disk15'
>                 whole_disk: 1
>                 DTL: 44143
>                 create_txg: 4455560
>             children[4]:
>                 type: 'disk'
>                 id: 4
>                 guid: 12809770017318709512
>                 path: '/dev/gpt/disk16'
>                 phys_path: '/dev/gpt/disk16'
>                 whole_disk: 1
>                 DTL: 44142
>                 create_txg: 4455560
>             children[5]:
>                 type: 'disk'
>                 id: 5
>                 guid: 7339883019925920701
>                 path: '/dev/gpt/disk17'
>                 phys_path: '/dev/gpt/disk17'
>                 whole_disk: 1
>                 DTL: 44141
>                 create_txg: 4455560
>         children[3]:
>             type: 'disk'
>             id: 3
>             guid: 18011869864924559827
>             path: '/dev/ada0p2'
>             phys_path: '/dev/ada0p2'
>             whole_disk: 1
>             metaslab_array: 16675
>             metaslab_shift: 26
>             ashift: 12
>             asize: 8585216000
>             is_log: 1
>             DTL: 86787
>             create_txg: 5182360
>         children[4]:
>             type: 'disk'
>             id: 4
>             guid: 1338775535758010670
>             path: '/dev/ada1p2'
>             phys_path: '/dev/ada1p2'
>             whole_disk: 1
>             metaslab_array: 16693
>             metaslab_shift: 26
>             ashift: 12
>             asize: 8585216000
>             is_log: 1
>             DTL: 86788
>             create_txg: 5182377
>     features_for_read:
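>
> (So all five top-level vdevs -- the three raidz2 data vdevs and both
> log devices -- report ashift: 12. A quicker re-check than the full
> dump would be something like: zdb -C storage | grep ashift)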
>
> Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640,
> firmware MKAOA3B0) -- these are 512-byte sector drives, but da0 has
> been replaced by a Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166,
> firmware CC24) -- a 4k sector drive of a newer generation (note the
> relatively 'old' firmware, which can't be upgraded).
> Drives da6-da17 are also Seagate Barracuda 7200.14 (AF), but
> ST3000DM001-9YN166 with firmware CC4H -- the more "normal" part
> number. Some have firmware CC4C, which I am replacing drive by drive
> (but other than the excessive load counts, no other issues so far).
>
> The only ZFS-related tuning is in /etc/sysctl.conf:
>
> # improve ZFS resilver
> vfs.zfs.resilver_delay=0
> vfs.zfs.scrub_delay=0
> vfs.zfs.top_maxinflight=128
> vfs.zfs.resilver_min_time_ms=5000
> vfs.zfs.vdev.max_pending=24
> # L2ARC:
> vfs.zfs.l2arc_norw=0
> vfs.zfs.l2arc_write_max=83886080
> vfs.zfs.l2arc_write_boost=83886080
>
> The pool of course had dedup enabled, with serious dedup ratios --
> over 10x. In general, with the ZIL and L2ARC, the only trouble I have
> seen with dedup is when deleting lots of data... which this server has
> seen plenty of. During this experiment, I have moved most of the data
> to another server and am un-dedup'ing the last remaining TBs.
>
> While doing zfs destroy on a 2-3TB dataset, I observe some very
> annoying behaviour.
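> (The destroy itself proceeds in the background, since this is a
> version-5000 pool with async_destroy; presumably the remaining backlog
> could be watched with: zpool get freeing storage)
>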
> The pool would stay mostly idle, accepting almost no I/O and doing
> small random reads, like this:
>
> $ zpool iostat storage 10
> storage     45.3T  3.45T    466      0  1.82M      0
> storage     45.3T  3.45T     50      0   203K      0
> storage     45.3T  3.45T     45     25   183K  1.70M
> storage     45.3T  3.45T     49      0   199K      0
> storage     45.3T  3.45T     50      0   202K      0
> storage     45.3T  3.45T     51      0   204K      0
> storage     45.3T  3.45T     57      0   230K      0
> storage     45.3T  3.45T     65      0   260K      0
> storage     45.3T  3.45T     68     25   274K  1.70M
> storage     45.3T  3.45T     65      0   260K      0
> storage     45.3T  3.45T     64      0   260K      0
> storage     45.3T  3.45T     67      0   272K      0
> storage     45.3T  3.45T     66      0   266K      0
> storage     45.3T  3.45T     64      0   258K      0
> storage     45.3T  3.45T     62     25   250K  1.70M
> storage     45.3T  3.45T     57      0   231K      0
> storage     45.3T  3.45T     58      0   235K      0
> storage     45.3T  3.45T     66      0   267K      0
> storage     45.3T  3.45T     64      0   257K      0
> storage     45.3T  3.45T     60      0   241K      0
> storage     45.3T  3.45T     50      0   203K      0
> storage     45.3T  3.45T     52     25   209K  1.70M
> storage     45.3T  3.45T     54      0   217K      0
> storage     45.3T  3.45T     51      0   205K      0
> storage     45.3T  3.45T     54      0   216K      0
> storage     45.3T  3.45T     55      0   222K      0
> storage     45.3T  3.45T     56      0   226K      0
> storage     45.3T  3.45T     65      0   264K      0
> storage     45.3T  3.45T     71      0   286K      0
>
> The write peaks are from processes syncing data to the pool -- in this
> state it does no reads (the data the sync process deals with is
> already in ARC). Then it goes into writing back to the pool (perhaps
> DDT metadata):
>
> storage     45.3T  3.45T     17  24.4K  69.6K  97.5M
> storage     45.3T  3.45T      0  19.6K      0  78.5M
> storage     45.3T  3.45T      0  14.2K      0  56.8M
> storage     45.3T  3.45T      0  7.90K      0  31.6M
> storage     45.3T  3.45T      0  7.81K      0  32.8M
> storage     45.3T  3.45T      0  9.54K      0  38.2M
> storage     45.3T  3.45T      0  7.07K      0  28.3M
> storage     45.3T  3.45T      0  7.70K      0  30.8M
> storage     45.3T  3.45T      0  6.19K      0  24.8M
> storage     45.3T  3.45T      0  5.45K      0  21.8M
> storage     45.3T  3.45T      0  5.78K      0  24.7M
> storage     45.3T  3.45T      0  5.29K      0  21.2M
> storage     45.3T  3.45T      0  5.69K      0  22.8M
> storage     45.3T  3.45T      0  5.52K      0  22.1M
> storage     45.3T  3.45T      0  3.26K      0  13.1M
> storage     45.3T  3.45T      0  1.77K      0  7.10M
> storage     45.3T  3.45T      0  1.63K      0  8.14M
> storage     45.3T  3.45T      0  1.41K      0  5.64M
> storage     45.3T  3.45T      0  1.22K      0  4.88M
> storage     45.3T  3.45T      0  1.27K      0  5.09M
> storage     45.3T  3.45T      0  1.06K      0  4.26M
> storage     45.3T  3.45T      0  1.07K      0  4.30M
> storage     45.3T  3.45T      0    979      0  3.83M
> storage     45.3T  3.45T      0   1002      0  3.91M
> storage     45.3T  3.45T      0   1010      0  3.95M
> storage     45.3T  3.45T      0    948  2.40K  3.71M
> storage     45.3T  3.45T      0    939      0  3.67M
> storage     45.3T  3.45T      0   1023      0  7.10M
> storage     45.3T  3.45T      0  1.01K  4.80K  4.04M
> storage     45.3T  3.45T      0    822      0  3.22M
> storage     45.3T  3.45T      0    434      0  1.70M
> storage     45.3T  3.45T      0    398  2.40K  1.56M
>
> For quite some time, there are no reads from the pool.
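> (A per-vdev breakdown of the same counters is available with
> zpool iostat -v storage 10, which makes it easier to see which raidz2
> vdev is still absorbing writes at any given moment.)
>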
> When that happens, gstat (gstat -f 'da[0-9]*$') displays something
> like this:
>
> dT: 1.001s  w: 1.000s  filter: da[0-9]*$
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>    24   1338      0      0    0.0   1338  12224   17.8  100.9| da0
>    24   6888      0      0    0.0   6888  60720    3.5  100.0| da1
>    24   6464      0      0    0.0   6464  71997    3.7  100.0| da2
>    24   6117      0      0    0.0   6117  82386    3.9   99.9| da3
>    24   6455      0      0    0.0   6455  66822    3.7  100.0| da4
>    24   6782      0      0    0.0   6782  69207    3.5  100.0| da5
>    24    698      0      0    0.0    698  27533   34.1   99.6| da6
>    24    590      0      0    0.0    590  21627   40.9   99.7| da7
>    24    561      0      0    0.0    561  21031   42.8  100.2| da8
>    24    724      0      0    0.0    724  25583   33.1   99.9| da9
>    24    567      0      0    0.0    567  22965   41.4   98.0| da10
>    24    566      0      0    0.0    566  21834   42.4   99.9| da11
>    24    586      0      0    0.0    586   4899   43.5  100.2| da12
>    24    487      0      0    0.0    487   4008   49.3  100.9| da13
>    24    628      0      0    0.0    628   5007   38.9  100.2| da14
>    24    714      0      0    0.0    714   5706   33.8   99.9| da15
>    24    595      0      0    0.0    595   4831   39.8   99.8| da16
>    24    485      0      0    0.0    485   3932   49.2  100.1| da17
>     0      0      0      0    0.0      0      0    0.0    0.0| da18
>     0      0      0      0    0.0      0      0    0.0    0.0| da19
>     0      0      0      0    0.0      0      0    0.0    0.0| da20
>     0      0      0      0    0.0      0      0    0.0    0.0| ada0
>     0      0      0      0    0.0      0      0    0.0    0.0| ada1
>
> (Drives da18 and da19 are spares, da20 is the L2ARC SSD, and ada0 and
> ada1 are the boot SSDs in a separate zpool.)
>
> Now here comes the weird part. The gstat display shows intensive
> writes to all vdevs (da0-da5, da6-da11, da12-da17); then one of the
> vdevs completes its writes and stops, while the other vdevs continue.
> At the end only one vdev is still writing until, as it seems, the data
> is completely written to all vdevs (this can be observed in the zpool
> iostat output above as the write IOPS decrease every 10 seconds). Then
> there is a few-second "do nothing" period, and we are back to the
> small reads.
>
> The other observation concerns the first vdev: the 512b drives do a
> lot of I/O fast, complete first and then sit idle, while da0 continues
> to write for many more seconds. They consistently show many more IOPS
> than the other drives for this type of activity -- on streaming writes
> all drives behave more or less the same. It is only in this un-dedup
> scenario that the difference is so pronounced.
>
> All the vdevs in the pool have ashift=12, so the theory that ZFS
> actually issues 512b writes to these drives can't be true, can it?
>
> Another worry is the Seagate Barracuda 7200.14 (AF)
> (ST3000DM001-1CH166, firmware CC24) drive. It seems constantly
> under-performing. Does anyone know if it is so different from the
> ST3000DM001-9YN166 drives? Maybe I should just replace it?
>
> My concern is the bursty and irregular nature of writing to the
> vdevs. As it is now, a write operation to the pool needs to wait for
> all of the vdev writes to complete, which in this case takes tens of
> seconds. A single under-performing drive in a vdev will slow down the
> entire pool. Perhaps ZFS could prioritize vdev usage based on vdev
> throughput, similar to how it already biases allocation by how full
> each vdev is.
>
> Also, what is ZFS doing during the idle periods? Are there some
> timeouts involved? It is certainly not using any CPU, and the small
> random I/O is certainly not loading the disks.
>
> Then, I have a 240GB L2ARC and secondarycache=metadata for the pool.
> Yet the DDT apparently does not want to go there... Is there a way to
> "force" it to be loaded into L2ARC? Before the last big delete, I had:
>
> zdb -D storage
> DDT-sha256-zap-duplicate: 19907778 entries, size 1603 on disk, 259 in core
> DDT-sha256-zap-unique: 30101659 entries, size 1428 on disk, 230 in core
>
> dedup = 1.98, compress = 1.00, copies = 1.03, dedup * compress / copies = 1.92
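>
> (Rough arithmetic on those numbers: ~50M entries at 230-259 bytes each
> is about 12GB of RAM to hold the complete DDT in core, and at
> 1428-1603 bytes each roughly 75GB on disk -- so the on-disk table
> would at least fit in the 240GB L2ARC.)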
>
> With time, the in-core values stay more or less the same.
>
> I also discovered that the L2ARC drive apparently is not subject to
> TRIM for some reason. TRIM works on the boot drives (but these are
> connected to the motherboard SATA ports).
>
> # sysctl kern.cam.da.20
> kern.cam.da.20.delete_method: ATA_TRIM
> kern.cam.da.20.minimum_cmd_size: 6
> kern.cam.da.20.sort_io_queue: 0
> kern.cam.da.20.error_inject: 0
>
> # sysctl -a | grep trim
> vfs.zfs.vdev.trim_on_init: 1
> vfs.zfs.vdev.trim_max_pending: 64
> vfs.zfs.vdev.trim_max_bytes: 2147483648
> vfs.zfs.trim.enabled: 1
> vfs.zfs.trim.max_interval: 1
> vfs.zfs.trim.timeout: 30
> vfs.zfs.trim.txg_delay: 32
> kstat.zfs.misc.zio_trim.bytes: 139489971200
> kstat.zfs.misc.zio_trim.success: 628351
> kstat.zfs.misc.zio_trim.unsupported: 622819
> kstat.zfs.misc.zio_trim.failed: 0
>
> Yet I don't observe any BIO_DELETE activity on this drive with
> gstat -d.
>
> Wasn't TRIM supposed to work on drives attached to the LSI2008 in
> 9-stable?
>
> Daniel