Date: Tue, 29 Jan 2013 19:00:01 GMT
Message-Id: <201301291900.r0TJ01Vt093309@freefall.freebsd.org>
To: freebsd-fs@FreeBSD.org
From: Jeremy Chadwick
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O

The following reply was made to PR kern/169480; it has been noted by GNATS.

From: Jeremy Chadwick
To: Harry Coin
Cc: bug-followup@FreeBSD.org, levent.serinol@mynet.com
Subject: Re: kern/169480: [zfs] ZFS stalls on heavy I/O
Date: Tue, 29 Jan 2013 10:50:28 -0800

Re 1,2: That transfer speed (183MBytes/second) sounds much more reasonable
for what's going on.  The speed-limiting factors were certainly the small
block size (512 bytes) used by dd, and the use of /dev/random rather than
/dev/zero.

I realise you're probably expecting to see something like 480MBytes/second
(4 drives * 120MB/sec), but that's probably not going to happen on that
model of system and with that CPU.  For example, on my Q9550 system
described earlier, I can get about this:

$ dd if=/dev/zero of=testfile bs=64k
^C27148+0 records in
27147+0 records out
1779105792 bytes transferred in 6.935566 secs (256519186 bytes/sec)

Meanwhile, "gstat -I500ms" shows each disk going between 60MBytes/sec and
140MBytes/sec, and "zpool iostat -v data 1" shows between 120-220MBytes/sec
at the pool level and around 65-110MBytes/sec per disk.

Anyway, point being: things are faster with a large bs and with a source
that doesn't churn interrupts.  But don't necessarily "pull a Linux" and
start doing things like bs=1m -- as I said before, Linux dd is different,
because its I/O is cached (without --direct), while on FreeBSD dd is
always direct.

Re 3: That sounds a bit on the slow side; I would expect those disks, at
least during writes, to do more.  However, if **all** the drives show this
behaviour consistently in gstat, then you know the issue IS NOT with an
individual disk and instead lies elsewhere.  That rules out one piece of
the puzzle, and that's good.

Re 5: Did you mean to type 14MBytes/second, not 14mbits/second?  If so,
yes, I would agree that's slow.  Scrubbing is not necessarily a good way
to "benchmark" disks, but I understand that for "benchmarking" ZFS it's
the best you've got, to some degree.

Regarding dd'ing and 512 bytes -- as I described to you in my previous
mail:

> This speed will be "bursty" and "sporadic" due to how the ZFS ARC
> works.  The interval at which "things are flushed to disk" is based on
> the vfs.zfs.txg.timeout sysctl, which on FreeBSD 9.1-RELEASE should
> default to 5 (5 seconds).

This is where your "4 secs or so" magic value comes from.  Please do not
change this sysctl/value; keep it at 5.
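If you want to double-check what value the kernel is actually using -- I
honestly don't know whether nas4free overrides it somewhere like
/boot/loader.conf -- a quick sysctl query will tell you.  On a stock
9.1-RELEASE box I'd expect to see the default of 5:

$ sysctl vfs.zfs.txg.timeout
vfs.zfs.txg.timeout: 5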
Finally, your vmstat -i output shows something of concern, UNLESS you ran
it WHILE you had the dd going (it doesn't matter what block size) and were
reading from /dev/random or /dev/urandom (the same thing on FreeBSD):

> irq20: hpet0                        620136        328
> irq259: ahci1                       849746        450

These interrupt rates are quite high.  hpet0 refers to your event
timer/clock timer (see kern.eventtimer.choice and kern.eventtimer.timer)
being HPET, and ahci1 refers to your Intel ICH7 AHCI controller.

Basically, what's happening here is that you're generating a ton of
interrupts doing dd if=/dev/urandom bs=512, and it makes perfect sense to
me why: /dev/urandom has to harvest entropy from interrupt sources (please
see the random(4) man page), and you're generating a lot of interrupts on
your AHCI controller for each individual 512-byte write.

When you say "move a video from one dataset to another", please explain
what it is you're moving from and to.  Specifically: what filesystems, and
output from "zfs list".  If you're moving a file from one ZFS filesystem
to another ZFS filesystem on the same pool, then please state that.  That
may help kernel folks figure out where your issue lies.

At this stage, a kernel developer is going to need to step in and try to
help you figure out where the actual bottleneck is occurring.  This is
going to be very difficult/complex -- very likely not possible -- with you
using nas4free, because you will almost certainly be asked to rebuild
world/kernel with some new options, and possibly asked to include
DTrace/CTF support (for real-time debugging).

The situation is tricky.  It would really help if you would/could remove
nas4free from the picture and instead just run stock FreeBSD, because, as
I said, if there are kernel tunings or adjusted values the nas4free folks
put in place that stock FreeBSD doesn't use, those could be harming you.

I can't be of more help here, I'm sorry to say.  The good news is that
your disks sound fine.  Kernel developers will need to take this up.

P.S. -- I would strongly recommend updating your nas4free forum post with
a link to the conversation in this PR.  IMO, the nas4free people need to
step up and take responsibility (and that almost certainly means
talking/working with the FreeBSD folks).

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |