Date:        Thu, 4 Jul 2013 03:32:27 -0700
From:        Jeremy Chadwick <jdc@koitsu.org>
To:          Steven Hartland <killing@multiplay.co.uk>
Cc:          freebsd-fs <freebsd-fs@freebsd.org>
Subject:     Re: EBS snapshot backups from a FreeBSD zfs file system: zpool freeze?
Message-ID:  <20130704103227.GA84901@icarus.home.lan>
In-Reply-To: <98367DD8A8E34B75906FD46655AD74CF@multiplay.co.uk>
References:  <87mwq34emp.wl%berend@pobox.com> <20130703200241.GB60515@in-addr.com> <87k3l748gb.wl%berend@pobox.com> <20130703233631.GA74698@icarus.home.lan> <87d2qz42q4.wl%berend@pobox.com> <20130704010815.GB75529@icarus.home.lan> <8761wr3xxk.wl%berend@pobox.com> <36872A46E9BE40688B8F59FD05D4ECE9@multiplay.co.uk> <20130704082209.GB83766@icarus.home.lan> <98367DD8A8E34B75906FD46655AD74CF@multiplay.co.uk>
On Thu, Jul 04, 2013 at 09:47:26AM +0100, Steven Hartland wrote:
> ----- Original Message -----
> From: "Jeremy Chadwick" <jdc@koitsu.org>
> To: "Steven Hartland" <killing@multiplay.co.uk>
> Cc: "Berend de Boer" <berend@pobox.com>; "freebsd-fs" <freebsd-fs@freebsd.org>
> Sent: Thursday, July 04, 2013 9:22 AM
> Subject: Re: EBS snapshot backups from a FreeBSD zfs file system: zpool freeze?
>
> >On Thu, Jul 04, 2013 at 09:06:57AM +0100, Steven Hartland wrote:
> >>----- Original Message ----- From: "Berend de Boer"
> >><berend@pobox.com>
> >>Jeremy> Also, because nobody seems to warn others of this: if
> >>Jeremy> you go the ZFS route on FreeBSD, please do not use
> >>Jeremy> features like dedup or compression.
> >>
> >>While dedup is memory and sometimes cpu hungry, so HW spec
> >>should be considered before using it, compression is not so,
> >>and I've not seen any valid reason not to use it should it
> >>fit your uses.
> >>
> >>We actually use compression extensively here and we've
> >>had nothing but positive results from it, so this sounds like
> >>FUD to me.
> >
> >The problem with the lack of separate and prioritised write threads for
> >dedup and compression, thus causing interactivity stalls, is not FUD,
> >it's fact. I explained this in the part of my reply to Berend which you
> >omitted, which included the proof and acknowledgement from folks who
> >are in-the-know (Bob Friesenhahn). :/ Nobody has told me "yeah that
> >got fixed", so there is no reason for me to believe anything has
> >changed.
>
> Do you have any links to the discussion on this, Jeremy, as I'd be
> interested to read up on it when I have some spare time?

Warning up front: sorry for the long mail (I did try to keep it terse),
but most of it is demonstrating the problem.

Useful FreeBSD links -- specifically the most useful of the
conversations I've had over the years about this problem. The first one
is probably the most relevant, since it's a statement from Bob himself
explaining it:

http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html
http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072171.html
http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072178.html

To be clear (note the date and version): as of September 2011 I was able
to reproduce the problem on stable/8.

While you were writing your mail, I was off actually trying to find the
technical details (specifically the source code changes in OpenSolaris
or later) of what fixed it / what Bob alluded to. I really had to jab at
search engines to find anything useful, and wasn't getting anywhere
until I found this:

http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/28192

This mentioned OpenSolaris bug number 6586537. I then poked about
svnweb and found that this fix was imported into FreeBSD with the "ZFS
version 15" import. Commit log entry:

  6586537 async zio taskqs can block out userland commands (142901-09)

Relevant revisions, dates, and branches for this:

r209962: Jul 2010: head:     http://svnweb.freebsd.org/base?view=revision&revision=209962
r212668: Sep 2010: stable/8: http://svnweb.freebsd.org/base?view=revision&revision=212668

And that head became stable/9 as of September 2011, I believe. So my
testing as of September 2011 would have included the fix for 6586537.
This makes me wonder whether 6586537 is truly the issue I've been
describing or not.
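(Side note: if anyone wants to sanity-check their own box for this, the
relevant versions are easy to query. A quick sketch -- the pool names
here are mine, obviously; and since 6586537 is a code change rather
than an on-disk format change, the kernel's supported version is what
matters, not the pool's:)

# SPA version supported by the running kernel; >= 15 means the
# "ZFS version 15" import (which carried 6586537) is present:
sysctl vfs.zfs.version.spa

# On-disk version of each pool, for completeness (pools can lag
# behind the kernel until "zpool upgrade" is run):
zpool get version backups data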
It's easy enough to test for on stable/9 today (zfs create, zfs set
compression=on, do the dd and in another window do stuff and see what
happens, then later zfs destroy). So let's see if it's still there
almost 2 years later...

Yup, still there, but it seems improved in some way, possibly due to a
combination of things. This box is actually a C2Q (more powerful than
the one in Sep 2011) too, and is actively doing nothing. Relevant bits:

# zpool list
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
backups  1.81T   463G  1.36T    24%  1.00x  ONLINE  -
data     2.72T   694G  2.04T    24%  1.00x  ONLINE  -

# zdb -C | grep ashift
            ashift: 12
            ashift: 12

# zfs create -o compression=lzjb -o mountpoint=/mnt backups/comptest
# zfs get all backups/comptest | grep compression
backups/comptest  compression  lzjb  local

The "backups" pool is a single disk (WD20EFRX) running at SATA300 with
NCQ, backed by an Intel ICH9 in AHCI mode. The disk is a 4K sector
drive where the gnop trick was used (proof above). I could have used
the "data" pool (raidz1 driven by 3 disks (WD10EFRX) + gnop), but it
wouldn't matter -- the problem is consistent no matter what the pool.

I can't demonstrate the problem using "while : ; do date ; sleep 1 ; done"
because sleep 1 isn't granular enough (yes, I'm aware FreeBSD sleep(1)
supports more granularity) and because date/strftime doesn't show
microseconds. So off into perl + Time::HiRes we go...

window1# date +%s ; dd if=/dev/zero of=/mnt/bigfile bs=64k
1372932367
^C123977+0 records in
123977+0 records out
8124956672 bytes transferred in 16.437748 secs (494286486 bytes/sec)

window2# perl -e 'use Time::HiRes qw(time sleep); $|=1; while(1) { print time, "\n"; sleep(0.2); }'

Now because even 0.2 seconds probably isn't granular enough, I ended up
pressing Enter in the middle of the running perl output every time I'd
notice that lines weren't coming across at consistent 0.2-second
intervals (I guess I have a good eye for this sort of thing).
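(In hindsight, the eyeballing could be automated. Something like this
variant -- an untested sketch, with 0.25s as an arbitrary "notably
late" threshold -- would print the gap between lines and flag the
stalls itself. The output below is from the plain one-liner above,
though:)

window2# perl -e '
  use Time::HiRes qw(time sleep);
  $| = 1;                        # unbuffered stdout
  my $prev = time;
  while (1) {
      my $now = time;
      my $gap = $now - $prev;    # elapsed since the previous line
      printf("%.5f  +%.3f%s\n", $now, $gap,
             $gap > 0.25 ? "  <-- stall" : "");
      $prev = $now;
      sleep(0.2);
  }'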
So, in the output below, blank lines are me noticing the pauses/delays
I've been talking about:

1372932411.90407
1372932412.10415
1372932412.30513
1372932412.50614
1372932412.70713
1372932412.90813
1372932413.10913
1372932413.31013
1372932413.51112
1372932413.71213
1372932413.91315
1372932414.11413
1372932414.31513
1372932414.51615
1372932414.71714

1372932415.00015
1372932415.27278
1372932415.47316
1372932415.67416
1372932415.87514
1372932416.07615
1372932416.27715
1372932416.48115

1372932416.78215
1372932416.98614
1372932417.18717
1372932417.38814
1372932417.58912
1372932417.79016
1372932417.99115

1372932418.40577
1372932418.60617
1372932418.80715
1372932419.00813
1372932419.20913
1372932419.41013
1372932419.64116
1372932419.85516

1372932420.11614
1372932420.31716
1372932420.51813
1372932420.71913
1372932420.92016
1372932421.12115
1372932421.32216

1372932421.58213
1372932421.78316
1372932421.98416
1372932422.18515
1372932422.38613
1372932422.58713
1372932422.80118

1372932423.05617
1372932423.34016
1372932423.54116
1372932423.74215
1372932423.94314
1372932424.14415

1372932424.43316
1372932424.63417
1372932424.85514
1372932425.05613
1372932425.25715
1372932425.45813
1372932425.65913
1372932425.86017

1372932426.18416
1372932426.51216
1372932426.71312
1372932426.91413
1372932427.11515
1372932427.31613

1372932427.74915
1372932428.00214
1372932428.20315
1372932428.40415
1372932428.60514
1372932428.80613
1372932429.00713

1372932429.38115
1372932429.58214
1372932429.78316
1372932429.98417
1372932430.18519
1372932430.38614
1372932430.58713

1372932430.92817
1372932431.12914
1372932431.33012
1372932431.53115
1372932431.73214
1372932431.93313
1372932432.13413

1372932432.48115
1372932432.73414
1372932432.93514
1372932433.13616
1372932433.33713
1372932433.53817
1372932433.73915
1372932433.95151

1372932434.28214
1372932434.48316
1372932434.68414
1372932434.88515
1372932435.08614
1372932435.28712
1372932435.48916

1372932435.84146
1372932436.05013
1372932436.25117
^C

There's quite a consistent pattern if you look closely: about every 8
lines of output. Each line = 0.2 seconds, so about every 1.5 seconds is
where I'd see a pause, which would last for about 0.5 seconds. And no,
the above output *was not* being written to a file on ZFS, only to
stdout. :-)

What's interesting: I tried compression=gzip-9, which historically was
worse (I remember this clearly), but the stalls are about the same.
Maybe it's because I'm using /dev/zero rather than /dev/random, but the
issue there is that /dev/random would tax the CPU (entropy, etc.) more.

We didn't use compression at my previous job on Solaris (available CPU
time was very, very important given what the machines did), so I don't
have any context for comparison. But: I can do this exact same
procedure on the /backups filesystem/pool, without compression of
course, and there are no stalls -- just smooth interactivity.

Now let me circle back to the convo I had with Fabian earlier in
2013...

I have zero experience doing this "sched trace" stuff. I do not speak
Python, but looking at /usr/src/tools/sched/schedgraph.py almost
implies it has some kind of "visual graphing" (via X? I have no clue
from the code) and "borders" and "colour" support -- this is not an X
system, so unless this Python script generates image files somehow (I
have no image libraries installed on my system)...
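(For what it's worth, skimming the comments at the top of the script,
the recipe seems to boil down to roughly the following -- a sketch from
my reading of it, not something I've actually run:)

# 1) Build a kernel with KTR and the scheduler event mask compiled in
#    (the options listed just below; mine already has them).
# 2) Start the workload and, while it's still running, freeze tracing
#    so the KTR ring buffer stops cycling:
sysctl debug.ktr.mask=0
# 3) Dump the trace with CPU numbers and timestamps:
ktrdump -ct > ktr.out
# 4) Feed it to the script (the part that presumably wants X):
python schedgraph.py ktr.out
# 5) Presumably one then re-enables tracing; the script doesn't say,
#    but this would restore KTR_SCHED (0x20000000 == 536870912):
sysctl debug.ktr.mask=536870912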
My kernel does contain:

options         KTR
options         KTR_ENTRIES=262144
options         KTR_COMPILE=(KTR_SCHED)
options         KTR_MASK=(KTR_SCHED)

And I can follow the instructions at the top of the Python script and
provide the ktrdump somewhere if needed, but that's about it. I don't
know if that would help or be beneficial in any way -- because even
though I have some familiarity with userland profiling via *_p.a libs,
this is something at a completely different level. So if someone wants
this, I need a bit of hand-holding to know what all I'm supposed to be
doing. The instructions in the Python script make me a little wary,
particularly since they don't say to re-set debug.ktr.mask to 536870912
afterward, so I'm not sure what the implications are.

> >If a person considering use of compression on FreeBSD ZFS doesn't mind
> >that problem, then by all means use it. It doesn't change the fact that
> >there's an issue, and one that folks should be made aware of up front.
> >It's not spreading FUD: it's spreading knowledge of a certain behaviour
> >that differs between FreeBSD and Solaris/Illumos. The issue is a
> >deal-breaker for me; if it's not for you, great.
>
> Sounds like it could well be use-case based then, as we've not had
> compression causing any interactivity problems. Quite the opposite, in
> fact: the reduced physical IO that compression results in has improved
> interactivity.
>
> So I guess it's like everything, one size doesn't fit all, so tempering
> statements about blanket avoidance of these features seems like the
> way to go :)

While I see the logic in what you're saying, I prefer to publicly
disclose the differences in behaviour between Illumos ZFS and FreeBSD
ZFS. I'm well aware of the tremendous and positive effort to minimise
those differences (code-wise) -- I remember mm@ talking about this some
time ago -- but if this is somehow one of them, I do not see the harm
in telling people "FYI, there is this quirk/behavioural aspect specific
to FreeBSD that you should be aware of". It doesn't mean ZFS on FreeBSD
sucks, it doesn't mean it's broken; it just means it's something that
would completely surprise someone out of the blue. Imagine the thread:
"my system intermittently stalls, even at the VGA console... does
anyone know what's causing this?" -- I doubt anyone would think to
check ZFS.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |