Date: Mon, 8 Jul 2013 17:05:08 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Freddie Cash <fjwcash@gmail.com>
Cc: freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: EBS snapshot backups from a FreeBSD zfs file system: zpool freeze?
Message-ID: <20130709000508.GA92194@icarus.home.lan>
In-Reply-To: <CAOjFWZ4LkRA14Q6X7j-CL2BjhKmQnijH-FeMvxdKZKiaH3oarQ@mail.gmail.com>
References: <CADBaqmihCB5JP01hLwXTWHoZiJJ5-jkT-Ro=oDwOcKZT_zvEKA@mail.gmail.com>
 <A5A66641-5EF9-454E-A767-009480EE404E@dragondata.com>
 <14A2336A-969C-4A13-9EFA-C0C42A12039F@hostpoint.ch>
 <87zjty11gn.wl%berend@pobox.com>
 <41CC5720-B1EA-4841-8BA5-893F4A628EAD@hostpoint.ch>
 <877gh024vy.wl%berend@pobox.com>
 <20130708210145.GA89605@icarus.home.lan>
 <CAOjFWZ4CcGP-5axPewCA0hhqxoFuQ1E9zvZyqGWPbWsW1d5jOw@mail.gmail.com>
 <87vc4kznsa.wl%berend@pobox.com>
 <CAOjFWZ4LkRA14Q6X7j-CL2BjhKmQnijH-FeMvxdKZKiaH3oarQ@mail.gmail.com>
On Mon, Jul 08, 2013 at 03:37:46PM -0700, Freddie Cash wrote:
> On Mon, Jul 8, 2013 at 3:31 PM, Berend de Boer <berend@pobox.com> wrote:
>
> > >>>>> "Freddie" == Freddie Cash <fjwcash@gmail.com> writes:
> >
> >     Freddie> At which point, it would make more sense taking the
> >     Freddie> discussion upstream to Illumos to find a way to quiesce a
> >     Freddie> ZFS pool in such a way that EBS backups would work.  Once
> >     Freddie> that is done, then it can filter downstream to FreeBSD,
> >     Freddie> Linux, and others.
> >
> > Great tip. Didn't know exactly if the ZFS implementation in FreeBSD
> > was forked or not. I see on their home page about submitting patches
> > :-)
>
> The FreeBSD implementation of ZFS isn't 100% identical to the Illumos
> (aka "reference") implementation, mainly due to GEOM; however, the
> FreeBSD ZFS maintainers try to keep it at feature parity with Illumos
> (and even push patches upstream that get added to Illumos).
>
> Same with the Linux implementation of ZFS, although there are more
> changes made to that one to shoehorn it into that wonderful mess they
> call "a storage stack". :)  There are a handful of features available
> in the ZFS-on-Linux implementation that aren't anywhere else (like
> "-o ashift=" for zpool create/add).
>
> All in all, the ZFS-using OS projects try to stay as close to the
> Illumos version as is reasonable for the OS.
>
> It certainly would be interesting to have a "zfs freeze" and/or a
> "zpool freeze" (depending on where you want to quiesce things), but it
> may not play into how ZFS works (wanting to have complete control over
> the block devices, meaning no special magic underneath like
> block-level snapshots). :)  Or, it may be the "next great feature" of
> ZFS. :)

Well, back to his original statement, quoting:

> On Linux' file systems I can freeze a file system, start the backup of
> all disks, and unfreeze. This freeze usually only takes 100ms or so.

I interpret this statement to mean that, on Linux:

1. Some command is issued at the filesystem level that causes all I/O
   operations (read and write) directed to/from that filesystem to
   block (wait) indefinitely, and that causes all pending queued writes
   to be flushed to disk (on FreeBSD we would call this BIO_FLUSH).

2. Some other command is issued, at the Amazon EBS level, that takes an
   EBS snapshot (similar to a filesystem snapshot, but at the actual
   storage level). This might be done via a web page or via CLI
   commands on the same Linux box -- though I don't know how the latter
   would work unless the CLI tools live on a completely separate
   filesystem. Possibly, if this is a Linux command, there's an actual
   device driver that sits between the storage layer and EBS which can
   effectively "halt" or "control" things in some manner (would not
   surprise me! VMs often offer this) -- I'll call this a "shim".

3. Some command is issued at the filesystem level that releases the
   block/wait, after which all future I/O requests go through.

What this means is that "block-level snapshots" are what would be
necessary -- the key here is that pending writes (scheduled to be
written to the disk) need to be flushed, and that all other I/O blocks.
I do not think something like CACHE FLUSH EXT (i.e. the ATA command
used to actually flush disk-level cache to the platters) matters --
whether the data is "in its cache" or not has no bearing on EBS; it
should know what to do in either case. All of this would be driven by
whatever EBS requires/mandates.

On FreeBSD we don't have the Linux equivalent of #1/#3. The layer where
this would ideally be done is GEOM (ex. a "gfreeze" command that blocks
all I/O and also issues BIO_FLUSH to ensure things have been written).
Due to the split between GEOM and filesystems (unrelated things per
se), one would have to issue "gfreeze" on the disks that make up the
filesystem, followed by doing the EBS backup/snapshot, followed by
"gfreeze -u" on all the disks.
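To make the three-step sequence concrete, here's a toy model in Python. Nothing here is a real kernel or EBS API -- ToyDevice, ebs_snapshot, and friends are invented names purely for illustration (on an actual Linux box, step 1/3 would be fsfreeze(8), and step 2 an EBS snapshot call):

```python
import threading

class ToyDevice:
    """Toy block device: buffered writes, durable 'disk' state, a freeze gate."""

    def __init__(self):
        self.disk = {}                     # what a block-level (EBS) snapshot sees
        self.pending = []                  # queued writes, not yet flushed
        self._thawed = threading.Event()   # gate for incoming I/O
        self._thawed.set()

    def write(self, block, data):
        self._thawed.wait()                # blocks (waits) while frozen -- step 1
        self.pending.append((block, data))

    def flush(self):                       # the BIO_FLUSH analogue
        for block, data in self.pending:
            self.disk[block] = data
        self.pending.clear()

    def freeze(self):                      # step 1: stop new I/O, flush the queue
        self._thawed.clear()
        self.flush()

    def thaw(self):                        # step 3: release blocked/waiting I/O
        self._thawed.set()

def ebs_snapshot(dev):                     # step 2: sees only durable state
    return dict(dev.disk)

dev = ToyDevice()
dev.write(0, b"superblock")
dev.write(1, b"journal")
dev.freeze()
snap = ebs_snapshot(dev)                   # consistent: queue was flushed first
dev.thaw()
```

The point of the toy is the ordering: because freeze() flushes the queue *before* the snapshot is taken, and the gate holds back any new write until thaw(), the snapshot can never contain a half-written state.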
Wishful thinking, and very idealistic, but that's my take on it.

I have no idea how you'd issue this command to select disks without
there being some risk; i.e. with a 5-disk raidz1, you'd issue that
command 5 times (even if it's a single command, the kernel still has to
iterate over the 5 disks linearly), which means there's a chance the
filesystem could have successfully written parts of something to only
some of those 5 disks. Upon an EBS snapshot restore, the filesystem
would then be inconsistent (ZFS reporting checksum failures, for
example).

I have no idea how such a thing could be accomplished at the filesystem
level (ex. zfs, not zpool), because again BIO_FLUSH is what's needed,
and that operates at the "provider" level (GEOM term) -- I think
(kernel folks, please correct me). I also have no idea how other layers
(ex. CAM) would react to such a "freeze". Likewise, I worry about
userland applications; 100ms is a nice and convenient number... ;-)

On FreeBSD I think what most folks do is avoid all of the above and use
filesystem snapshots exclusively, either ZFS or UFS -- although UFS
snapshots... well... don't get me started. Filesystem snapshots are
"supposed" to be fast, but that depends greatly on a lot of things,
including how they're implemented. Honestly, though, they're what most
people turn to, rather than doing backups at the "block level" (e.g.
EBS). I've never encountered anything like a "block-level" freeze or
snapshot on bare metal (it would have to be done somehow at the
controller level; SANs have this, I believe, but not the simple HBAs
I've worked with).

One couldn't even extend something like sync(8) to issue BIO_FLUSH,
because nothing guards against contention between the BIO_FLUSH and the
time the snapshot is taken -- more writes could enter the queue, or
enough could arrive that the queue fills up and gets processed right
then and there, leading to the same situation.
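The per-disk risk above can also be sketched as a toy model (plain Python; Disk, stripe_write, and freeze_all are invented for illustration, not GEOM or ZFS APIs). Freezing the 5 members one at a time leaves a window in which a stripe write lands on the not-yet-frozen members only:

```python
class Disk:
    """Toy disk: a frozen disk defers the write, so it never becomes durable."""

    def __init__(self):
        self.frozen = False
        self.blocks = {}                # durable state an EBS snapshot would see

    def write(self, lba, data):
        if not self.frozen:             # frozen: write waits, hence absent from
            self.blocks[lba] = data     # a snapshot taken in the meantime

def stripe_write(disks, lba, chunks):
    """One logical write split across every member of the vdev."""
    for d, chunk in zip(disks, chunks):
        d.write(lba, chunk)

def freeze_all(disks, mid_loop_write=None):
    """Freeze members one at a time; a write may sneak in mid-iteration."""
    for i, d in enumerate(disks):
        d.frozen = True
        if mid_loop_write and i == 2:   # stripe arrives after 3 of 5 are frozen
            mid_loop_write()

disks = [Disk() for _ in range(5)]
freeze_all(disks, mid_loop_write=lambda: stripe_write(disks, 0, [b"A"] * 5))

# Only the not-yet-frozen members (3 and 4) committed their chunk, so a
# block-level snapshot taken now would capture a torn stripe:
torn = [0 in d.blocks for d in disks]   # [False, False, False, True, True]
```

In other words, unless the freeze is atomic across the whole member set, an in-flight write can straddle the freeze boundary, which is exactly the inconsistency ZFS would later flag as checksum errors.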
This whole thing is a mess due to the layers of disconnect between all
the pieces (including on Linux -- it just so happens they have some
interesting way, with **very specific filesystems**, to accomplish this
task), and, if you ask me, a complete disconnect from reality between
the "cloud providers" (Amazon, etc.) and how actual storage and
filesystems *work*. Very naughty assumptions are being made on their
part -- unless, of course, there is that "shim" I spoke about.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |