Date: Sat, 02 Mar 2013 16:57:00 -0600
From: Karl Denninger
To: freebsd-stable@freebsd.org
Subject: Re: Musings on ZFS Backup strategies

On 3/2/2013 4:14 PM, Peter Jeremy wrote:
> On 2013-Mar-01 08:24:53 -0600, Karl Denninger wrote:
>> If I then restore the base and snapshot, I get back to where I was when
>> the latest snapshot was taken.  I don't need to keep the incremental
>> snapshot for longer than it takes to zfs send it, so I can do:
>>
>> zfs snapshot pool/some-filesystem@unique-label
>> zfs send -i pool/some-filesystem@base pool/some-filesystem@unique-label
>> zfs destroy pool/some-filesystem@unique-label
>>
>> and that seems to work (and restore) just fine.
> This gives you an incremental since the base snapshot - which will
> probably grow in size over time.  If you are storing the ZFS send
> streams on (eg) tape, rather than receiving them, you probably still
> want the "Towers of Hanoi" style backup hierarchy to control your
> backup volume.  It's also worth noting that whilst the stream will
> contain the compression attributes of the filesystem(s) in it, the
> actual data in the stream is uncompressed.

I noted that.  The script I wrote to do this looks at the compression
status of the filesystem and, if it is enabled, pipes the data stream
through pbzip2 on the way to storage.

The only problem with this presumption is that for database "data"
filesystems the "best practices" say you should set the recordsize to
the underlying page size of the DBMS (e.g. 8k for PostgreSQL) for best
performance and NOT enable compression.  The reality, however, is that
the on-disk format of most database files is EXTREMELY compressible
(often well better than 2:1), so I sacrifice a fair amount of space
there.
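
To make that concrete, the per-filesystem logic in the script amounts to
something like the following sketch (the dataset, snapshot and output
names here are placeholders, not what I actually use):

  #!/bin/sh
  # Sketch only: "pool/some-filesystem" and the /backup paths are examples.
  FS="pool/some-filesystem"
  SNAP="${FS}@unique-label"
  OUT="/backup/$(echo "$FS" | tr / _).zsend"

  # Look at the filesystem's compression setting.
  COMP=$(zfs get -H -o value compression "$FS")

  zfs snapshot "$SNAP"
  if [ "$COMP" != "off" ]; then
          # The filesystem is compressed, but the send stream is not,
          # so squeeze it on the way to the backup disk.
          zfs send -i "${FS}@base" "$SNAP" | pbzip2 > "${OUT}.bz2"
  else
          zfs send -i "${FS}@base" "$SNAP" > "$OUT"
  fi
  zfs destroy "$SNAP"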
I think the better option is to stuff a user parameter into the
filesystem attribute table (which apparently I can do without
restriction) telling the script whether or not to compress on output,
so it's not tied to the filesystem's compression setting.

I'm quite curious, in fact, as to whether the "best practices" really
are best in today's world.  Specifically, for a machine with lots of
compute power I wonder if enabling compression on the database
filesystems and leaving the recordsize alone would be a net performance
win due to the reduction in actual I/O volume.  This assumes you have
the CPU available, of course, but CPU has gotten cheaper much faster
than I/O bandwidth has.

>> This in turn means that keeping more than two incremental dumps offline
>> has little or no value; the second merely being taken to ensure that
>> there is always at least one that has been written to completion without
>> error to apply on top of the base.
> This is quite a critical point with this style of backup: The ZFS send
> stream is not intended as an archive format.  It includes error
> detection but no error correction and any error in a stream renders
> the whole stream unusable (you can't retrieve only part of a stream).
> If you go this way, you probably want to wrap the stream in a FEC
> container (eg based on ports/comms/libfec) and/or keep multiple copies.

That's no more of a problem than it is for a dump file saved on a disk
though, is it?  While restore can (putatively) read past errors on a
tape, in reality if the storage is a disk and part of the file is
unreadable, the REST of that particular archive is lost as well.
Skipping unreadable records does "sorta work" for tapes, but it rarely
if ever does for storage on a spinning device within the boundary of
the impacted file.

In practice I attempt to cover this by (1) saving the stream to local
disk and then (2) rsync'ing the first disk to a second one in the same
cabinet.  If the file I just wrote is unreadable I should discover that
at (2), which hopefully is well before I actually need it in anger.
Disk #2 then gets rotated out to an offsite vault on a regular schedule
in case the building catches fire or similar.

My exposure here is to time-related bitrot, which is a non-zero risk,
but I can't scrub a disk that's sitting in a vault, so I don't know
that there's a realistic way around that risk other than a full online
"hot site" that I can ship the snapshots to (and I don't have the
bandwidth or storage to cover that).

If I change the backup media (currently UFS-formatted) to ZFS and dump
directly there via zfs send/receive, I could run both backup drives as
a mirror instead of rsync'ing from one to the other after the first
copy is done, then detach one side of the mirror to rotate that drive
out and attach the returning one, causing a resilver (roughly as
sketched below).  That's fine EXCEPT that if I have a controller go
insane I now probably lose everything other than the offsite copy,
since everything is open for write during the backup operation.  That
ain't so good, and it's a risk I've had turn into reality twice in 20
years.

On the upside, if the primary copy has an error on it I catch it when I
try to resilver, as that operation will fail: the entire on-disk data
structure has to be traversed, and the checksums should catch any
silent corruption.  If that happens I know I'm naked (other than the
vault copy, which I hope is good!) until I replace the backup drive
with the error and re-copy everything.
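
For what it's worth, the rotation I'm describing would go roughly like
this; the "backup" pool and daN device names are invented for the
example, and the real procedure obviously has to wait for and check the
resilver before anything gets detached:

  # Attach the drive returning from the vault as a mirror of the
  # resident backup drive; ZFS resilvers it, traversing the on-disk
  # tree and verifying checksums as it reads.
  zpool attach backup da3 da4

  # Watch the resilver and check for errors before going further.
  zpool status backup

  # Once the resilver is clean, detach the drive headed offsite.
  # (Both sides are current at this point, so either can go.)
  zpool detach backup da4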
What I have trouble quantifying is which is the LARGER risk; I've yet
to have a backup drive be unreadable when I needed it, and I do test my
restore capability pretty regularly, but twice in 20 years I've had
active disk adapters in running machines destroy every write-mounted
drive attached to them without warning.  Both times the pucker factor
went off the charts as soon as I realized what had happened, because
from an operational perspective it was pretty much identical to a
tornado or fire destroying the machine.

> The "recommended" approach is to do zfs send | zfs recv and store a
> replica of your pool (with whatever level of RAID that meets your
> needs).  This way, you immediately detect an error in the send stream
> and can repeat the send.  You then use scrub to verify (and recover)
> the replica.

I'm contemplating how to set that up in a way that works and has a
reasonable operational profile for putting it into practice.

What I do now leaves the backup volumes unmounted except when actually
being written to, which decreases (but does not completely eliminate)
the risk of an insane controller scribbling on them.  Setting the
volumes read-only doesn't help me at a filesystem level, as the risk
here is insane software, and the days of a nice physical WRITE PROTECT
switch on the front of a drive carrier are long in the past.

I am also concerned about what happens as volume space grows beyond
what can be saved on "X" devices and the problems associated with that.
I've long since moved to using disk drives as a catalog for data
streams rather than actual sequential media (e.g. tapes) due to the
ridiculous imbalance in cost between high-capacity DLT-style drives and
disks of equivalent storage, never mind transfer rates.

One of the challenges I see with ZFS is that it appears a bogus block
somewhere on a non-redundant medium may block future access to the
entire pool.  I'm not sure whether that's actually the case or whether
you can read around the error, but if it IS the case it's a serious
problem.  UFS doesn't suffer from that; it will return errors on the
file(s) impacted, but if you avoid touching those you can read the rest
of the pack and the data on it, assuming the failure is not total.  ZFS
doesn't really invalidate the entire pool on one unrecoverable error,
does it?  (The documentation is not at all clear on whether this is the
case.)

>> (Yes, I know, I've been a ZFS resister.... ;-))
> "Resistance is futile."

You know what happened to the Borg in the end, right? ;-)

-- 
-- Karl Denninger
/The Market Ticker ®/
Cuda Systems LLC