Date: Thu, 11 Aug 2016 11:10:16 +0200
From: Julien Cigar <julien@perdition.city>
To: Borja Marcos <borjam@sarenet.es>
Cc: Jordan Hubbard <jkh@ixsystems.com>, freebsd-fs@freebsd.org
Subject: Re: HAST + ZFS + NFS + CARP
Message-ID: <20160811091016.GI70364@mordor.lan>
In-Reply-To: <E7D42341-D324-41C7-B03A-2420DA7A7952@sarenet.es>
References: <6035AB85-8E62-4F0A-9FA8-125B31A7A387@gmail.com> <20160703192945.GE41276@mordor.lan> <20160703214723.GF41276@mordor.lan> <65906F84-CFFC-40E9-8236-56AFB6BE2DE1@ixsystems.com> <B48FB28E-30FA-477F-810E-DF4F575F5063@gmail.com> <61283600-A41A-4A8A-92F9-7FAFF54DD175@ixsystems.com> <20160704183643.GI41276@mordor.lan> <AE372BF0-02BE-4BF3-9073-A05DB4E7FE34@ixsystems.com> <20160704193131.GJ41276@mordor.lan> <E7D42341-D324-41C7-B03A-2420DA7A7952@sarenet.es>
On Thu, Aug 11, 2016 at 10:11:15AM +0200, Borja Marcos wrote:
>
> > On 04 Jul 2016, at 21:31, Julien Cigar <julien@perdition.city> wrote:
> >
> >> To get specific again, I am not sure I would do what you are contemplating given your circumstances, since it's not the cheapest / simplest solution. The cheapest / simplest solution would be to create 2 small ZFS servers and simply do zfs snapshot replication between them at periodic intervals, so you have a backup copy of the data for maximum safety as well as a physically separate server in case one goes down hard. Disk storage is the cheap part now, particularly if you have data redundancy and can therefore use inexpensive disks, and ZFS replication is certainly "good enough" for disaster recovery. As others have said, adding additional layers will only increase the overall fragility of the solution, and "fragile" is kind of the last thing you need when you're frantically trying to deal with a server that has gone down for what could be any number of reasons.
> >>
> >> I, for example, use a pair of FreeNAS Minis at home to store all my media and they work fine at minimal cost. I use one as the primary server that talks to all of the VMware / Plex / iTunes server applications (and serves as a backup device for all my iDevices), and it replicates the entire pool to another secondary server that can be pushed into service as the primary if the first one loses a power supply / catches fire / loses more than 1 drive at a time / etc. Since I have a backup, I can also just use RAIDZ1 for the 4x4TB drive configuration on the primary and get a good storage / redundancy ratio (I can lose a single drive without data loss but am also not wasting a lot of storage on parity).
> >
> > You're right, I'll definitely reconsider the zfs send / zfs receive
> > approach.
>
> Sorry to be so late to the party.
>
> Unless you have a *hard* requirement for synchronous replication, I would avoid it like the plague. Synchronous replication sounds sexy, but it has several disadvantages: complexity, and, if you wish to keep an off-site replica, it will definitely impact performance. Distance will increase delay.
>
> Asynchronous replication with ZFS has several advantages, however.
>
> First and foremost: the snapshot-replicate approach is a terrific short-term "backup" solution that will allow you to recover quickly from some all-too-frequent incidents, like your own software corrupting data. A ZFS snapshot is trivial to roll back and it won't involve a costly "backup recovery" procedure. You can do both replication *and* keep some snapshot retention policy à la Apple's Time Machine.
>
> Second: I mentioned distance when keeping off-site replicas, as distance necessarily increases delay. Asynchronous replication doesn't have that problem.
>
> Third: With some care you can do a one-to-N replication, even involving different replication frequencies.
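
(A minimal sketch of the snapshot-replicate cycle described above. The dataset and host names, tank/data and backuphost, are placeholders, and the remote side is assumed to have already been seeded with an initial full send; a tool such as zrep wraps these same primitives with locking, snapshot naming and error handling.)

#!/bin/sh
# Illustrative snapshot-replicate cycle, not a hardened script.
DATASET="tank/data"     # assumed dataset name
REMOTE="backuphost"     # assumed replica host, reachable over ssh
NOW="rep-$(date +%Y%m%d%H%M%S)"

# Newest existing snapshot of the dataset: this is the common base
# that both sides share from the previous run.
PREV=$(zfs list -H -t snapshot -d 1 -o name -s creation "${DATASET}" | tail -1)

# Take a new snapshot and ship only the delta since the previous one.
zfs snapshot "${DATASET}@${NOW}"
zfs send -i "${PREV}" "${DATASET}@${NOW}" | \
    ssh "${REMOTE}" zfs receive -F "${DATASET}"

# Pruning of old snapshots (the retention policy) is left out here.

Run from cron at whatever interval matches the data loss you can tolerate; keeping some of the intermediate snapshots around gives you the Time Machine-like retention mentioned above.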
> Several years ago, in 2009 I think, I set up a system that worked quite well. It was based on NFS and ZFS. The requirements were a bit particular, which in this case greatly simplified it for me.
>
> I had a farm of front-end web servers (running Apache) that took all of their content from an NFS server. The NFS server used ZFS as the file system. This might not be useful for everyone, but in this case the web servers were CPU bound due to plenty of PHP crap. As the front ends weren't supposed to write to the file server (and indeed it was undesirable for security reasons) I could afford to export the NFS file systems in read-only mode.
>
> The server was replicated to a sibling at 1 or 2 minute intervals, I don't remember. And the interesting part was this. I used Heartbeat to decide which of the servers was the master. When Heartbeat decided which one was the master, a specific IP address was assigned to it, starting the NFS service. So, the front-ends would happily mount it.
>
> What happened in case of a server failure?
>
> Heartbeat would detect it in a minute more or less. Assuming a master failure, the former slave would become master, assigning itself the NFS server IP address and starting up NFS. Meanwhile, the front-ends had a silly script running at 1 minute intervals that simply read a file from the NFS-mounted filesystem. In case there was a reading error it would force an unmount of the NFS share and enter a loop trying to mount it again until it succeeded.
>
> It looks kludgy, but that means that in case of a server loss (ZFS on FreeBSD wasn't that stable at the time and we suffered a couple of them) the website was titsup for maybe two minutes, recovering automatically. It worked.
>
> Both NFS servers were in the same datacenter, but I could have added geographical dispersion by using BGP to announce the NFS IP address to our routers.
>
> There are better solutions, but this one involved no fancy software licenses, no expensive hardware, and it was quite reliable. The only problem we had was, maybe I was just too daring, that we were bitten by a ZFS deadlock bug several times. But it worked anyway.
>
>

As I said in a previous post I tested the zfs send/receive approach (with zrep) and it works (more or less) perfectly, so I concur with all you said, especially about off-site replication and synchronous replication.

Out of curiosity I'm also testing a ZFS + iSCSI + CARP setup at the moment. I'm in the early tests, haven't done any heavy writes yet, but ATM it works as expected, and I haven't managed to corrupt the zpool.

I think that with the following assumptions the failover from MASTER (old master) -> BACKUP (new master) can be done quite safely (the opposite *MUST* always be done manually IMHO):

1) Don't mount the zpool at boot
2) Ensure that the failover script is not executed at boot
3) Once the failover script has been executed and the BACKUP is the new MASTER, assume that it will remain so, unless changed manually

This is to avoid a possible split-brain scenario after a catastrophic power loss in the DC, when both nodes go off / come back on simultaneously. 2) is especially important with a CARPed interface, where the state can sometimes flip from BACKUP -> MASTER -> BACKUP at boot. For 3) you must adapt the advskew of the CARPed interface, so that even if the BACKUP (now master) has an unplanned shutdown/reboot, the old MASTER (now backup) doesn't take over again unless done manually.

So you should do something like:

sysrc ifconfig_bge0_alias0="vhid 54 advskew 10 pass xxx alias 192.168.10.15/32"
ifconfig bge0 vhid 54 advskew 10

in the failover script (where the "new" advskew (10) is smaller than the old master's (now backup) advskew).
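
(What such a failover script might look like, as a rough sketch only: the pool name "tank", the bge0 / vhid 54 / alias values carried over from the example above, and nfsd as the service consuming the pool are all placeholders, not the actual configuration. Per assumptions 1) and 2) it must never run at boot, only once the MASTER is known to be gone.)

#!/bin/sh
# Illustrative failover script for the BACKUP node.
set -e

# Become the preferred CARP master: lower our advskew, both live and in
# rc.conf, so we keep the address even if the old master comes back
# (assumption 3).
sysrc ifconfig_bge0_alias0="vhid 54 advskew 10 pass xxx alias 192.168.10.15/32"
ifconfig bge0 vhid 54 advskew 10

# Import the pool.  -f is needed because it was last active on the other
# node; this is exactly the step that must never run on both heads at once.
zpool import -f tank

# Start whatever serves the pool to the clients (NFS here, as a placeholder).
service nfsd onestart

Whether you run it by hand or hook it to monitoring you trust is a separate question; the point is that nothing in rc.conf should be able to trigger it on its own at boot.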
The failover should only be done for unplanned events, so if you reboot the MASTER for some reason (freebsd-update, etc.) the failover script on the BACKUP should handle that.

(more soon...)

Julien

>
>
> Borja.
>
>
>

-- 
Julien Cigar
Belgian Biodiversity Platform (http://www.biodiversity.be)
PGP fingerprint: EEF9 F697 4B68 D275 7B11 6A25 B2BB 3710 A204 23C0

No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.