Date:      Fri, 1 Jul 2016 08:39:41 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Ben RUBSON <ben.rubson@gmail.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: HAST + ZFS + NFS + CARP
Message-ID:  <1963819198.192910387.1467376781932.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <5F99508D-7532-468A-9121-7A76957A72DB@gmail.com>
References:  <71b8da1e-acb2-9d4e-5d11-20695aa5274a@internetx.com> <20160701084717.GE5695@mordor.lan> <47c7e1a5-6ae8-689c-9c2d-bb92f659ea43@internetx.com> <20160701101524.GF5695@mordor.lan> <f74627e3-604e-da71-c024-7e4e71ff36cb@internetx.com> <20160701105735.GG5695@mordor.lan> <5776569B.3050504@quip.cz> <5F99508D-7532-468A-9121-7A76957A72DB@gmail.com>

Choosing the most recent post in the thread (Ben RUBSON):

First off, this topic is of interest to me because a pNFS server
would need a similar failover mechanism for its metadata server.
Having said that, I know little about this and am learning from
what I've read (i.e. keep up the good work, folks ;-).

I will put a few NFS-related comments inline.

> 
> > On 01 Jul 2016, at 13:40, Miroslav Lachman <000.fbsd@quip.cz> wrote:
> > 
> > Julien Cigar wrote on 07/01/2016 12:57:
> > 
> >>>> why...? I guess iSCSI is slower but should be safer than HAST, no?
> >>> 
> >>> do your testing, please, even with simulated short network cuts. 10-20
> >>> secs are more than enough to give you a picture of what is going to happen
> >> 
> >> of course I'll test everything properly :) I don't have the hardware yet
> >> so ATM I'm just looking for all the possible "candidates", and I'm
> >> aware that redundant storage is not that easy to implement ...
> >> 
> >> but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
> >> or zfs send|ssh zfs receive as you suggest (but it's
> >> not realtime), or a distributed FS (which I avoid like the plague..)
> > 
> > When disaster comes you will need to restart NFS clients in almost all
> > cases (with CARP + ZFS + HAST|iSCSI) and you will lose some writes too.
> > And if something bad happens with your mgmt scripts or network you can end
> > up with a corrupted ZFS pool on both master and slave - you will need to
> > recover from backups. For example, in a split-brain scenario where both
> > nodes try to import the pool.
> 
What the clients need to have happen is for their TCP connection to fail, so
that they must establish a new TCP connection. At that point, they will retry
any outstanding RPCs, including writes done UNSTABLE for which no subsequent
Commit RPC has completed successfully.
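
As an aside, this retry-after-reconnect behaviour assumes the clients are
using hard TCP mounts. A client /etc/fstab entry might look something like
the line below (the server name and paths are placeholders, not anything
from this thread):

  # hard TCP mount: RPCs are retried over a new connection instead of
  # erroring out while the server is unreachable (hard is the default)
  nfsserver:/tank/export  /mnt/export  nfs  rw,tcp,nfsv3,hard  0  0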

I know nothing about CARP (have never used it), but I have a hunch it will
not do the above at the correct point in time.

In general, I would consider failing over from the NFS server to the backup
one (where the drives have been kept up to date via a ZFS mirror using iSCSI)
to be a major event. I don't think you want to do this for short outages
(server overload, server crash/reboot, etc.).
I think it is hard to distinguish between a slow server and a failed one, and
you don't want this switchover happening unless it is really needed, imho.

When it happens, I think you want a strict ordering of events like:
- old master server shut down (off the network or at least moved to different
  IP addresses so it appears off the network to the clients).
- an orderly switchover of the ZFS file system, so that the backup is now the
  only system handling the file system (the paragraph just below here covers
  that, I think?)
- new server (was backup) comes up with the same IP address(es) as the old one
  had before the failure.
  --> I am not sure if anything has to be done to speed up ARP cache invalidation
      for the old server's IP->MAC mapping.
(I'm not sure what you do w.r.t. mirroring for the backup server. Ideally there
 would be a way for a third server to have mirrored disks resilvered from the
 backup server's. I don't know enough about ZFS to know if this could be done
 with a third server, where its empty disks are set up as iSCSI mirrors of the
 backup server's after the failover? I think this is also covered by the para.
 below.)
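
Just to make that ordering concrete, here is a rough sketch of the two halves
of the switchover. It is untested; the interface name, pool name and service
address are made up for the example, and the real thing would need a lot more
error checking:

  # On the old master (if it is still reachable at all):
  service nfsd onestop
  service mountd onestop
  ifconfig em0 inet 192.0.2.10 -alias    # drop the service IP address
  zpool export tank                      # if the pool is still imported

  # On the backup, only once the old master is known to be off the network:
  zpool import -f tank
  service mountd onestart
  service nfsd onestart
  ifconfig em0 inet 192.0.2.10/24 alias  # take over the service IP address
  # FreeBSD announces an address when it is configured, which should help
  # invalidate the clients' ARP entries for the old IP->MAC mapping.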

> Of course you must take care that both nodes do not import the pool at the
> same time.
> For the slave to import the pool, first stop the iSCSI targets (ctld), and
> also put the network replication interface down, to be sure.
> Then, import the pool.
> Once the old master is repaired, export its pool (if still imported), make
> its disks iSCSI targets and give them to the old slave (promoted to master
> just above).
> Of course it implies some meticulous administration.
> 
Yes. I don't think CARP can be trusted to do this. The problem is that there
is a time ordering w.r.t. activities on the two servers. However, you can't
assume that the two servers can communicate with each other.

The simpler/more reliable way would be to do it manually, by a sysadmin (who
might also know why and how long the master will be down for).
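
For what it's worth, the manual take-over quoted above might boil down to
something like this on the slave (untested; the replication interface name
and pool name are only examples, not taken from the original posts):

  # On the slave, before touching the pool:
  service ctld onestop     # stop serving our disks as iSCSI targets
  ifconfig lagg1 down      # replication link down, so the old master
                           # cannot keep writing to our disks
  zpool import -f tank     # now we should be the only node using the pool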

It might be possible to automate this with daemons on the two servers, with
something like:
- master sends regular heartbeat messages to backup.
- if master gets no NFS client activity for N seconds, it sends a "shutdown"
  message to the backup and then does a full shutdown.

- if slave gets shutdown message from master OR doesn't see a heartbeat
  message for something like 2 * N seconds, then it assumes it can take over
  and does so.
--> The trick is, the backup can't start a takeover until the master really
    is offline.
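
A very rough sketch of the heartbeat half of that scheme is below (not
working code; the host name, timeout and take-over script are made up, and
it does not show the explicit "shutdown" message or the NFS-activity check):

  #!/bin/sh
  # On the master: refresh a heartbeat file on the backup every N seconds.
  N=30
  while :; do
      ssh backup touch /var/run/master.heartbeat
      sleep "$N"
  done

  #!/bin/sh
  # On the backup: take over only if the heartbeat is older than 2 * N seconds.
  N=30
  while :; do
      sleep "$N"
      age=$(( $(date +%s) - $(stat -f %m /var/run/master.heartbeat) ))
      if [ "$age" -gt $(( 2 * N )) ]; then
          # hypothetical script doing the switchover steps sketched earlier;
          # note this alone still cannot guarantee the master is really offline
          /usr/local/sbin/takeover.sh
          break
      fi
  done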

> > With ZFS send & receive you will lose some writes, but the chance that
> > you will corrupt both pools is much lower than in the first case, and the
> > setup is much simpler and more robust against runtime errors.
> 
> Only some?
> Depending on the write throughput, won't you lose a lot of data on the
> target/slave?
> How do you make ZFS send/receive near-realtime?
> while [ 1 ] do ; snapshot ; send/receive ; delete old snapshots ; done ?
> 
Well, if the NFS clients aren't buggy and the server isn't running sync=disabled,
then nothing should get lost so long as the clients recognize that the server has
"crashed/rebooted". This happens when the TCP connection to the server breaks.

You can't have a case where a client's TCP connection to the server keeps
functioning but actually goes to the backup server, and you can't have the
case where some clients are still doing NFS RPCs to the old master while
others are doing RPCs to the backup.
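
As for the send/receive loop quoted above, that is roughly the shape of it.
A minimal sketch (the dataset and host names are made up, there is no error
handling, and old snapshots will pile up on the slave unless they are pruned
there as well):

  #!/bin/sh
  # Near-realtime replication of tank/export to the slave via incremental sends.
  prev="repl.$(date +%Y%m%d%H%M%S)"
  zfs snapshot "tank/export@${prev}"
  zfs send "tank/export@${prev}" | ssh slave zfs receive -F tank/export
  while :; do
      sleep 60
      cur="repl.$(date +%Y%m%d%H%M%S)"
      zfs snapshot "tank/export@${cur}"
      zfs send -i "@${prev}" "tank/export@${cur}" | \
          ssh slave zfs receive -F tank/export
      zfs destroy "tank/export@${prev}"   # keep only the latest snapshot locally
      prev="$cur"
  done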

Good luck with it. It is an interesting challenge and worth exploring.

rick

> Thanks !
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
> 


