Date:      Fri, 30 Mar 2012 10:06:15 +0200
From:      Johnny Bergström <freebsd@joonix.se>
To:        freebsd-fs@freebsd.org
Subject:   CARP + HAST + ZFS + NFS = problems?
Message-ID:  <CAPUPEBm_MMLEf3GKO0OeZ3xisYehEMffW=HZWgO6ZsQO15EUuQ@mail.gmail.com>

Hello!

This is my first post to the mailing list, so please bear with me.

I have been looking for a good way to build a highly available storage
server that will serve photos via NFS.
I stumbled upon this article:
http://www.aisecure.net/2012/02/07/hast-freebsd-zfs-with-carp-failover/
which is mostly what my current lab setup looks like.

This looked like a promising setup, and I trust ZFS since I've used it
in simpler setups before. Unfortunately there have been some problems
in my tests.
The whole idea is that the slave machine should handle failures of the
master, such as the network interface going down or even a
crash/reset. Many times my tests have worked flawlessly: clients stall
on the NFS mount until the new server takes over, then continue
writing data as if nothing had happened.
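
For what it's worth, this stall-and-resume behaviour is what a hard
NFS mount (the FreeBSD default) gives you; the clients mount the
export roughly like this, where the server name and export path are
placeholders:

# Hard mount: I/O blocks while the CARP address is unreachable and
# resumes once the new master starts nfsd, instead of returning errors.
mount -t nfs -o nfsv3,hard storage.example:/tank/photos /mnt/photos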

But more than once I've had ZFS metadata corruption, or even
unrecoverable errors where zpool import -F doesn't work and tells me
to restore the pool from backup.

My tests consist of simply doing ifconfig down on the master while a
client is writing data, and also doing a hard reset of the master machine.
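
Concretely, the failure injection is nothing fancier than this (the
interface name is just an example):

# On the master, while a client is writing to the NFS export:
ifconfig xn0 down    # drop the link; CARP on the slave should become MASTER

# The other test is simply a hard reset of the master VM from the Xen host.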

This is what the lab setup looks like:
2 virtual machines running as Xen guests with 1 GB of RAM each.
2 virtual drives, ZFS-mirrored via HAST devices.
A CARP setup monitored by devd, which executes a script similar to the
one in the article, with additions to start/stop nfsd (a rough sketch
follows below).
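
To make that last part concrete, the devd hook and failover script are
essentially the same sketch as in the article; the interface, HAST
resource and pool names below are placeholders, and this is an outline
rather than my exact files:

# /etc/devd.conf: react to CARP link state changes on carp0
notify 30 {
        match "system" "IFNET";
        match "subsystem" "carp0";
        match "type" "LINK_UP";
        action "/usr/local/sbin/carp-hast-switch master";
};
# ...plus a matching LINK_DOWN block that calls the script with "slave".

#!/bin/sh
# /usr/local/sbin/carp-hast-switch: flip HAST roles and (re)start NFS.
case "$1" in
master)
        # Become HAST primary, import the pool, start serving NFS.
        hastctl role primary disk1
        hastctl role primary disk2
        zpool import -f tank
        /etc/rc.d/nfsd onestart
        ;;
slave)
        # Stop NFS, export the pool, fall back to HAST secondary.
        /etc/rc.d/nfsd onestop
        zpool export tank
        hastctl role secondary disk1
        hastctl role secondary disk2
        ;;
esac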

The corruption problems have never happened when I use ZFS directly on
the drives, without HAST in between. I've also noticed that the
virtual hard drives have problems flushing:
Mar 29 10:01:01 storage1 hastd[6690]: [disk2] (primary) Remote request failed (Operation not supported by device): FLUSH.
Mar 29 10:01:02 storage1 hastd[6690]: [disk2] (primary) Unable to flush disk cache on activemap update: Operation not supported by device.
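
For completeness, the HAST resources are defined more or less like
this (the addresses and the local device path are placeholders for the
Xen virtual drives):

# /etc/hast.conf, identical on both nodes
resource disk1 {
        on storage1 {
                local /dev/xbd1
                remote 10.0.0.2
        }
        on storage2 {
                local /dev/xbd1
                remote 10.0.0.1
        }
}
# resource disk2 is defined the same way on the second virtual drive.

# The pool is then a plain ZFS mirror on top of the two HAST providers:
zpool create tank mirror /dev/hast/disk1 /dev/hast/disk2

hast.conf(5) also documents a metaflush option related to the
activemap cache flush in the log above, but I haven't experimented
with it.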

I'm wondering if anyone has used this setup on real machines in
production. It seems a bit sketchy to me right now, but in theory the
solution would be perfect.
My idea is to apply this on real machines with a raidz of 3 drives,
which would later be expanded with an additional 3 drives.
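
Roughly what I have in mind for the real machines (disk names are
placeholders); as far as I know a raidz vdev can't be grown in place,
so the expansion would be a second 3-drive raidz vdev in the same pool:

# Initial pool: one 3-drive raidz vdev on top of HAST providers.
zpool create tank raidz /dev/hast/disk1 /dev/hast/disk2 /dev/hast/disk3

# Later expansion: add a second 3-drive raidz vdev.
zpool add tank raidz /dev/hast/disk4 /dev/hast/disk5 /dev/hast/disk6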


Any suggestions on fixing this problem or an alternative configuration
that might be more stable?


