From owner-freebsd-cluster@FreeBSD.ORG  Mon Sep 26 12:46:16 2005
Message-ID: <4337ED91.8080200@centtech.com>
Date: Mon, 26 Sep 2005 07:46:09 -0500
From: Eric Anderson <anderson@centtech.com>
To: filip wuytack
References: <20050924141025.GA1236@uk.tiscali.com> <4337DF56.6030407@centtech.com> <4337E8A7.6070107@wuytack.net>
In-Reply-To: <4337E8A7.6070107@wuytack.net>
Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org
Subject: Re: Options for synchronising filesystems

filip wuytack wrote:
> Eric Anderson wrote:
>> Brian Candler wrote:
>>> Hello,
>>>
>>> I was wondering if anyone would care to share their experiences in
>>> synchronising filesystems across a number of nodes in a cluster. I can
>>> think of a number of options, but before changing what I'm doing at the
>>> moment I'd like to see if anyone has good experiences with any of the
>>> others.
>>>
>>> The application: a clustered webserver. The users' CGIs run in a chroot
>>> environment, and these clearly need to be identical (otherwise a CGI
>>> running on one box would behave differently when running on a different
>>> box). Ultimately I'd like to synchronise the host OS on each server too.
>>>
>>> Note that this is a single-master, multiple-slave type of filesystem
>>> synchronisation I'm interested in.
>>>
>>>
>>> 1. Keep a master image on an admin box, and rsync it out to the frontends
>>> --------------------------------------------------------------------------
>>>
>>> This is what I'm doing at the moment. Install a master image in
>>> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and
>>> rsync it. [Actually I'm exporting it using NFS, and the frontends run
>>> rsync locally when required to update their local copies against the NFS
>>> master.]
>>>
>>> Disadvantages:
>>>
>>> - rsyncing a couple of gigs of data is not particularly fast, even when
>>> only a few files have changed
>>>
>>> - if a sysadmin (wrongly) changes a file on a front-end instead of on the
>>> master copy in the admin box, then the change will be lost when the next
>>> rsync occurs. They might think they've fixed a problem, and then (say) 24
>>> hours later their change is wiped. However, if this is a config file, the
>>> fact that the old file has been reinstated might not be noticed until the
>>> daemon is restarted or the box rebooted - maybe months later. This, I
>>> think, is the biggest fundamental problem.
>>>
>>> - files can be added locally and they will remain indefinitely (unless we
>>> use rsync --delete, which is a bit scary). If this is done, then adding a
>>> new machine into the cluster by rsyncing from the master will not pick up
>>> these extra files.
>>>
>>> So, here are the alternatives I'm considering, and I'd welcome any
>>> additional suggestions too.
>>
>> Here are a few ideas on this: do multiple rsyncs, one for each top-level
>> directory. That might speed up your total rsync process. Another similar
>> method is using a content revisioning system. This is only good for some
>> cases, but something like subversion might work ok here.
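
For what it's worth, here's roughly what I meant by splitting it up -
completely untested, and /master/cgi is just a stand-in for wherever the
NFS copy of the master image is mounted on the frontend:

   #!/bin/sh
   # One rsync per top-level directory, all running in parallel.
   # SRC is the NFS-mounted master copy, DST the local copy - both
   # paths are only examples.
   SRC=/master/cgi
   DST=/webroot/cgi
   for d in "$SRC"/*; do
           rsync -a "$d" "$DST"/ &
   done
   wait

Whether that actually buys you anything depends on where the bottleneck is
(disk vs. network vs. building the file lists), so time it against a single
rsync before relying on it.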
>>
>>> 2. Run the images directly off NFS
>>> ----------------------------------
>>>
>>> I've had this running before, even the entire O/S, and it works just
>>> fine. However the NFS server itself then becomes a critical
>>> single-point-of-failure: if it has to be rebooted and is out of service
>>> for 2 minutes, then the whole cluster is out of service for that time.
>>>
>>> I think this is only feasible if I can build a highly-available NFS
>>> server, which really means a pair of boxes serving the same data. Since
>>> the system image is read-only from the point of view of the frontends,
>>> this should be easy enough:
>>>
>>>      frontends                 frontends
>>>       |  |  |                   |  |  |
>>>         NFS     ----------->      NFS
>>>       server 1      sync        server 2
>>>
>>> As far as I know, NFS clients don't support the idea of failing over
>>> from one server to another, so I'd have to make a server pair which
>>> transparently fails over.
>>>
>>> I could make one NFS server take over the other server's IP address
>>> using carp or vrrp. However, I suspect that the clients might notice. I
>>> know that NFS is 'stateless' in the sense that a server can be rebooted,
>>> but for a client to be redirected from one server to the other, I expect
>>> that these filesystems would have to be *identical*, down to the level
>>> of the inode numbers being the same.
>>>
>>> If that's true, then rsync between the two NFS servers won't cut it. I
>>> was thinking of perhaps using geom_mirror plus ggated/ggatec to make a
>>> block-identical read-only mirror image on NFS server 2 - this also has
>>> the advantage that any updates are close to instantaneous.
>>>
>>> What worries me here is how NFS server 2, which has the mirrored
>>> filesystem mounted read-only, will take to having the data changed under
>>> its nose. Does it for example keep caches of inodes in memory, and what
>>> would happen if those inodes on disk were to change? I guess I can
>>> always just unmount and remount the filesystem on NFS server 2 after
>>> each change.
>>
>> I've tried doing something similar. I used fiber-attached storage, and
>> had multiple hosts mounting the same partition. It seemed as though when
>> host A mounted the filesystem read-write, and then host B mounted it
>> read-only, any changes made by host A were not seen by B, and even
>> remounting did not always bring it up to the current state. I believe it
>> has to do with the buffer cache and host A's desire to keep things (like
>> inode changes, block maps, etc) in cache and not write them to disk.
>> FreeBSD does not currently have a multi-system cache coherency protocol
>> to distribute that information to other hosts. This is something I think
>> would be very useful for many people. I suppose you could just mount the
>> filesystem when you know a change has happened, but you still may not see
>> the change. Maybe mounting the filesystem on host A with the sync option
>> would help.
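
If you do end up trying the ggate route, the basic plumbing is roughly the
following (this is from memory, so check ggated(8), ggatec(8) and gmirror(8)
before trusting it - and the host names and device names are made up):

   # On NFS server 2 (the slave): export its data partition read/write
   # to the master, then start the daemon.
   echo "nfs1.example.com RW /dev/da0s1e" > /etc/gg.exports
   ggated

   # On NFS server 1 (the master): attach the remote partition and build
   # a mirror out of the local partition plus the ggate device, so every
   # write lands on both boxes. (newfs /dev/mirror/gm0 first if the
   # partition is fresh.)
   ggatec create -u 0 nfs2.example.com /dev/da0s1e   # appears as /dev/ggate0
   gmirror label -v gm0 /dev/da0s1e /dev/ggate0
   mount /dev/mirror/gm0 /export/cgi

Server 2 can then mount its local partition read-only for serving, but that
runs straight into the stale buffer cache problem above, so you would likely
still need to unmount/remount on server 2 after each batch of changes.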
>>
>>> My other concern is about susceptibility to DoS-type attacks: if one
>>> frontend were to go haywire and start hammering the NFS servers really
>>> hard, it could impact on all the other machines in the cluster.
>>>
>>> However, the problems of data synchronisation are solved: any change
>>> made on the NFS server is visible identically to all front-ends, and
>>> sysadmins can't make changes on the front-ends because the NFS export is
>>> read-only.
>>
>> This was my first thought too, and a highly available NFS server is
>> something any NFS-heavy installation wants (needs). There are a few
>> implementations of clustered filesystems out there, but none for FreeBSD
>> (yet). What those allow is multiple machines talking to shared storage
>> with read/write access. Very handy, but since you only need read-only
>> access, I think your problem is much simpler, and you can get away with a
>> lot less.
>>
>>> 3. Use a network distributed filesystem - CODA? AFS?
>>> ----------------------------------------------------
>>>
>>> If each frontend were to access the filesystem as a read-only network
>>> mount, but have a local copy to work with in the case of disconnected
>>> operation, then the SPOF of an NFS server would be eliminated.
>>>
>>> However, I have no experience with CODA, and although it's been in the
>>> tree since 2002, the READMEs don't inspire confidence:
>>>
>>>   "It is mostly working, but hasn't been run long enough to be sure all
>>>   the bugs are sorted out. ... This code is not SMP ready"
>>>
>>> Also, a local cache is no good if the data you want during disconnected
>>> operation is not in the cache at that time, which I think means this
>>> idea is not actually a very good one.
>>
>> There is also a port for coda. I've been reading about this, and it's an
>> interesting filesystem, but I'm just not sure of its usefulness yet.
>>
>>> 4. Mount filesystems read-only
>>> ------------------------------
>>>
>>> On each front-end I could store /webroot/cgi on a filesystem mounted
>>> read-only to prevent tampering (as long as the sysadmin doesn't remount
>>> it read-write of course). That would work reasonably well, except that
>>> being mounted read-only I couldn't use rsync to update it!
>>>
>>> It might also work with geom_mirror and ggated/ggatec, except for the
>>> issue I raised before about changing blocks on a filesystem under the
>>> nose of a client who is actively reading from it.
>>
>> I suppose you could mount it r/w only when doing the rsync, then switch
>> back to ro once complete. You should be able to do this online, without
>> any issues or taking the filesystem offline.
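
Something like this is all I had in mind - again, /master/cgi is just a
placeholder for wherever the master copy is reachable from the frontend:

   #!/bin/sh
   # Flip the local copy to read/write just long enough to sync it,
   # then flip it back. See mount(8) for the -u (update) flag.
   mount -u -o rw /webroot/cgi
   rsync -a /master/cgi/ /webroot/cgi/
   mount -u -o ro /webroot/cgi

The switch back to read-only can fail with "Device busy" if something still
has a file open for writing in there, so check the exit status rather than
assuming it worked.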
>>
>>> 5. Using a filesystem which really is read-only
>>> -----------------------------------------------
>>>
>>> Better tamper-protection could be had by keeping data in a filesystem
>>> structure which doesn't support any updates at all - such as cd9660 or
>>> geom_uzip.
>>>
>>> The issue here is how to roll out a new version of the data. I could
>>> push out a new filesystem image into a second partition, but it would
>>> then be necessary to unmount the old filesystem and remount the new one
>>> in the same place, and you can't really unmount a filesystem which is in
>>> use. So this would require a reboot.
>>>
>>> I was thinking that some symlink trickery might help:
>>>
>>>   /webroot/cgi -> /webroot/cgi1
>>>   /webroot/cgi1    # filesystem A mounted here
>>>   /webroot/cgi2    # filesystem B mounted here
>>>
>>> It should be possible to unmount /webroot/cgi2, dd in a new image,
>>> remount it, and change the symlink to point to /webroot/cgi2. After a
>>> little while, hopefully all the applications will stop using files in
>>> /webroot/cgi1, so this one can be unmounted and a new one put in its
>>> place on the next update. However this is not guaranteed, especially if
>>> there are long-lived processes using binary images in this partition.
>>> You'd still have to stop and restart all those processes.
>>>
>>> If reboots were acceptable, then the filesystem image could also be
>>> stored in a ramdisk pulled in via pxeboot. This makes sense especially
>>> for geom_uzip, where the data is pre-compressed. However I would still
>>> prefer to avoid frequent reboots if at all possible. Also, whilst a
>>> ramdisk might be OK for the root filesystem, a typical CGI environment
>>> (with perl, php, ruby, python, and loads of libraries) would probably be
>>> too large anyway.
>>>
>>>
>>> 6. Journaling filesystem replication
>>> ------------------------------------
>>>
>>> If the data were stored on a journaling filesystem on the master box,
>>> and the journal logs were distributed out to the slaves, then they would
>>> all have identical filesystem copies and only a minimal amount of data
>>> would need to be pushed out to each machine on each change. (This would
>>> be rather like NetApp and their SnapMirror system.) However I'm not
>>> aware of any journaling filesystem for FreeBSD, let alone whether it
>>> would support filesystem replication in this way.
>>
>> There is a project underway for UFSJ (UFS journaling). Maybe once it is
>> complete, and the bugs are ironed out, one could implement a journal
>> distribution piece to send the journal updates to multiple hosts and
>> achieve what you are thinking. However, that would only distribute the
>> metadata, not the actual data.
>>
> Have a look at DragonFly BSD for this. They are working on a journaling
> filesystem that will do just that.

Do you have a link to some information on this?  I've been looking at
DragonFly, but I'm having trouble finding good information on what is
already working, what is still in planning, and so on.

Eric

-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------
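
P.S. On (5): the symlink flip itself can be done with a very small window,
along these lines (the device name and image file below are just examples).
It doesn't help with the long-lived-process problem, of course - anything
still holding the old image open keeps it busy until it is restarted.

   # Push a new image into the idle partition, then repoint the symlink.
   umount /webroot/cgi2
   dd if=new-cgi.img of=/dev/da0s1g bs=64k
   mount -r /dev/da0s1g /webroot/cgi2
   ln -sfh /webroot/cgi2 /webroot/cgi   # -h: replace the old link, don't follow it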