From owner-freebsd-isp@FreeBSD.ORG Mon Sep 26 12:25:37 2005
Message-ID: <4337E8A7.6070107@wuytack.net>
Date: Mon, 26 Sep 2005 13:25:11 +0100
From: filip wuytack <filip@wuytack.net>
To: Eric Anderson
Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org, Brian Candler
Subject: Re: Options for synchronising filesystems
In-Reply-To: <4337DF56.6030407@centtech.com>

Eric Anderson wrote:
> Brian Candler wrote:
>
>> Hello,
>>
>> I was wondering if anyone would care to share their experiences in synchronising filesystems across a number of nodes in a cluster. I can think of a number of options, but before changing what I'm doing at the moment I'd like to see if anyone has good experiences with any of the others.
>>
>> The application: a clustered webserver. The users' CGIs run in a chroot environment, and these clearly need to be identical (otherwise a CGI running on one box would behave differently when running on a different box). Ultimately I'd like to synchronise the host OS on each server too.
>>
>> Note that this is a single-master, multiple-slave type of filesystem synchronisation I'm interested in.
>>
>>
>> 1. Keep a master image on an admin box, and rsync it out to the frontends
>> -------------------------------------------------------------------------
>>
>> This is what I'm doing at the moment. Install a master image in /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and rsync it. [Actually I'm exporting it using NFS, and the frontends run rsync locally when required to update their local copies against the NFS master]
>>
>> Disadvantages:
>>
>> - rsyncing a couple of gigs of data is not particularly fast, even when only a few files have changed
>>
>> - if a sysadmin (wrongly) changes a file on a front-end instead of on the master copy in the admin box, then the change will be lost when the next rsync occurs. They might think they've fixed a problem, and then (say) 24 hours later their change is wiped. However, if this is a config file, the fact that the old file has been reinstated might not be noticed until the daemon is restarted or the box rebooted - maybe months later. This I think is the biggest fundamental problem.
>>
>> - files can be added locally and they will remain indefinitely (unless we use rsync --delete, which is a bit scary). If this is done then adding a new machine into the cluster by rsyncing from the master will not pick up these extra files.
>>
>> So, here are the alternatives I'm considering, and I'd welcome any additional suggestions too.
>
> Here are a few ideas on this: do multiple rsyncs, one for each top-level directory. That might speed up your total rsync process. Another similar method is using a content revisioning system. This is only good for some cases, but something like subversion might work ok here.
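
(Just to make that concrete, here is roughly what the current rsync scheme plus Eric's per-directory idea could look like. Only a sketch - the mount point, paths and rsync options below are examples, not necessarily what Brian runs.)

    #!/bin/sh
    # Pull the master image from the admin box over a read-only NFS mount
    # (mount point and hostname are examples only).
    mount -t nfs -o ro admin:/webroot/cgi /mnt/master

    # Current scheme: one big rsync of the whole tree.
    rsync -a --numeric-ids /mnt/master/ /webroot/cgi/

    # Per-top-level-directory variant, run in parallel, with --delete so
    # locally added files don't linger.  (Files sitting directly in the
    # root of the image would still need the single rsync above.)
    for d in /mnt/master/*/; do
        rsync -a --delete --numeric-ids "$d" "/webroot/cgi/$(basename "$d")/" &
    done
    wait
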
>> 2. Run the images directly off NFS
>> ----------------------------------
>>
>> I've had this running before, even the entire O/S, and it works just fine. However the NFS server itself then becomes a critical single point of failure: if it has to be rebooted and is out of service for 2 minutes, then the whole cluster is out of service for that time.
>>
>> I think this is only feasible if I can build a highly-available NFS server, which really means a pair of boxes serving the same data. Since the system image is read-only from the point of view of the frontends, this should be easy enough:
>>
>>    frontends                    frontends
>>     | | |                        | | |
>>    NFS        ----------->      NFS
>>    server 1       sync          server 2
>>
>> As far as I know, NFS clients don't support the idea of failing over from one server to another, so I'd have to make a server pair which transparently fails over.
>>
>> I could make one NFS server take over the other server's IP address using carp or vrrp. However, I suspect that the clients might notice. I know that NFS is 'stateless' in the sense that a server can be rebooted, but for a client to be redirected from one server to the other, I expect that these filesystems would have to be *identical*, down to the level of the inode numbers being the same.
>>
>> If that's true, then rsync between the two NFS servers won't cut it. I was thinking of perhaps using geom_mirror plus ggated/ggatec to make a block-identical read-only mirror image on NFS server 2 - this also has the advantage that any updates are close to instantaneous.
>>
>> What worries me here is how NFS server 2, which has the mirrored filesystem mounted read-only, will take to having the data changed under its nose. Does it for example keep caches of inodes in memory, and what would happen if those inodes on disk were to change? I guess I can always just unmount and remount the filesystem on NFS server 2 after each change.
>
> I've tried doing something similar. I used fibre-attached storage, and had multiple hosts mounting the same partition. It seemed as though when host A mounted the filesystem read-write, and then host B mounted it read-only, any changes made by host A were not seen by B, and even remounting did not always bring it up to current state. I believe it has to do with the buffer cache and host A's desire to keep things (like inode changes, block maps, etc.) in cache and not write them to disk. FreeBSD does not currently have a multi-system cache coherency protocol to distribute that information to other hosts. This is something I think would be very useful for many people. I suppose you could just remount the filesystem when you know a change has happened, but you still may not see the change. Maybe mounting the filesystem on host A with the sync option would help.
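
(For what it's worth, a rough sketch of the ggated/ggatec plus geom_mirror arrangement described above, with the remount cycle on the second server as the workaround for the cache-coherency problem Eric mentions. Device names, addresses and the exports line are made-up examples, and this is untested.)

    # --- NFS server 2: export a spare partition to server 1 over the network ---
    echo "192.168.0.1 RW /dev/ad1s1d" >> /etc/gg.exports   # example address/device
    ggated

    # --- NFS server 1 (master): mirror the local partition onto server 2's ---
    kldload geom_mirror                            # if not compiled into the kernel
    ggatec create -o rw 192.168.0.2 /dev/ad1s1d    # attaches as /dev/ggate0
    gmirror label -v webroot /dev/ad1s1d /dev/ggate0
    newfs /dev/mirror/webroot
    mount /dev/mirror/webroot /webroot             # export this read-only via NFS

    # --- NFS server 2: use its own copy of the mirror, read-only ---
    mount -o ro /dev/ad1s1d /webroot
    # After each change pushed from server 1, cycle the mount so stale cached
    # metadata is dropped (per the discussion above, even this may not always
    # be enough).
    umount /webroot && mount -o ro /dev/ad1s1d /webroot
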
>> My other concern is about susceptibility to DoS-type attacks: if one frontend were to go haywire and start hammering the NFS servers really hard, it could impact on all the other machines in the cluster.
>>
>> However, the problems of data synchronisation are solved: any change made on the NFS server is visible identically to all front-ends, and sysadmins can't make changes on the front-ends because the NFS export is read-only.
>
> This was my first thought too, and a highly available NFS server is something any NFS-heavy installation wants (needs). There are a few implementations of clustered filesystems out there, but none for FreeBSD (yet). What that allows is multiple machines talking to shared storage with read/write access. Very handy, but since you only need read-only access, I think your problem is much simpler, and you can get away with a lot less.
>
>> 3. Use a network distributed filesystem - CODA? AFS?
>> ----------------------------------------------------
>>
>> If each frontend were to access the filesystem as a read-only network mount, but have a local copy to work with in the case of disconnected operation, then the SPOF of an NFS server would be eliminated.
>>
>> However, I have no experience with CODA, and although it's been in the tree since 2002, the READMEs don't inspire confidence:
>>
>>   "It is mostly working, but hasn't been run long enough to be sure all the bugs are sorted out. ... This code is not SMP ready"
>>
>> Also, a local cache is no good if the data you want during disconnected operation is not in the cache at that time, which I think means this idea is not actually a very good one.
>
> There is also a port for Coda. I've been reading about this, and it's an interesting filesystem, but I'm just not sure of its usefulness yet.
>
>> 4. Mount filesystems read-only
>> ------------------------------
>>
>> On each front-end I could store /webroot/cgi on a filesystem mounted read-only to prevent tampering (as long as the sysadmin doesn't remount it read-write, of course). That would work reasonably well, except that being mounted read-only I couldn't use rsync to update it!
>>
>> It might also work with geom_mirror and ggated/ggatec, except for the issue I raised before about changing blocks on a filesystem under the nose of a client who is actively reading from it.
>
> I suppose you could mount r/w only when doing the rsync, then switch back to ro once complete. You should be able to do this online, without any issues and without taking the filesystem offline.
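
(In other words, something along these lines on each frontend - just a sketch, with example paths, and note that the downgrade back to read-only will fail if anything still holds a file open for writing.)

    # Flip /webroot/cgi to read-write just long enough to sync it.
    mount -u -o rw /webroot/cgi
    rsync -a --delete /mnt/master/ /webroot/cgi/
    # And back to read-only; this fails if any file is still open for writing.
    mount -u -o ro /webroot/cgi
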
>> 5. Using a filesystem which really is read-only
>> -----------------------------------------------
>>
>> Better tamper-protection could be had by keeping data in a filesystem structure which doesn't support any updates at all - such as cd9660 or geom_uzip.
>>
>> The issue here is how to roll out a new version of the data. I could push out a new filesystem image into a second partition, but it would then be necessary to unmount the old filesystem and remount the new one in the same place, and you can't really unmount a filesystem which is in use. So this would require a reboot.
>>
>> I was thinking that some symlink trickery might help:
>>
>>   /webroot/cgi -> /webroot/cgi1
>>   /webroot/cgi1   # filesystem A mounted here
>>   /webroot/cgi2   # filesystem B mounted here
>>
>> It should be possible to unmount /webroot/cgi2, dd in a new image, remount it, and change the symlink to point to /webroot/cgi2. After a little while, hopefully all the applications will stop using files in /webroot/cgi1, so this one can be unmounted and a new one put in its place on the next update. However this is not guaranteed, especially if there are long-lived processes using binary images in this partition. You'd still have to stop and restart all those processes.
>>
>> If reboots were acceptable, then the filesystem image could also be stored in a ramdisk pulled in via pxeboot. This makes sense especially for geom_uzip, where the data is pre-compressed. However I would still prefer to avoid frequent reboots if at all possible. Also, whilst a ramdisk might be OK for the root filesystem, a typical CGI environment (with perl, php, ruby, python, and loads of libraries) would probably be too large anyway.
>>
>>
>> 6. Journaling filesystem replication
>> ------------------------------------
>>
>> If the data were stored on a journaling filesystem on the master box, and the journal logs were distributed out to the slaves, then they would all have identical filesystem copies and only a minimal amount of data would need to be pushed out to each machine on each change. (This would be rather like NetApp and their SnapMirror system.) However I'm not aware of any journaling filesystem for FreeBSD, let alone whether it would support filesystem replication in this way.
>
> There is a project underway for UFSJ (UFS journaling). Maybe once it is complete, and the bugs are ironed out, one could implement a journal distribution piece to send the journal updates to multiple hosts and achieve what you are thinking; however, that only distributes the meta-data, and not the actual data.

Have a look at DragonFly BSD for this. They are working on a journaling filesystem that will do just that.

~ Fil

> Good luck finding your ultimate solution!
>
> Eric
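
P.S. A rough sketch of the symlink flip described under option 5, in case it is useful. The image file, device name and mount options are only examples (a cd9660 or uzip image would be mounted with the appropriate type), and long-lived processes holding files open on the old image still need restarting:

    # /webroot/cgi currently points at /webroot/cgi1; /dev/ad1s2 is the spare slot.
    umount /webroot/cgi2
    dd if=/images/cgi-new.img of=/dev/ad1s2 bs=64k
    mount -o ro /dev/ad1s2 /webroot/cgi2

    # Repoint the symlink at the new image; -h makes ln replace the link itself
    # rather than creating a new link inside the directory it points to.
    ln -sfh /webroot/cgi2 /webroot/cgi

    # /webroot/cgi1 can be unmounted and reused once nothing has files open on it.
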