From owner-freebsd-cluster@FreeBSD.ORG  Mon Sep 26 12:46:16 2005
Message-ID: <4337ED91.8080200@centtech.com>
Date: Mon, 26 Sep 2005 07:46:09 -0500
From: Eric Anderson <anderson@centtech.com>
To: filip wuytack
References: <20050924141025.GA1236@uk.tiscali.com> <4337DF56.6030407@centtech.com> <4337E8A7.6070107@wuytack.net>
In-Reply-To: <4337E8A7.6070107@wuytack.net>
Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org
Subject: Re: Options for synchronising filesystems

filip wuytack wrote:
> Eric Anderson wrote:
>> Brian Candler wrote:
>>> Hello,
>>>
>>> I was wondering if anyone would care to share their experiences in
>>> synchronising filesystems across a number of nodes in a cluster. I can
>>> think of a number of options, but before changing what I'm doing at the
>>> moment I'd like to see if anyone has good experiences with any of the
>>> others.
>>>
>>> The application: a clustered webserver. The users' CGIs run in a chroot
>>> environment, and these clearly need to be identical (otherwise a CGI
>>> running on one box would behave differently when running on a different
>>> box). Ultimately I'd like to synchronise the host OS on each server too.
>>>
>>> Note that this is a single-master, multiple-slave type of filesystem
>>> synchronisation I'm interested in.
>>>
>>>
>>> 1. Keep a master image on an admin box, and rsync it out to the frontends
>>> --------------------------------------------------------------------------
>>>
>>> This is what I'm doing at the moment. Install a master image in
>>> /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and
>>> rsync it. [Actually I'm exporting it using NFS, and the frontends run
>>> rsync locally when required to update their local copies against the NFS
>>> master.]
>>>
>>> Disadvantages:
>>>
>>> - rsyncing a couple of gigs of data is not particularly fast, even when
>>> only a few files have changed
>>>
>>> - if a sysadmin (wrongly) changes a file on a front-end instead of on the
>>> master copy in the admin box, then the change will be lost when the next
>>> rsync occurs. They might think they've fixed a problem, and then (say) 24
>>> hours later their change is wiped. However, if this is a config file, the
>>> fact that the old file has been reinstated might not be noticed until the
>>> daemon is restarted or the box rebooted - maybe months later. This, I
>>> think, is the biggest fundamental problem.
>>>
>>> - files can be added locally and they will remain indefinitely (unless we
>>> use rsync --delete, which is a bit scary). If this is done, then adding a
>>> new machine into the cluster by rsyncing from the master will not pick up
>>> these extra files.
>>>
>>> So, here are the alternatives I'm considering, and I'd welcome any
>>> additional suggestions too.
>>
>> Here are a few ideas on this: do multiple rsyncs, one for each top-level
>> directory. That might speed up your total rsync process. Another similar
>> method is using a content revisioning system. This is only good for some
>> cases, but something like subversion might work ok here.
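
For what it's worth, here's roughly what I meant by splitting it up -
completely untested, and /master/cgi is just a stand-in for wherever the
NFS copy of the master image is mounted on the frontend:

   #!/bin/sh
   # One rsync per top-level directory, all running in parallel.
   # SRC is the NFS-mounted master copy, DST the local copy - both
   # paths are only examples.
   SRC=/master/cgi
   DST=/webroot/cgi
   for d in "$SRC"/*; do
           rsync -a "$d" "$DST"/ &
   done
   wait

Whether that actually buys you anything depends on where the bottleneck is
(disk vs. network vs. building the file lists), so time it against a single
rsync before relying on it.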
>>
>>> 2. Run the images directly off NFS
>>> ----------------------------------
>>>
>>> I've had this running before, even the entire O/S, and it works just
>>> fine. However the NFS server itself then becomes a critical
>>> single-point-of-failure: if it has to be rebooted and is out of service
>>> for 2 minutes, then the whole cluster is out of service for that time.
>>>
>>> I think this is only feasible if I can build a highly-available NFS
>>> server, which really means a pair of boxes serving the same data. Since
>>> the system image is read-only from the point of view of the frontends,
>>> this should be easy enough:
>>>
>>>      frontends                 frontends
>>>       |  |  |                   |  |  |
>>>         NFS     ----------->      NFS
>>>       server 1      sync        server 2
>>>
>>> As far as I know, NFS clients don't support the idea of failing over
>>> from one server to another, so I'd have to make a server pair which
>>> transparently fails over.
>>>
>>> I could make one NFS server take over the other server's IP address
>>> using carp or vrrp. However, I suspect that the clients might notice. I
>>> know that NFS is 'stateless' in the sense that a server can be rebooted,
>>> but for a client to be redirected from one server to the other, I expect
>>> that these filesystems would have to be *identical*, down to the level
>>> of the inode numbers being the same.
>>>
>>> If that's true, then rsync between the two NFS servers won't cut it. I
>>> was thinking of perhaps using geom_mirror plus ggated/ggatec to make a
>>> block-identical read-only mirror image on NFS server 2 - this also has
>>> the advantage that any updates are close to instantaneous.
>>>
>>> What worries me here is how NFS server 2, which has the mirrored
>>> filesystem mounted read-only, will take to having the data changed under
>>> its nose. Does it for example keep caches of inodes in memory, and what
>>> would happen if those inodes on disk were to change? I guess I can
>>> always just unmount and remount the filesystem on NFS server 2 after
>>> each change.
>>
>> I've tried doing something similar. I used fiber-attached storage, and
>> had multiple hosts mounting the same partition. It seemed as though when
>> host A mounted the filesystem read-write, and then host B mounted it
>> read-only, any changes made by host A were not seen by B, and even
>> remounting did not always bring it up to the current state. I believe it
>> has to do with the buffer cache and host A's desire to keep things (like
>> inode changes, block maps, etc) in cache and not write them to disk.
>> FreeBSD does not currently have a multi-system cache coherency protocol
>> to distribute that information to other hosts. This is something I think
>> would be very useful for many people. I suppose you could just mount the
>> filesystem when you know a change has happened, but you still may not see
>> the change. Maybe mounting the filesystem on host A with the sync option
>> would help.
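
If you do end up trying the ggate route, the basic plumbing is roughly the
following (this is from memory, so check ggated(8), ggatec(8) and gmirror(8)
before trusting it - and the host names and device names are made up):

   # On NFS server 2 (the slave): export its data partition read/write
   # to the master, then start the daemon.
   echo "nfs1.example.com RW /dev/da0s1e" > /etc/gg.exports
   ggated

   # On NFS server 1 (the master): attach the remote partition and build
   # a mirror out of the local partition plus the ggate device, so every
   # write lands on both boxes. (newfs /dev/mirror/gm0 first if the
   # partition is fresh.)
   ggatec create -u 0 nfs2.example.com /dev/da0s1e   # appears as /dev/ggate0
   gmirror label -v gm0 /dev/da0s1e /dev/ggate0
   mount /dev/mirror/gm0 /export/cgi

Server 2 can then mount its local partition read-only for serving, but that
runs straight into the stale buffer cache problem above, so you would likely
still need to unmount/remount on server 2 after each batch of changes.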
>>
>>> My other concern is about susceptibility to DoS-type attacks: if one
>>> frontend were to go haywire and start hammering the NFS servers really
>>> hard, it could impact on all the other machines in the cluster.
>>>
>>> However, the problems of data synchronisation are solved: any change
>>> made on the NFS server is visible identically to all front-ends, and
>>> sysadmins can't make changes on the front-ends because the NFS export is
>>> read-only.
>>
>> This was my first thought too, and a highly available NFS server is
>> something any NFS-heavy installation wants (needs). There are a few
>> implementations of clustered filesystems out there, but none for FreeBSD
>> (yet). What those allow is multiple machines talking to shared storage
>> with read/write access. Very handy, but since you only need read-only
>> access, I think your problem is much simpler, and you can get away with a
>> lot less.
>>
>>> 3. Use a network distributed filesystem - CODA? AFS?
>>> ----------------------------------------------------
>>>
>>> If each frontend were to access the filesystem as a read-only network
>>> mount, but have a local copy to work with in the case of disconnected
>>> operation, then the SPOF of an NFS server would be eliminated.
>>>
>>> However, I have no experience with CODA, and although it's been in the
>>> tree since 2002, the READMEs don't inspire confidence:
>>>
>>>   "It is mostly working, but hasn't been run long enough to be sure all
>>>   the bugs are sorted out. ... This code is not SMP ready"
>>>
>>> Also, a local cache is no good if the data you want during disconnected
>>> operation is not in the cache at that time, which I think means this
>>> idea is not actually a very good one.
>>
>> There is also a port for coda. I've been reading about this, and it's an
>> interesting filesystem, but I'm just not sure of its usefulness yet.
>>
>>> 4. Mount filesystems read-only
>>> ------------------------------
>>>
>>> On each front-end I could store /webroot/cgi on a filesystem mounted
>>> read-only to prevent tampering (as long as the sysadmin doesn't remount
>>> it read-write of course). That would work reasonably well, except that
>>> being mounted read-only I couldn't use rsync to update it!
>>>
>>> It might also work with geom_mirror and ggated/ggatec, except for the
>>> issue I raised before about changing blocks on a filesystem under the
>>> nose of a client who is actively reading from it.
>>
>> I suppose you could mount it r/w only when doing the rsync, then switch
>> back to ro once complete. You should be able to do this online, without
>> any issues or taking the filesystem offline.
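
Something like this is all I had in mind - again, /master/cgi is just a
placeholder for wherever the master copy is reachable from the frontend:

   #!/bin/sh
   # Flip the local copy to read/write just long enough to sync it,
   # then flip it back. See mount(8) for the -u (update) flag.
   mount -u -o rw /webroot/cgi
   rsync -a /master/cgi/ /webroot/cgi/
   mount -u -o ro /webroot/cgi

The switch back to read-only can fail with "Device busy" if something still
has a file open for writing in there, so check the exit status rather than
assuming it worked.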
>>
>>> 5. Using a filesystem which really is read-only
>>> -----------------------------------------------
>>>
>>> Better tamper-protection could be had by keeping data in a filesystem
>>> structure which doesn't support any updates at all - such as cd9660 or
>>> geom_uzip.
>>>
>>> The issue here is how to roll out a new version of the data. I could
>>> push out a new filesystem image into a second partition, but it would
>>> then be necessary to unmount the old filesystem and remount the new one
>>> in the same place, and you can't really unmount a filesystem which is in
>>> use. So this would require a reboot.
>>>
>>> I was thinking that some symlink trickery might help:
>>>
>>>   /webroot/cgi -> /webroot/cgi1
>>>   /webroot/cgi1    # filesystem A mounted here
>>>   /webroot/cgi2    # filesystem B mounted here
>>>
>>> It should be possible to unmount /webroot/cgi2, dd in a new image,
>>> remount it, and change the symlink to point to /webroot/cgi2. After a
>>> little while, hopefully all the applications will stop using files in
>>> /webroot/cgi1, so this one can be unmounted and a new one put in its
>>> place on the next update. However this is not guaranteed, especially if
>>> there are long-lived processes using binary images in this partition.
>>> You'd still have to stop and restart all those processes.
>>>
>>> If reboots were acceptable, then the filesystem image could also be
>>> stored in a ramdisk pulled in via pxeboot. This makes sense especially
>>> for geom_uzip, where the data is pre-compressed. However I would still
>>> prefer to avoid frequent reboots if at all possible. Also, whilst a
>>> ramdisk might be OK for the root filesystem, a typical CGI environment
>>> (with perl, php, ruby, python, and loads of libraries) would probably be
>>> too large anyway.
>>>
>>>
>>> 6. Journaling filesystem replication
>>> ------------------------------------
>>>
>>> If the data were stored on a journaling filesystem on the master box,
>>> and the journal logs were distributed out to the slaves, then they would
>>> all have identical filesystem copies and only a minimal amount of data
>>> would need to be pushed out to each machine on each change. (This would
>>> be rather like NetApp and their SnapMirror system.) However I'm not
>>> aware of any journaling filesystem for FreeBSD, let alone whether it
>>> would support filesystem replication in this way.
>>
>> There is a project underway for UFSJ (UFS journaling). Maybe once it is
>> complete, and the bugs are ironed out, one could implement a journal
>> distribution piece to send the journal updates to multiple hosts and
>> achieve what you are thinking. However, that would only distribute the
>> metadata, not the actual data.
>>
> Have a look at DragonFly BSD for this. They are working on a journaling
> filesystem that will do just that.

Do you have a link to some information on this?  I've been looking at
DragonFly, but I'm having trouble finding good information on what is
already working, what is still in planning, and so on.

Eric

-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------
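
P.S. On (5): the symlink flip itself can be done with a very small window,
along these lines (the device name and image file below are just examples).
It doesn't help with the long-lived-process problem, of course - anything
still holding the old image open keeps it busy until it is restarted.

   # Push a new image into the idle partition, then repoint the symlink.
   umount /webroot/cgi2
   dd if=new-cgi.img of=/dev/da0s1g bs=64k
   mount -r /dev/da0s1g /webroot/cgi2
   ln -sfh /webroot/cgi2 /webroot/cgi   # -h: replace the old link, don't follow it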