From owner-freebsd-isp@FreeBSD.ORG Mon Sep 26 12:25:37 2005
Message-ID: <4337E8A7.6070107@wuytack.net>
Date: Mon, 26 Sep 2005 13:25:11 +0100
From: filip wuytack <filip@wuytack.net>
To: Eric Anderson
Cc: freebsd-isp@freebsd.org, freebsd-cluster@freebsd.org, Brian Candler
Subject: Re: Options for synchronising filesystems
In-Reply-To: <4337DF56.6030407@centtech.com>

Eric Anderson wrote:
> Brian Candler wrote:
>
>> Hello,
>>
>> I was wondering if anyone would care to share their experiences in synchronising filesystems across a number of nodes in a cluster. I can think of a number of options, but before changing what I'm doing at the moment I'd like to see if anyone has good experiences with any of the others.
>>
>> The application: a clustered webserver. The users' CGIs run in a chroot environment, and these clearly need to be identical (otherwise a CGI running on one box would behave differently when running on a different box). Ultimately I'd like to synchronise the host OS on each server too.
>>
>> Note that this is a single-master, multiple-slave type of filesystem synchronisation I'm interested in.
>>
>>
>> 1. Keep a master image on an admin box, and rsync it out to the frontends
>> -------------------------------------------------------------------------
>>
>> This is what I'm doing at the moment. Install a master image in /webroot/cgi, add packages there (chroot /webroot/cgi pkg_add ...), and rsync it. [Actually I'm exporting it using NFS, and the frontends run rsync locally when required to update their local copies against the NFS master]
>>
>> Disadvantages:
>>
>> - rsyncing a couple of gigs of data is not particularly fast, even when only a few files have changed
>>
>> - if a sysadmin (wrongly) changes a file on a front-end instead of on the master copy in the admin box, then the change will be lost when the next rsync occurs. They might think they've fixed a problem, and then (say) 24 hours later their change is wiped. However, if this is a config file, the fact that the old file has been reinstated might not be noticed until the daemon is restarted or the box rebooted - maybe months later. This I think is the biggest fundamental problem.
>>
>> - files can be added locally and they will remain indefinitely (unless we use rsync --delete, which is a bit scary). If this is done then adding a new machine into the cluster by rsyncing from the master will not pick up these extra files.
>>
>> So, here are the alternatives I'm considering, and I'd welcome any additional suggestions too.
>
> Here are a few ideas on this: do multiple rsyncs, one for each top-level directory. That might speed up your total rsync process. Another similar method is using a content revisioning system. This is only good for some cases, but something like subversion might work ok here.
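
(Just to make that concrete, here is roughly what the current rsync scheme plus Eric's per-directory idea could look like. Only a sketch - the mount point, paths and rsync options below are examples, not necessarily what Brian runs.)

    #!/bin/sh
    # Pull the master image from the admin box over a read-only NFS mount
    # (mount point and hostname are examples only).
    mount -t nfs -o ro admin:/webroot/cgi /mnt/master

    # Current scheme: one big rsync of the whole tree.
    rsync -a --numeric-ids /mnt/master/ /webroot/cgi/

    # Per-top-level-directory variant, run in parallel, with --delete so
    # locally added files don't linger.  (Files sitting directly in the
    # root of the image would still need the single rsync above.)
    for d in /mnt/master/*/; do
        rsync -a --delete --numeric-ids "$d" "/webroot/cgi/$(basename "$d")/" &
    done
    wait
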
>> 2. Run the images directly off NFS
>> ----------------------------------
>>
>> I've had this running before, even the entire O/S, and it works just fine. However the NFS server itself then becomes a critical single point of failure: if it has to be rebooted and is out of service for 2 minutes, then the whole cluster is out of service for that time.
>>
>> I think this is only feasible if I can build a highly-available NFS server, which really means a pair of boxes serving the same data. Since the system image is read-only from the point of view of the frontends, this should be easy enough:
>>
>>    frontends                    frontends
>>     | | |                        | | |
>>    NFS        ----------->      NFS
>>    server 1       sync          server 2
>>
>> As far as I know, NFS clients don't support the idea of failing over from one server to another, so I'd have to make a server pair which transparently fails over.
>>
>> I could make one NFS server take over the other server's IP address using carp or vrrp. However, I suspect that the clients might notice. I know that NFS is 'stateless' in the sense that a server can be rebooted, but for a client to be redirected from one server to the other, I expect that these filesystems would have to be *identical*, down to the level of the inode numbers being the same.
>>
>> If that's true, then rsync between the two NFS servers won't cut it. I was thinking of perhaps using geom_mirror plus ggated/ggatec to make a block-identical read-only mirror image on NFS server 2 - this also has the advantage that any updates are close to instantaneous.
>>
>> What worries me here is how NFS server 2, which has the mirrored filesystem mounted read-only, will take to having the data changed under its nose. Does it for example keep caches of inodes in memory, and what would happen if those inodes on disk were to change? I guess I can always just unmount and remount the filesystem on NFS server 2 after each change.
>
> I've tried doing something similar. I used fibre-attached storage, and had multiple hosts mounting the same partition. It seemed as though when host A mounted the filesystem read-write, and then host B mounted it read-only, any changes made by host A were not seen by B, and even remounting did not always bring it up to current state. I believe it has to do with the buffer cache and host A's desire to keep things (like inode changes, block maps, etc.) in cache and not write them to disk. FreeBSD does not currently have a multi-system cache coherency protocol to distribute that information to other hosts. This is something I think would be very useful for many people. I suppose you could just remount the filesystem when you know a change has happened, but you still may not see the change. Maybe mounting the filesystem on host A with the sync option would help.
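
(For what it's worth, a rough sketch of the ggated/ggatec plus geom_mirror arrangement described above, with the remount cycle on the second server as the workaround for the cache-coherency problem Eric mentions. Device names, addresses and the exports line are made-up examples, and this is untested.)

    # --- NFS server 2: export a spare partition to server 1 over the network ---
    echo "192.168.0.1 RW /dev/ad1s1d" >> /etc/gg.exports   # example address/device
    ggated

    # --- NFS server 1 (master): mirror the local partition onto server 2's ---
    kldload geom_mirror                            # if not compiled into the kernel
    ggatec create -o rw 192.168.0.2 /dev/ad1s1d    # attaches as /dev/ggate0
    gmirror label -v webroot /dev/ad1s1d /dev/ggate0
    newfs /dev/mirror/webroot
    mount /dev/mirror/webroot /webroot             # export this read-only via NFS

    # --- NFS server 2: use its own copy of the mirror, read-only ---
    mount -o ro /dev/ad1s1d /webroot
    # After each change pushed from server 1, cycle the mount so stale cached
    # metadata is dropped (per the discussion above, even this may not always
    # be enough).
    umount /webroot && mount -o ro /dev/ad1s1d /webroot
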
>> My other concern is about susceptibility to DoS-type attacks: if one frontend were to go haywire and start hammering the NFS servers really hard, it could impact on all the other machines in the cluster.
>>
>> However, the problems of data synchronisation are solved: any change made on the NFS server is visible identically to all front-ends, and sysadmins can't make changes on the front-ends because the NFS export is read-only.
>
> This was my first thought too, and a highly available NFS server is something any NFS-heavy installation wants (needs). There are a few implementations of clustered filesystems out there, but none for FreeBSD (yet). What that allows is multiple machines talking to shared storage with read/write access. Very handy, but since you only need read-only access, I think your problem is much simpler, and you can get away with a lot less.
>
>> 3. Use a network distributed filesystem - CODA? AFS?
>> ----------------------------------------------------
>>
>> If each frontend were to access the filesystem as a read-only network mount, but have a local copy to work with in the case of disconnected operation, then the SPOF of an NFS server would be eliminated.
>>
>> However, I have no experience with CODA, and although it's been in the tree since 2002, the READMEs don't inspire confidence:
>>
>>   "It is mostly working, but hasn't been run long enough to be sure all the bugs are sorted out. ... This code is not SMP ready"
>>
>> Also, a local cache is no good if the data you want during disconnected operation is not in the cache at that time, which I think means this idea is not actually a very good one.
>
> There is also a port for Coda. I've been reading about this, and it's an interesting filesystem, but I'm just not sure of its usefulness yet.
>
>> 4. Mount filesystems read-only
>> ------------------------------
>>
>> On each front-end I could store /webroot/cgi on a filesystem mounted read-only to prevent tampering (as long as the sysadmin doesn't remount it read-write, of course). That would work reasonably well, except that being mounted read-only I couldn't use rsync to update it!
>>
>> It might also work with geom_mirror and ggated/ggatec, except for the issue I raised before about changing blocks on a filesystem under the nose of a client who is actively reading from it.
>
> I suppose you could mount r/w only when doing the rsync, then switch back to ro once complete. You should be able to do this online, without any issues and without taking the filesystem offline.
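
(In other words, something along these lines on each frontend - just a sketch, with example paths, and note that the downgrade back to read-only will fail if anything still holds a file open for writing.)

    # Flip /webroot/cgi to read-write just long enough to sync it.
    mount -u -o rw /webroot/cgi
    rsync -a --delete /mnt/master/ /webroot/cgi/
    # And back to read-only; this fails if any file is still open for writing.
    mount -u -o ro /webroot/cgi
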
>> 5. Using a filesystem which really is read-only
>> -----------------------------------------------
>>
>> Better tamper-protection could be had by keeping data in a filesystem structure which doesn't support any updates at all - such as cd9660 or geom_uzip.
>>
>> The issue here is how to roll out a new version of the data. I could push out a new filesystem image into a second partition, but it would then be necessary to unmount the old filesystem and remount the new one in the same place, and you can't really unmount a filesystem which is in use. So this would require a reboot.
>>
>> I was thinking that some symlink trickery might help:
>>
>>   /webroot/cgi -> /webroot/cgi1
>>   /webroot/cgi1   # filesystem A mounted here
>>   /webroot/cgi2   # filesystem B mounted here
>>
>> It should be possible to unmount /webroot/cgi2, dd in a new image, remount it, and change the symlink to point to /webroot/cgi2. After a little while, hopefully all the applications will stop using files in /webroot/cgi1, so this one can be unmounted and a new one put in its place on the next update. However this is not guaranteed, especially if there are long-lived processes using binary images in this partition. You'd still have to stop and restart all those processes.
>>
>> If reboots were acceptable, then the filesystem image could also be stored in a ramdisk pulled in via pxeboot. This makes sense especially for geom_uzip, where the data is pre-compressed. However I would still prefer to avoid frequent reboots if at all possible. Also, whilst a ramdisk might be OK for the root filesystem, a typical CGI environment (with perl, php, ruby, python, and loads of libraries) would probably be too large anyway.
>>
>>
>> 6. Journaling filesystem replication
>> ------------------------------------
>>
>> If the data were stored on a journaling filesystem on the master box, and the journal logs were distributed out to the slaves, then they would all have identical filesystem copies and only a minimal amount of data would need to be pushed out to each machine on each change. (This would be rather like NetApp and their SnapMirror system.) However I'm not aware of any journaling filesystem for FreeBSD, let alone whether it would support filesystem replication in this way.
>
> There is a project underway for UFSJ (UFS journaling). Maybe once it is complete, and the bugs are ironed out, one could implement a journal distribution piece to send the journal updates to multiple hosts and achieve what you are thinking; however, that only distributes the meta-data, and not the actual data.

Have a look at DragonFly BSD for this. They are working on a journaling filesystem that will do just that.

~ Fil

> Good luck finding your ultimate solution!
>
> Eric
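
P.S. A rough sketch of the symlink flip described under option 5, in case it is useful. The image file, device name and mount options are only examples (a cd9660 or uzip image would be mounted with the appropriate type), and long-lived processes holding files open on the old image still need restarting:

    # /webroot/cgi currently points at /webroot/cgi1; /dev/ad1s2 is the spare slot.
    umount /webroot/cgi2
    dd if=/images/cgi-new.img of=/dev/ad1s2 bs=64k
    mount -o ro /dev/ad1s2 /webroot/cgi2

    # Repoint the symlink at the new image; -h makes ln replace the link itself
    # rather than creating a new link inside the directory it points to.
    ln -sfh /webroot/cgi2 /webroot/cgi

    # /webroot/cgi1 can be unmounted and reused once nothing has files open on it.
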