Date: Fri, 20 Feb 1998 08:27:21 -0500 (EST)
From: "John T. Farmer"
Message-Id: <199802201327.IAA21923@sabre.goldsword.com>
To: agdolla@datanet.hu, freebsd-isp@FreeBSD.ORG
Subject: Re: fault tolerant :)) setup
Cc: jfarmer@goldsword.com
Sender: owner-freebsd-isp@FreeBSD.ORG

On Thu, 19 Feb 1998 19:02:53 +0100 (NFT) Gabor Dolla said:
>I'd like to hear opinions on fault-tolerant setups....
>
>Say, you have two identical machines, one is a mail server the other is
>the www server, and when one of them is down the other does both jobs.
>
>A few years back I worked for a company which had some Digital Alpha
>servers. Digital had a nice disk tower with an Y cable so both servers
>were able to access the same disks. Are there such products available for
>PCs ?

I used to design real-time factory data-acquisition & control systems,
so I'll give this a go. There are at least 5 issues to deal with:

  1. Keeping data in sync between two machines where only one machine
     is active at a time. For simplicity, think of it as one machine
     running & the 2nd machine in "hot-standby."

  2. Handling incoming network traffic that is associated with a
     specific IP address.

  3. Dealing with the transaction that is disrupted when the primary
     machine fails.

  4. Monitoring the primary machine for failure, determining that it
     has failed, and locking it out to allow the secondary machine to
     complete or restart the transaction.

  5. Restoring the primary machine to service (a subset of startup
     conditions).

How these issues are dealt with is _highly_ dependent on the
application(s) involved & the "transparency" required for fault
handling. There are also commercial systems available that use tools
like these (and others) to provide fault recovery or prevention (two
big players in this are Tandem Computer, now part of Compaq; and
Stratus Computers).

On the smaller side, several things can be done to improve the typical
server setup. Examples are:

  - RAID disk subsystems, possibly with dual control ports.

  - "Round-robin" DNS entries to spread the load over several
    machines. In addition, dynamic DNS updates with _very_ short
    expire times could be used to "disable" access to a faulting
    server (at the cost of defeating DNS caching).

  - With an SNMP-controllable hub/switch, the port used by a faulting
    machine can be locked out of service.

  - Of course, any single-point failure modes should be avoided or
    removed if possible. For example, everybody thinks about redundant
    "hot-swap" power supplies in servers. What about the power circuit
    to it? Are both supplies fed from the same UPS? The same breaker?

There are some reasonable things that are easy to overlook, but that
anybody serious about reliable service can implement without a lot of
expense. However, the costs escalate rapidly, with the improvement in
availability diminishing just as rapidly. You might move availability
from 90% to 95% by spending $XXXus. To go from 95% to 96% might require
spending 10 * $XXXus.

John

-------------------------------------------------------------------------
John T. Farmer              Proprietor, GoldSword Systems
jfarmer@goldsword.com       Public Internet Access in East Tennessee
Office: (423)691-6498       for info, e-mail to info@goldsword.com
                            Network Design, Internet Services & Servers,
                            Consulting
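
As a rough illustration of issues 4 and 5 (deciding that the primary has
failed, locking it out, and later restoring it), here is a minimal sketch
in Python. It is not taken from the message above: the class name
FailureDetector, the max_missed threshold of 3, and the idea of feeding it
one boolean per heartbeat probe are all assumptions made for the example.
How the probe itself is done (ping, serial link, SNMP) is left out.

```python
from dataclasses import dataclass


@dataclass
class FailureDetector:
    """Hypothetical helper: declare the primary failed after
    `max_missed` consecutive missed heartbeats."""
    max_missed: int = 3   # assumed threshold, tune per application
    missed: int = 0
    failed: bool = False

    def record(self, heartbeat_ok: bool) -> bool:
        """Feed one probe result; return True once the primary is locked out."""
        if self.failed:
            # Stay locked out so a flapping primary cannot retake the service.
            return True
        if heartbeat_ok:
            self.missed = 0          # any good heartbeat resets the count
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.failed = True   # the secondary should take over now
        return self.failed

    def restore(self) -> None:
        """Put the primary back in service after it has been repaired."""
        self.missed = 0
        self.failed = False


if __name__ == "__main__":
    detector = FailureDetector(max_missed=3)
    for ok in [True, False, True, False, False, False, True]:
        print(f"heartbeat ok={ok!s:5}  primary failed={detector.record(ok)}")
```

Requiring several consecutive misses trades detection speed for fewer
false takeovers, and keeping the lockout until an explicit restore()
mirrors the "locking it out" part of issue 4.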
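
The diminishing-returns point about availability can be made concrete
with a little arithmetic. The sketch below converts the 90%, 95% and 96%
figures from the message into hours of downtime per year, then divides
the hours recovered by the relative cost (1 unit and 10 units, standing
in for $XXXus and 10 * $XXXus). The function name and the 365-day year
are assumptions for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


# (from %, to %, relative cost) using the ratios quoted in the message
steps = [(90, 95, 1), (95, 96, 10)]

for before, after, cost_units in steps:
    gained = downtime_hours_per_year(before) - downtime_hours_per_year(after)
    print(f"{before}% -> {after}%: {gained:.1f} hours/year recovered, "
          f"{gained / cost_units:.2f} hours per cost unit")
```

Under these assumptions the first step recovers 438 hours a year for one
cost unit, while the second recovers roughly 88 hours for ten units.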