From owner-freebsd-questions Tue May 25 8:40:18 1999
Delivered-To: freebsd-questions@freebsd.org
Received: from hecate.webcom.com (hecate.webcom.com [209.1.28.39])
    by hub.freebsd.org (Postfix) with ESMTP id CD1B81577B
    for ; Tue, 25 May 1999 08:40:14 -0700 (PDT)
    (envelope-from graeme@echidna.com)
Received: from kigal.webcom.com (kigal.webcom.com [209.1.28.57])
    by hecate.webcom.com (8.9.1/8.9.1) with SMTP id JAA31815;
    Tue, 25 May 1999 09:40:11 -0700
Received: from [204.143.69.29] by inanna.webcom.com (WebCom SMTP 1.2.1)
    with SMTP id 33775444; Tue May 25 08:36 PDT 1999
Message-Id: <374AC435.15C6C178@echidna.com>
Date: Tue, 25 May 1999 11:39:33 -0400
From: Graeme Tait
Organization: Echidna
X-Mailer: Mozilla 4.5 [en] (Win98; U)
X-Accept-Language: en
Mime-Version: 1.0
To: Juergen Nickelsen
Cc: Alex Heiphetz , freebsd-questions@freebsd.org
Subject: Re: 100% dependability/failsafe/security/hardware
References: <388916.3136624169@ockholm.jn.berlin.snafu.de>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Juergen Nickelsen wrote:
>
> --On Mon, 24. Mai 1999 18:52 -0400 Alex Heiphetz
> wrote:
>
> > 3. How to provide 100% failsafe system?
>
> *All* hardware redundant: CPUs, RAM, secondary storage, data paths,
> power supplies, fans, UPSs, etc.; proactive hardware monitoring

I would say you need to start this project with an acceptable downtime
target. There is no such thing as perfection, and you can't engineer an
appropriate solution without quantifying your requirements. Beyond a
certain point, the reliability target becomes the primary driver of
design and cost.

The first question I would ask is what reliability you expect from your
Internet connectivity. I'm assuming from your original post that these
servers are accessed via the Internet.

I have modest experience of colocating our server with a quality provider.
In the last six months, we've lost connectivity to the server for a
total of about 3 hours, partly because of a network problem involving
the colo, and partly because of the failure of a UPS feeding our server
in the colo. There have been other instances of degraded connectivity.
Our plain-vanilla-Pentium/FreeBSD box has *never* missed a beat. Any
other downtime has been elective on our part, for upgrades, etc.

I have longer-term experience of using a provider hosted by Exodus, the
major Internet colo provider. They have experienced a few hours of
downtime (complete outages) a year due to Exodus's problems. Again,
they have lost both network connectivity and AC power, plus there have
been many (usually brief) instances of degraded connectivity.

So if you are dependent on such a single point of failure, there's not
much point in engineering your equipment to vastly higher standards.

Absent widespread implementation of advanced DNS features (as discussed
recently either here or on freebsd-isp) that would allow multiple
systems to be located separately but answer to a common host name with
automatic failover, I don't know how you can easily circumvent
unreliability in your Internet provider.

I'm trying to deal with this issue for the server we operate. Our
problem is that we may not be able to achieve a satisfactory repair
time if our hardware at the colo fails, because of access problems and
personnel availability.

The current plan is to have duplicate hardware: one system active, one
hot standby (but possibly offloading background tasks like offline
analysis and backup). Rather than buy one verrry expensive system,
we'll have two modest (but decent quality), inexpensive systems,
probably costing less overall, and probably appreciably more reliable
overall. The only common components would be ones that are the
responsibility of our colo provider, and they have plenty of resources
to fix problems.
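
To put rough numbers on the argument above, here is a minimal Python sketch of the availability arithmetic. Only the "about 3 hours in six months" figure comes from this message; the 8 hours a year assumed for a single server is a made-up illustration, and the function names are hypothetical. The parallel formula also assumes the two servers fail independently, which a shared power failure would violate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def availability(downtime_hours: float, period_hours: float) -> float:
    """Fraction of the period during which the service was reachable."""
    return 1.0 - downtime_hours / period_hours


def in_series(*parts: float) -> float:
    """Every part must be up (e.g. colo network/power AND server)."""
    result = 1.0
    for a in parts:
        result *= a
    return result


def in_parallel(*parts: float) -> float:
    """At least one part must be up; assumes independent failures."""
    all_down = 1.0
    for a in parts:
        all_down *= 1.0 - a
    return 1.0 - all_down


def annual_downtime_hours(a: float) -> float:
    """Expected hours of downtime per year at availability a."""
    return (1.0 - a) * HOURS_PER_YEAR


# About 3 hours lost in six months, as reported for the colo above.
colo = availability(3.0, HOURS_PER_YEAR / 2)

# ASSUMPTION (not from the message): one server is down 8 hours a year.
server = availability(8.0, HOURS_PER_YEAR)

single = in_series(colo, server)                      # one box in the colo
pair = in_series(colo, in_parallel(server, server))   # active + hot standby

for name, a in [("colo alone", colo), ("one server", single),
                ("server pair", pair)]:
    print(f"{name:12s} {a:.5f}  ~{annual_downtime_hours(a):.2f} h/year down")
```

Under these assumptions the duplicated pair ends up with roughly the colo's own six hours a year of downtime: once the hardware is redundant, the shared provider dominates, which is why engineering the equipment to vastly higher standards buys little.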
BTW, a major benefit of total duplication is that upgrades can be
performed and tested on the standby machine, a switch effected, and the
upgrade then applied to what was the live machine. If you want 100%
uptime, you need a plan to eliminate elective downtime, and reduce the
risk of live changes to the online system.

The hard part of all this is (1) detecting failure (in either machine);
(2) achieving automatic failover; and (3) what I see as the hardest of
all, how to deal with dynamic data (such as a live database) on the
running machine, and ensure the backup machine picks up where the other
left off, as nearly as possible, without loss of essential data (like
orders or email).

BTW, one thing that concerns me particularly here is that I've seen at
least two cases of "uninterruptible" power failing in colo situations.
Because this means all systems (both duplicate servers) suffer an
unclean shutdown, there's the potential for both to be taken out at
once, especially as in such situations it's not unusual to see multiple
power glitches or line transients. So one might want to furnish each
machine with a separate, small UPS offering status feedback, enabling a
clean shutdown on prolonged primary power loss.

--
Graeme Tait

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-questions" in the body of the message