From owner-freebsd-hackers Wed Jan 7 21:13:44 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.7/8.8.7) id VAA01607 for hackers-outgoing; Wed, 7 Jan 1998 21:13:44 -0800 (PST) (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from nomis.simon-shapiro.org (nomis.i-Connect.Net [206.190.143.100]) by hub.freebsd.org (8.8.7/8.8.7) with SMTP id VAA01600 for ; Wed, 7 Jan 1998 21:13:35 -0800 (PST) (envelope-from shimon@nomis.Simon-Shapiro.ORG)
Received: (qmail 7146 invoked by uid 1000); 7 Feb 2036 06:28:23 -0000
Message-ID:
X-Mailer: XFMail 1.3-alpha-010198 [p0] on FreeBSD
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
In-Reply-To:
Date: Wed, 06 Feb 2036 22:28:23 -0800 (PST)
Reply-To: shimon@simon-shapiro.org
Organization: The Simon Shapiro Foundation
From: Simon Shapiro
To: Tom
Subject: Re: SFT
Cc: hackers@FreeBSD.ORG, Capriotti
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk

On 06-Jan-98 Tom wrote:
>
> On Tue, 6 Jan 1998, Capriotti wrote:
>
>> Does anyone have any news about System Fault Tolerance under Free?
>>
>> Like what Novell has, from mirrored disks to mirrored servers?
>
> Mirrored disks can be done with ccd.
>
> Mirrored servers are basically what Unix types call a cluster. On
> Novell this is easy, because Novell boxes are basically just file
> servers, but a Unix box could be doing many different things. Migrating
> tasks from the failed system to the working system, and assumption
> of the IP traffic, is difficult.

I am working on such a system, full time. Our solution is a bit different
from Novell's. We are aiming at a true non-stop, full-utilization model
(no standby). Simple mirroring is an expensive and non-scalable solution
to data persistence, and it has very little to do with high availability.
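For reference, the ccd mirroring Tom mentions looks roughly like the
following on a FreeBSD box of that era. This is only a sketch: the device
names (/dev/da0s1e, /dev/da1s1e), the 128-sector interleave, and the
mount point are assumptions, not details from this thread.

```shell
# Sketch only -- device names, interleave, and mount point are assumptions.
# Build a mirrored volume from two disk partitions with the ccd(4) driver.
# ccdconfig usage: ccdconfig ccd interleave flags dev [dev ...]
ccdconfig ccd0 128 CCDF_MIRROR /dev/da0s1e /dev/da1s1e

# Put a filesystem on the mirrored device and mount it.
newfs /dev/ccd0c
mount /dev/ccd0c /mnt
```

Note that this mirrors data on one host only; it does nothing for the
server-failover case discussed below.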
To throw a monkey wrench into this discussion, please consider that by far
the most common cause of system service failure (a crash is a severe
failure to provide service, right?) is software bugs. Far behind come
cables and connectors, and farther behind still are disks and memory. It
is my hunch that DRAM fails as often as hard disks do (per storage unit
per unit of time), or even more often. This flies in the face of software
solutions to high availability.

[ Obviously this is exactly one half of the story. You come up with the
other half :-) ]

----------
Sincerely Yours,

Simon Shapiro
Shimon@Simon-Shapiro.ORG
Voice: 503.799.2313