From owner-freebsd-cluster  Wed Mar  6 08:24:59 2002
Date: Wed, 06 Mar 2002 17:24:24 +0100 (MET)
From: Andy Sporner
Reply-To: Andy Sporner
Organization: NENTEC Netzwerktechnologie GmbH
To: Ronald G Minnich
Cc: freebsd-cluster@FreeBSD.ORG, Jason Fried
Subject: RE: FreeBSD Cluster at SLU

Hi Ron,

Hopefully this thread will bring some life to this group...

>> Within reason I agree...  However, having things in one place
>> defeats the high availability of a cluster, but we may be talking
>> about different things here.
>
> no, this is actually funny thinking about uptime.

Yes, I agree.  Here is where I am coming from.  In 1995 I started
working with Sequent clusters.  I saw a need to provide clustering
such as what was done with VAX clusters some time ago.  In those
terms a large cluster is about 32 nodes, and the application for such
clusters is business software (Oracle and the like).  I do realize
that times have changed and clusters are much larger now.  But in
that time I have also done systems architecture at two major
corporations that calculated downtime in millions of dollars per
hour, so I am well aware of the impacts that need to be addressed.

Up until now my focus has been to provide application failover and
nothing more (in the tradition of the original Sequent clusters),
except for a few differences, most notably the lack of a distributed
lock manager.  Since the goal is simple application failover, it
wasn't needed.  I'm not up to date on what Oracle has been doing with
version 8; they may have implemented this outside of the O/S by now.
Version 7, which I did have exposure to, needed the support in the
O/S.

Again, my focus is on making a computing platform on which networking
services can be scaled in a reliable way.  That is, a platform with
NO single point of failure: every component has a redundant member.
I don't think I have to tell you that even this doesn't work
completely...  ;-)

>
> We're just trying to get to a system with no SPOF, harder than it
> looks.
>

Clear.  The "Monitor Node" does all of the administration in my
clustering system and the other nodes are passive.  There is a "lady
in waiting" should the master fail.  The successor is computed
dynamically as nodes enter and leave the cluster, in a deterministic
manner, so that there can be no doubt which node will take over the
monitor responsibility should the monitor node fail.  As the monitor
node updates its configuration, it passes the updates to the other
nodes.
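
To illustrate what I mean by "deterministic", here is a small sketch.
It is only an illustration with made-up names, not the actual code;
the real selection also has to cope with stale membership data, which
I get to below.

    /* Illustrative sketch only, not the real cluster code.        */
    /* Assumes every node has a unique numeric id and sees the     */
    /* same replicated membership table.                           */

    #include <stddef.h>

    struct cluster_node {       /* hypothetical per-node record    */
        int id;                 /* unique id, assigned at join     */
        int alive;              /* maintained by heartbeating      */
    };

    /*
     * Every node runs the same rule over the same membership list,
     * so all of them agree on the monitor without exchanging any
     * extra messages.  Here the rule is "lowest live id wins"; the
     * next-lowest live id would be the "lady in waiting".
     */
    int pick_monitor(const struct cluster_node *nodes, size_t n)
    {
        int best = -1;

        for (size_t i = 0; i < n; i++) {
            if (!nodes[i].alive)
                continue;
            if (best == -1 || nodes[i].id < best)
                best = nodes[i].id;
        }
        return best;            /* -1: no live node found */
    }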

There is a lot of logic to prevent stale nodes from entering the
cluster, and to avoid other mishaps like the "split brain" scenario.

>> reliable to get to 99.999% uptime.
>
> You can actually do this with one node.  It's doing it with lots of
> nodes that is hard.

Clear again, though I think only with an IBM mainframe or hardware of
that class of reliability.  But let's not split hairs over this; it
would take us off topic.

>
>> The cluster approach I designed has replication of configuration
>> that covers this, so your "Cluster Monitor" node can fail over
>> when that machine fails (should it...).
>
> How large have you made your system to date? how many nodes? have
> you built it?
>

Six nodes, and it works very well.  I have a new version that
provides a centralized interface to look at the uptime and statistics
of all of the nodes.  This is a prelude to a single process-table
image across all of the nodes in the cluster.  That is the next major
release, which is easily a year away (unless I find helpers! :-)

The idea is that wherever a process is started, it makes an entry in
the process table.  The PIDs are assigned in an N-modulus approach so
that the PID alone determines the home node of the process.  When a
process migrates, it keeps its entry on the home node and a new entry
is created on the new host node.  If it should move again, only the
home node is updated.  I haven't started implementing or benchmarking
this yet, so it could change, but that is the initial idea (a rough
sketch of the PID scheme is appended at the end of this message).

Since the model is a scalable networking application platform, all
aspects of a process move with it, including its sockets.  The idea
is that you can telnet into a machine and have your in.telnetd and
shell migrate to another machine without breaking the connection.
This uses a gateway device which keeps track of all of the sessions;
when a process moves, the session is updated to point to the new host
machine (also sketched at the end of this message).  The gateway
itself needs to be redundant, and this is where the current
generation of the cluster software is put to work.

There is no hard-coded limit on how many nodes can be in the cluster.
As I recall, MOSIX has such a limit.  Last I heard they also had some
issues with how to create a network-coherent memory space, and I
think there was some problem with open-sourcing it (because of some
military involvement in Israel).

But I have digressed.  The point is to apply an SMP approach to a
network of computers, as NUMA does, but without the O/S being a
single point of failure.  If a node dies, only the programs that had
resources there fail, and they can be immediately restarted.  The
larger the cluster, the smaller (hopefully) the impact, and then
calculating downtime becomes a simple matter of statistics.

Regards

Andy
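
To make the N-modulus idea a bit more concrete, here is a rough
sketch.  It is purely illustrative: the names are made up, and since
I haven't started implementing this, the real thing may look quite
different.

    /* Illustrative sketch of the N-modulus PID idea (not real code). */
    /* Assumes NNODES nodes numbered 0..NNODES-1 and a per-node       */
    /* counter used to hand out local PIDs.                           */

    #include <sys/types.h>

    #define NNODES 6             /* cluster size in this example */

    /*
     * PIDs are handed out so that (pid % NNODES) identifies the home
     * node: node k only ever allocates PIDs congruent to k mod NNODES.
     * Any node can then find the home node of a process from the PID
     * alone, with no lookup traffic.
     */
    static pid_t next_local_seq;  /* per-node allocation counter */

    pid_t alloc_cluster_pid(int my_node)
    {
        return (pid_t)(next_local_seq++ * NNODES + my_node);
    }

    int home_node(pid_t pid)
    {
        return (int)(pid % NNODES);
    }

When a process migrates, only its entry on the home node (found this
way) has to be repointed at the new host.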
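
And a similarly rough sketch of the gateway's session table, just to
show the redirection idea; again, none of these names are from the
actual software.

    /* Illustrative sketch of the gateway session table (not real code). */

    #include <sys/types.h>
    #include <netinet/in.h>
    #include <string.h>

    #define MAX_SESSIONS 1024

    /*
     * The gateway tracks every established connection and forwards
     * its traffic to cur_node.  When a process (and its sockets)
     * migrates, the cluster software repoints the session, so the
     * client's TCP connection never notices the move.
     */
    struct session {
        struct in_addr client_addr;  /* client side of the connection */
        in_port_t      client_port;
        int            in_use;
        int            cur_node;     /* node currently hosting the
                                        process serving this session  */
    };

    static struct session sessions[MAX_SESSIONS];

    /* Called when the serving process has moved to new_node. */
    void session_repoint(struct in_addr addr, in_port_t port, int new_node)
    {
        for (int i = 0; i < MAX_SESSIONS; i++) {
            struct session *s = &sessions[i];

            if (s->in_use &&
                s->client_port == port &&
                memcmp(&s->client_addr, &addr, sizeof(addr)) == 0)
                s->cur_node = new_node;
        }
    }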