From owner-freebsd-cluster  Wed Mar  6 08:24:59 2002
Date: Wed, 06 Mar 2002 17:24:24 +0100 (MET)
From: Andy Sporner
Reply-To: Andy Sporner
Organization: NENTEC Netzwerktechnologie GmbH
To: Ronald G Minnich
Cc: freebsd-cluster@FreeBSD.ORG, Jason Fried
Subject: RE: FreeBSD Cluster at SLU

Hi Ron,

Hopefully this thread will bring some life to this group...

>> Within reason I agree...  However, having things in one place
>> defeats the high availability of a cluster, but we may be talking
>> about different things here.
>
> no, this is actually funny thinking about uptime.

Yes, I agree.  Here is where I am coming from.  In 1995 I started
working with Sequent clusters.  I saw a need to provide clustering
such as what was done with VAX clusters some time ago.  In those
terms a large cluster is about 32 nodes, and the application for such
clusters is business software (Oracle and the like).  I do realize
that times have changed and clusters are much larger now.  But in
that time I have also done systems architecture at two major
corporations that calculated downtime in millions of dollars per
hour, so I am well aware of the impacts that need to be addressed.

Up until now my focus has been to provide application failover and
nothing more (in the tradition of the original Sequent clusters),
except for a few differences, most notably the lack of a distributed
lock manager.  Since the goal is simple application failover, it
wasn't needed.  I'm not up to date on what Oracle has been doing with
version 8; they may have implemented this outside of the O/S by now.
Version 7, which I did have exposure to, needed the support in the
O/S.

Again, my focus is on making a computing platform on which networking
services can be scaled in a reliable way.  That is, a platform with
NO single point of failure: every component has a redundant member.
I don't think I have to tell you that even this doesn't work
completely...  ;-)

>
> We're just trying to get to a system with no SPOF, harder than it
> looks.
>

Clear.  The "Monitor Node" does all of the administration in my
clustering system and the other nodes are passive.  There is a "lady
in waiting" should the master fail.  The successor is computed
dynamically as nodes enter and leave the cluster, in a deterministic
manner, so that there can be no doubt which node will take over the
monitor responsibility should the monitor node fail.  As the monitor
node updates its configuration, it passes the updates to the other
nodes.
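
To illustrate what I mean by "deterministic", here is a small sketch.
It is only an illustration with made-up names, not the actual code;
the real selection also has to cope with stale membership data, which
I get to below.

    /* Illustrative sketch only, not the real cluster code.        */
    /* Assumes every node has a unique numeric id and sees the     */
    /* same replicated membership table.                           */

    #include <stddef.h>

    struct cluster_node {       /* hypothetical per-node record    */
        int id;                 /* unique id, assigned at join     */
        int alive;              /* maintained by heartbeating      */
    };

    /*
     * Every node runs the same rule over the same membership list,
     * so all of them agree on the monitor without exchanging any
     * extra messages.  Here the rule is "lowest live id wins"; the
     * next-lowest live id would be the "lady in waiting".
     */
    int pick_monitor(const struct cluster_node *nodes, size_t n)
    {
        int best = -1;

        for (size_t i = 0; i < n; i++) {
            if (!nodes[i].alive)
                continue;
            if (best == -1 || nodes[i].id < best)
                best = nodes[i].id;
        }
        return best;            /* -1: no live node found */
    }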

There is a lot of logic to prevent stale nodes from entering the
cluster, and to avoid other mishaps like the "split brain" scenario.

>> reliable to get to 99.999% uptime.
>
> You can actually do this with one node.  It's doing it with lots of
> nodes that is hard.

Clear again, though I think only with an IBM mainframe or hardware of
that class of reliability.  But let's not split hairs over this; it
would take us off topic.

>
>> The cluster approach I designed has replication of configuration
>> that covers this, so your "Cluster Monitor" node can fail over
>> when that machine fails (should it...).
>
> How large have you made your system to date? how many nodes? have
> you built it?
>

Six nodes, and it works very well.  I have a new version that
provides a centralized interface to look at the uptime and statistics
of all of the nodes.  This is a prelude to a single process-table
image across all of the nodes in the cluster.  That is the next major
release, which is easily a year away (unless I find helpers! :-)

The idea is that wherever a process is started, it makes an entry in
the process table.  The PIDs are assigned in an N-modulus approach so
that the PID alone determines the home node of the process.  When a
process migrates, it keeps its entry on the home node and a new entry
is created on the new host node.  If it should move again, only the
home node is updated.  I haven't started implementing or benchmarking
this yet, so it could change, but that is the initial idea (a rough
sketch of the PID scheme is appended at the end of this message).

Since the model is a scalable networking application platform, all
aspects of a process move with it, including its sockets.  The idea
is that you can telnet into a machine and have your in.telnetd and
shell migrate to another machine without breaking the connection.
This uses a gateway device which keeps track of all of the sessions;
when a process moves, the session is updated to point to the new host
machine (also sketched at the end of this message).  The gateway
itself needs to be redundant, and this is where the current
generation of the cluster software is put to work.

There is no hard-coded limit on how many nodes can be in the cluster.
As I recall, MOSIX has such a limit.  Last I heard they also had some
issues with how to create a network-coherent memory space, and I
think there was some problem with open-sourcing it (because of some
military involvement in Israel).

But I have digressed.  The point is to apply an SMP approach to a
network of computers, as NUMA does, but without the O/S being a
single point of failure.  If a node dies, only the programs that had
resources there fail, and they can be immediately restarted.  The
larger the cluster, the smaller (hopefully) the impact, and then
calculating downtime becomes a simple matter of statistics.

Regards

Andy
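
To make the N-modulus idea a bit more concrete, here is a rough
sketch.  It is purely illustrative: the names are made up, and since
I haven't started implementing this, the real thing may look quite
different.

    /* Illustrative sketch of the N-modulus PID idea (not real code). */
    /* Assumes NNODES nodes numbered 0..NNODES-1 and a per-node       */
    /* counter used to hand out local PIDs.                           */

    #include <sys/types.h>

    #define NNODES 6             /* cluster size in this example */

    /*
     * PIDs are handed out so that (pid % NNODES) identifies the home
     * node: node k only ever allocates PIDs congruent to k mod NNODES.
     * Any node can then find the home node of a process from the PID
     * alone, with no lookup traffic.
     */
    static pid_t next_local_seq;  /* per-node allocation counter */

    pid_t alloc_cluster_pid(int my_node)
    {
        return (pid_t)(next_local_seq++ * NNODES + my_node);
    }

    int home_node(pid_t pid)
    {
        return (int)(pid % NNODES);
    }

When a process migrates, only its entry on the home node (found this
way) has to be repointed at the new host.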
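
And a similarly rough sketch of the gateway's session table, just to
show the redirection idea; again, none of these names are from the
actual software.

    /* Illustrative sketch of the gateway session table (not real code). */

    #include <sys/types.h>
    #include <netinet/in.h>
    #include <string.h>

    #define MAX_SESSIONS 1024

    /*
     * The gateway tracks every established connection and forwards
     * its traffic to cur_node.  When a process (and its sockets)
     * migrates, the cluster software repoints the session, so the
     * client's TCP connection never notices the move.
     */
    struct session {
        struct in_addr client_addr;  /* client side of the connection */
        in_port_t      client_port;
        int            in_use;
        int            cur_node;     /* node currently hosting the
                                        process serving this session  */
    };

    static struct session sessions[MAX_SESSIONS];

    /* Called when the serving process has moved to new_node. */
    void session_repoint(struct in_addr addr, in_port_t port, int new_node)
    {
        for (int i = 0; i < MAX_SESSIONS; i++) {
            struct session *s = &sessions[i];

            if (s->in_use &&
                s->client_port == port &&
                memcmp(&s->client_addr, &addr, sizeof(addr)) == 0)
                s->cur_node = new_node;
        }
    }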