From owner-freebsd-cluster Thu Jul 11 1:56:49 2002
Delivered-To: freebsd-cluster@freebsd.org
Message-ID: <3D2D483E.4040100@nentec.de>
Date: Thu, 11 Jul 2002 10:56:30 +0200
From: Andy Sporner
To: aaron g
Cc: cliftonr@lava.net, freebsd-cluster@FreeBSD.ORG
Subject: Re: SPREAD clusters
References: <20020709212404.16403.qmail@operamail.com>

Hi,

I will try to answer both emails at once ;-)  (wow! multitasking batch
mode!)  I think a good clarification is in order. ;-)

My idea of a "perfect cluster" is one where applications don't realize
that they are running on one, in addition to achieving the five 9's of
reliability (99.999% uptime, or roughly five minutes of downtime per
year).

In my working experience I saw the clustering system at Fermi-Lab in
the early 90's and had my opinions about it.  I have worked with
clusters on Dynix/PTX (Sequent) since 1995 and even gave back some
enhancements.  In the mid '90s I formed a company to make
load-balancing applications, and the rights were sold to a company in
Boston that makes 1U servers.  Now I am working for a German company
making a very high-speed network control switch (one that can make
complex routing decisions on network traffic), which could be used to
front-end such a cluster as I am proposing--though a software solution
will work equally well ;-)

While working on Sequent clusters I got familiar with the Numa-Q
product and its workings.  I came to the conclusion that it only
addressed the SMP bottleneck (Amdahl's law) but really didn't add that
much more reliability.  So the idea for Phase 2 was to make a
'Numa'-like system--which in effect it is--that removes the OS on a
node as a single point of failure.  In their Numa architecture there
was one instance of the OS across many physical nodes, using special
hardware to address any page of memory across the complex making up a
node.  The problem is that one member of the complex could bring down
the entire system.

What I wanted to do was to start where they were, but have a separate
OS image on each node with a cooperative process space--yes, like
Mosix, but totally transparent.  When a system becomes too busy, rather
than swapping a process out to disk, it can be swapped to another node.
Sort of like SMP (Symmetric Multi-Processing) across a network.
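To make the "swap a busy process to another node instead of to disk"
idea a bit more concrete, here is a minimal C sketch of the kind of
decision involved.  This is hypothetical code, not the phase-2
implementation; the node structure, the load metric, and the busy
threshold are all invented for illustration.

/*
 * Hypothetical sketch: when the local node is "too busy", pick the
 * least-loaded peer as a migration target instead of swapping the
 * process out to local disk.
 */
#include <stdio.h>
#include <stddef.h>

struct node {
	const char *name;
	double      load;	/* e.g. 1-minute load average per CPU */
};

/* Returns the index of the least-loaded peer, or -1 to swap to disk. */
static int
pick_migration_target(const struct node *peers, size_t npeers,
    double local_load, double busy_threshold)
{
	int	best = -1;
	double	best_load = local_load;
	size_t	i;

	if (local_load < busy_threshold)
		return (-1);		/* not busy enough to bother */

	for (i = 0; i < npeers; i++) {
		if (peers[i].load < best_load) {
			best = (int)i;
			best_load = peers[i].load;
		}
	}
	return (best);		/* -1 means every peer is busier than us */
}

int
main(void)
{
	struct node peers[] = {
		{ "node1", 0.9 }, { "node2", 2.4 }, { "node3", 0.3 },
	};
	int target = pick_migration_target(peers,
	    sizeof(peers) / sizeof(peers[0]), 1.8, 1.0);

	if (target >= 0)
		printf("migrate process to %s\n", peers[target].name);
	else
		printf("swap to local disk as usual\n");
	return (0);
}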
If a node dies, just those processes that had memory there die, but the
OS (cooperatively speaking) just goes on running--and rather than
waiting for a reboot, the dead processes simply get restarted.  With
such a system, the five 9's should be very easy to reach.  That being
said, there are a lot of challenges--especially with respect to the
system scheduler and the VM system--that have to be addressed.  I have
a rough concept that I have been going over through the last 4 years
and have never had a chance to commit to a document.  I suppose it is
probably about time to do so.  I even came up with a way that network
applications can survive a node move as well, though it requires a
special protocol and a front-end device to achieve this.  To cover the
front-end device and other potential single points of failure we have
phase-1 of the clustering software, but ultimately phase-2 should
completely replace phase-1 for everything else.

Speaking of phase-1, the goal is simple generic failover of
applications.  There is a small feature, which didn't cost much in the
implementation, that adds a weight to each application so that
applications can be started on nodes in a more intelligent manner with
respect to the resources on the machine.  For the moment the weights
are static (i.e., the weights of the applications already running on a
node are summed to find out whether enough resources are present to
start a new one).  Instead of looking at the configured maximum weight,
the actual application usage (by merit of the CSE patch) could be
collected instead.  From this, jobs can be shut down and restarted on
other nodes when the statistics on a node change.

Bye!

Andy
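As a rough illustration of the static weight check described above:
sum the configured weights of the applications already running on a
node and start the new application only if the node's capacity still
has room for it.  This is a hypothetical sketch, not the actual phase-1
code; the structures, names, and numbers are invented.  With the CSE
statistics mentioned above, the static weights could simply be replaced
by measured usage.

/*
 * Hypothetical sketch of a static weight-based placement check.
 */
#include <stdio.h>
#include <stddef.h>

struct app {
	const char *name;
	int         weight;	/* configured maximum weight */
};

/* Can 'candidate' be started on a node of the given capacity? */
static int
node_can_start(const struct app *running, size_t nrunning,
    const struct app *candidate, int node_capacity)
{
	int	used = 0;
	size_t	i;

	for (i = 0; i < nrunning; i++)
		used += running[i].weight;

	return (used + candidate->weight <= node_capacity);
}

int
main(void)
{
	struct app running[] = { { "httpd", 30 }, { "postgres", 50 } };
	struct app newapp = { "imapd", 25 };

	if (node_can_start(running,
	    sizeof(running) / sizeof(running[0]), &newapp, 100))
		printf("start %s here\n", newapp.name);
	else
		printf("try another node\n");
	return (0);
}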
aaron g wrote:

> From my limited knowledge of the project I believe this is
> not out of the question.  In fact it may have been Andy
> himself who made reference to the VMS-like clustering
> technology.  I'm not really in a place to give a definitive
> answer but I too am interested in a solution for the
> situation you describe.
>
> - aarong
>
> ----- Original Message -----
> From: Clifton Royston
> Date: Tue, 9 Jul 2002 09:51:32 -1000
> To: Andy Sporner
> Subject: Re: SPREAD clusters
>
>>   Is there a document explaining the scope of the project, what kinds
>> of problems it's intended to address, and the overall outline or
>> roadmap?  I'm having a hard time getting that from the URL you posted.
>> (I'm also new to this list, obviously.)
>>
>>   Is the current project aimed at application failover and
>> load-balancing for specific applications, i.e. providing the software
>> equivalent of a "layer 4" or load-balancing Ethernet switch?
>>
>>   Or does it generically instantiate all network applications into
>> the same failover and load-balancing environment?
>>
>>   Or is it more like Mosix, in which servers join a kind of "hive
>> mind" where any processor can vfork() a process onto a different
>> server with more RAM/CPU available, but processes have to remain on
>> the original machine to do device I/O?
>>
>>   Or is it like Digital's (R.I.P.) Vax VMS or "TrueUNIX" clustering,
>> where for most purposes the clustered servers behaved like a single
>> machine, with shared storage, unified access to file systems and
>> devices, etc.?
>>
>>   My main practical interest is in the nitty-gritty of building
>> practical highly reliable and highly scalable mail server clusters,
>> both for mail delivery (SMTP, LMTP) and mail retrieval (POP, IMAP).
>> The main challenge in doing this right now is dealing with the need
>> for all servers to have a coherent common view of the file systems
>> where mail is stored.  This means the cluster solution needs to
>> include shared storage, either via NFS or via some better mechanism
>> which provides reliable sharing of file systems between multiple
>> servers and allows for server failure without interruption of data
>> access.
>>
>>   Is this kind of question outside the scope of the current project?
>>
>>   -- Clifton
>>
>> --
>>  Clifton Royston  --  LavaNet Systems Architect --  cliftonr@lava.net
>> "What do we need to make our world come alive?
>>  What does it take to make us sing?
>>  While we're waiting for the next one to arrive..." - Sisters of Mercy
>>
>> To Unsubscribe: send mail to majordomo@FreeBSD.org
>> with "unsubscribe freebsd-cluster" in the body of the message
>>
>
>

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-cluster" in the body of the message