From owner-freebsd-arch  Thu Jun 27 17:37:35 2002
Delivered-To: freebsd-arch@freebsd.org
Received: from www.xyz.com (www.xyz.com [199.26.172.28])
	by hub.freebsd.org (Postfix) with ESMTP id 5391137B4E6
	for <arch@FreeBSD.ORG>; Thu, 27 Jun 2002 17:35:50 -0700 (PDT)
Received: from www.xyz.com (localhost [127.0.0.1])
	by www.xyz.com (8.12.4/8.12.4) with ESMTP id g5S0ZJmP098253;
	Thu, 27 Jun 2002 17:35:19 -0700 (PDT)
	(envelope-from nerd@xyz.com)
Message-Id: <200206280035.g5S0ZJmP098253@www.xyz.com>
To: "Gary Thorpe" <gat7634@hotmail.com>
Cc: arch@FreeBSD.ORG
From: nerd@xyz.com
Subject: Re: Larry McVoy's slides on cache coherent clusters 
In-reply-to: Your message of "Thu, 27 Jun 2002 14:18:31 EDT."
             <F115p3MSi6xzmeWgUSp000017fa@hotmail.com> 
Date: Thu, 27 Jun 2002 17:35:19 -0700
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-arch.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-arch>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-arch>
X-Loop: FreeBSD.ORG

So you know where I'm coming from, I used to be an engineer in the
base OS group (I owned the disk driver) at Sequent, the company with
the best NUMA product out there even if we went the way of Beta VCRs.

>The slides seem to be talking about NUMA (Non-Uniform Memory Access) 
>machines which use CC (Cache Coherency). These types of machines implement a 
>cluster purely in hardware from what I have read of them (single memory 
>address space is really distributed shared memory coordinated in hardware by 
>high speed switches etc) and use much faster/lower latency communication 
>methods. Examples would be SGI's Origin2000 and Origin3000 and maybe Sun's 
>Starfire line. The big advantage is scaling and redundancy, since no one 
>part of the system is essential for the whole thing working (which is how 
>clusters should also work ideally).

We (Sequent) were the first and best implementation out there with our
NUMA-Q line...  SGI & Sun both rely on huge memory backbones rather
than finesse in software to achieve performance and they still fall
short.  DG tried too but I've heard nothing of them of late, sort of
like the US vice presidents (quick, name the last 4).

NUMA buys you no redundancy in the real sense of the word, that is,
the hardware architecture is more complex and thus more likely to
fail.  Of course since you have a number of quads (or whatever an
implementation may choose for the basic unit) once you've had a
hardware fault you can easily remove a single quad and reboot.
Unfortunately your uptime requirements have gone to hell the second a
reboot is needed.  As far as scaling goes, you are right, code with
minimal SMP awareness (Oracle) running on a top notch OS will scale
incredibly well.

>I think this ties in to Mr. Lambert's question about the future of FreeBSD 
>very much. I think the NUMA model will eventually dominate all future large 
>systems in the next 10 years (and SMP will come to be standard on small 
>systems) and FreeBSD will probably have to run efficiently on them to compete 
>with Linux etc. Having seamless clusters (by this I mean clusters that work 
>as a single system with one system image and identity) would probably be 
>an interesting problem also, since only a few OSes have made any serious 
>attempt at implementing them. PVM, MPI, and MOSIX cannot for example migrate 
>I/O among machines (network load balancing maybe?).

*TO ME* clustering and single memory image are contradictory.  You
cluster for redundancy, that is to get rid of any and all single
points of failure.  If the janitor trips over a power cord thus taking
a big bite out of your memory space you'll quickly realize that this
is not redundancy.

At Sequent we found that the #1 key to scalability in a NUMA world was
to NEVER move memory from one quad to the next.  This means that
programs should try to migrate between procs on the same quad if
possible, and move off quad only as a last resort.  Memory allocation has
to be very aware of the fact that it is running on a collection of SMP
boxen with high costs to go from proc-to-proc and prohibitive costs to
go from quad-to-quad.  Of course it follows that I/O must never be
allowed to move over the memory backplane if possible.  We had quad
aware routing at all layers of the I/O stack to achieve this.
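
To make the "stay on quad, leave only as a last resort" rule concrete,
here is a minimal sketch in Python.  It is an illustration only, not
Sequent's actual code: the Cpu record, the pick_cpu function, the load
metric and the max_load threshold are all hypothetical names and
values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    quad: int   # which quad (SMP building block) this processor sits on
    load: int   # runnable tasks queued here (hypothetical metric)


def pick_cpu(cpus, home_quad, max_load=2):
    """Choose a CPU for a task whose memory lives on home_quad.

    Assumes cpus is non-empty.  max_load is an assumed threshold above
    which a CPU counts as too busy to take the task.
    """
    on_quad = [c for c in cpus if c.quad == home_quad]

    # First choice: any home-quad CPU with spare capacity, so the task
    # keeps running next to its memory.
    local = [c for c in on_quad if c.load < max_load]
    if local:
        return min(local, key=lambda c: c.load)

    # Last resort: go off quad, but only to a CPU that has capacity.
    remote = [c for c in cpus if c.quad != home_quad and c.load < max_load]
    if remote:
        return min(remote, key=lambda c: c.load)

    # Everything is busy: stay home rather than pay the quad-to-quad cost.
    return min(on_quad or cpus, key=lambda c: c.load)


if __name__ == "__main__":
    cpus = [Cpu(0, quad=0, load=1), Cpu(1, quad=0, load=3),
            Cpu(4, quad=1, load=0)]
    print(pick_cpu(cpus, home_quad=0))
```

Note that in the usage example the task stays on quad 0 even though a
completely idle CPU exists on quad 1; under this sketch the remote CPU
is considered only once every home-quad CPU has reached the threshold.
The same preference order could in principle be applied to memory
allocation and I/O routing, with the home quad being wherever the
buffer or device lives.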

Of course YMMV.  Last I looked neither Sun nor SGI had figured out how
to squeeze the performance and scalability that we had.  IBM, who
bought, chewed up, and then threw Sequent away, didn't seem to have the
corporate acuity to realize that there were lessons to be learned from
small companies.  Oh well, I'm bitter, sue me, no, forget that, IBM
probably will.

In another email on the same thread, Matt Dillon wrote:

>NUMA then becomes just another, faster transport mechanism.  That is
>the direction I believe the BSDs will take... transparent clustering
>with NUMA transport, network transport, or a hybrid of both.

Matt: If you don't have a single memory image you don't have NUMA.
If you do have it then the transport mechanism will be saturated just
moving "RAM" around and will not be available for network, I/O or
whatever else.

-michael

michael at michael dot galassi dot org
