From owner-freebsd-arch Thu Jun 27 17:37:35 2002
Message-Id: <200206280035.g5S0ZJmP098253@www.xyz.com>
To: "Gary Thorpe"
Cc: arch@FreeBSD.ORG
From: nerd@xyz.com
Subject: Re: Larry McVoy's slides on cache coherent clusters
In-reply-to: Your message of "Thu, 27 Jun 2002 14:18:31 EDT."
Date: Thu, 27 Jun 2002 17:35:19 -0700
Sender: owner-freebsd-arch@FreeBSD.ORG

So you know where I'm coming from, I used to be an engineer in the
base OS group (I owned the disk driver) at Sequent, the company with
the best NUMA product out there, even if we went the way of Beta VCRs.

>The slides seem to be talking about NUMA (Non-Uniform Memory Access)
>machines which use CC (Cache Coherency). These types of machines implement a
>cluster purely in hardware from what I have read of them (single memory
>address space is really distributed shared memory coordinated in hardware by
>high speed switches etc) and use much faster/lower latency communication
>methods. Examples would be SGI's Origin2000 and Origin3000 and maybe Sun's
>Starfire line. The big advantage is scaling and redundancy, since no one
>part of the system is essential for the whole thing working (which is how
>clusters should also work ideally).

We (Sequent) had the first and best implementation out there with our
NUMA-Q line. SGI and Sun both rely on huge memory backbones rather
than finesse in software to achieve performance, and they still fall
short.
DG tried too, but I've heard nothing of them of late, sort of like
the US vice presidents (quick, name the last 4).

NUMA buys you no redundancy in the real sense of the word; if
anything, the hardware architecture is more complex and thus more
likely to fail. Of course, since you have a number of quads (or
whatever an implementation may choose for the basic unit), once you've
had a hardware fault you can easily remove a single quad and reboot.
Unfortunately, your uptime requirements have gone to hell the second a
reboot is needed.

As far as scaling goes, you are right: code with minimal SMP awareness
(Oracle) running on a top-notch OS will scale incredibly well.

>I think this ties in to Mr. Lambert's question about the future of FreeBSD
>very much. I think the NUMA model will eventually dominate all future large
>systems in the next 10 years (and SMP will come to be standard on small
>systems) and FreeBSD will probably have to run efficiently on them to compete
>with Linux etc. Having seamless clusters (by this I mean clusters that work
>as a single system with one system image and identity) would probably be
>an interesting problem also, since only a few OSes have made any serious
>attempt at implementing them. PVM, MPI, and MOSIX cannot for example migrate
>I/O among machines (network load balancing maybe?).

*TO ME* clustering and single memory image are contradictory. You
cluster for redundancy, that is, to get rid of any and all single
points of failure. If the janitor trips over a power cord, thus taking
a big bite out of your memory space, you'll quickly realize that this
is not redundancy.

At Sequent we found that the #1 key to scalability in a NUMA world was
to NEVER move memory from one quad to the next. This means that
programs should try to migrate between processors on the same quad if
possible, and only move off quad as a last resort.
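
The "stay on your quad" preference described above can be sketched
roughly as follows. This is only an illustration, not Sequent's
scheduler code: the names pick_cpu and quad_of, the flat CPU numbering,
and the figure of four processors per quad are all assumptions made for
the example.

```python
from typing import Iterable, Optional

CPUS_PER_QUAD = 4  # assumption for illustration: four processors per quad


def quad_of(cpu: int, cpus_per_quad: int = CPUS_PER_QUAD) -> int:
    """Return the index of the quad that owns a given CPU number."""
    return cpu // cpus_per_quad


def pick_cpu(last_cpu: int, idle_cpus: Iterable[int],
             cpus_per_quad: int = CPUS_PER_QUAD) -> Optional[int]:
    """Choose where to run a process that last ran on last_cpu.

    Preference order (hypothetical policy):
      1. the same CPU, so caches stay warm;
      2. another idle CPU on the same quad, so memory stays local;
      3. an idle CPU on any other quad, only as a last resort.
    Returns None when no CPU is idle.
    """
    idle = sorted(set(idle_cpus))
    if not idle:
        return None
    if last_cpu in idle:
        return last_cpu
    home = quad_of(last_cpu, cpus_per_quad)
    same_quad = [c for c in idle if quad_of(c, cpus_per_quad) == home]
    if same_quad:
        return same_quad[0]
    # Leaving the quad: the process's memory is now remote.
    return idle[0]
```

For a process that last ran on CPU 1, pick_cpu(1, [3, 4]) prefers CPU 3
(same quad) over CPU 4, and only goes off quad when nothing in quad 0
is idle.
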
Memory allocation has to be very aware of the fact that it is running
on a collection of SMP boxen, with high costs to go from processor to
processor and prohibitive costs to go from quad to quad. It follows
that I/O should be kept off the memory backplane whenever possible. We
had quad-aware routing at all layers of the I/O stack to achieve this.

Of course, YMMV. Last I looked, neither Sun nor SGI had figured out
how to squeeze out the performance and scalability that we had. IBM,
who bought, chewed up, and then threw Sequent away, didn't seem to
have the corporate acuity to realize that there were lessons to be
learned from small companies. Oh well, I'm bitter, sue me; no, forget
that, IBM probably will.

In another email on the same thread, Matt Dillon wrote:

>NUMA then becomes just another, faster transport mechanism. That is
>the direction I believe the BSDs will take... transparent clustering
>with NUMA transport, network transport, or a hybrid of both.

Matt: If you don't have a single memory image, you don't have NUMA. If
you do have it, then the transport mechanism will be saturated just
moving "RAM" around and will not be available for network, I/O, or
whatever else.

-michael
michael at michael dot galassi dot org