Date: Thu, 26 Jul 2001 12:42:09 -0700 (PDT)
From: Hugh LaMaster
To: freebsd-hackers@FreeBSD.ORG
Subject: Re: MPP and new processor designs.

On Wed, 25 Jul 2001, Christopher R. Bowman wrote:

> "Leo Bicknell" wrote:
> >
> > A number of new chips have been released lately, along with some
> > enhancements to existing processors, that all fall into the same
> > logic of parallelizing some operations.  Why, just today I ran
> > across an article (http://www.theregister.co.uk/content/3/20576.html)
> > about a chip which boasts 128 ALUs on a single chip.
> >
> > This got me to thinking about an interesting way of using these
> > chips.  Rather than letting the hardware parallelize instructions
> > from a single stream, what about feeding it multiple streams of
> > instructions?  That is, treat it like multiple CPUs running two
> > (or more) processes at once.
> >
> > I'm sure the hardware isn't quite designed for this at the moment,
> > so it couldn't "just be done", but if you had, say, 128 ALUs,
> > most single-user systems could dedicate one ALU to a process and
> > never context switch in the traditional sense.  For systems that
> > run lots of processes, the rate limiting on a single process
> > wouldn't be a big issue, and you could gain a lot of efficiency
> > globally by not context switching in the traditional sense.
> >
> > Does anyone know of something like this being tried?  Traditional
> > 2-8 way SMP systems probably don't have enough processors (I'm
> > thinking 64 is the minimum to make this interesting) and require
> > other glue to make multiple independent processors work together.
> > Has anyone tried this with them all in one package, all clocked
> > together, etc.?

To answer the question, "Does anyone know of something like this being
tried?" -- yes, all of this has been done before.  There were some
"data flow" features in one of the IBM supercomputers of the 1960's,
and the Denelcor HEP of the 1980's sounds similar to what is described
above, although without more detail I can't be sure.

Anyone interested in parallel architectures might want to read the
archives of the Usenet newsgroup comp.arch, which carried news of, and
debates about, these and many other architectural ideas.  IEEE Micro
also carried many good articles on architectural issues during the
1990's.

To summarize 40 years of debate about these issues [ ;-) ;-) ;-) ],
a few widely accepted comp.arch principles would be:

- Other technologies will play minor roles until CMOS runs out of gas,
  which should happen any day now ;-) (a prediction that has been made
  continuously for the last decade).
- Since the mid-70's (that is, 25 years now), logic/gates/real estate
  are no longer (economically) scarce.

- Therefore, the key to the value/efficiency of any computer
  architecture is how well it uses memory.

- There are two key components of memory hierarchy performance:
  latency and bandwidth.

- Different applications have different requirements w.r.t. latency
  and bandwidth.  Some require the fastest possible effective latency
  ("traditional" jobs); some can benefit from greatly increased
  bandwidth at the expense of increased latency (traditional
  supercomputer jobs, including large numerical simulations, image
  processing, and other "vectorizable" jobs); and some jobs are
  amenable to large numbers of threads working on the parts of a
  decomposed problem ("parallelizable" jobs).

The short answer is that a tremendous amount of time and energy has
gone into various approaches to these problems; many books have been
written, many papers published, and a large knowledge base has been
built up over the years.

> As I work for the above-mentioned processor company, I thought I
> might jump in here rather quickly and dispel any notion that you
> will be running any type of Linux or Unix on these processors any
> time soon.
>
> This chip is a reconfigurable dataflow architecture with support for
> control flow.  You really need to think about this chip in the
> dataflow paradigm.  In addition, you have to examine what the
> reporter said.  While it is true that there are 128 ALUs on the chip
> and that it can perform in the neighborhood of 15 BOPS, these are
> only ALUs; they are not full processors.  They don't run a program
> as you would on a typical von Neumann processor.  The ALUs don't
> even have a program counter (not to mention MMUs).  Instead, to
> program one of these chips you tell each ALU what function to
> perform and tell the ALUs how their input and output ports are
> connected.  Then you sit back and watch as the data streams through
> in a pipelined fashion.  Because all the ALUs operate in parallel,
> you can get some spectacular operations/second counts even at low
> frequencies.  Think of it: even at only 100 MHz, 100 ALUs operating
> in parallel give you 10 billion operations per second.
>
> Finally, I do think that perhaps we have hit the point of
> diminishing returns with the current complexity of processors.  Part
> of the Hennessy/Patterson approach to architecture that led to RISC
> was not reduction of instruction sets because that is good as a goal
> in its own right, but rather a reduction of complexity as an
> engineering design goal, since this leads to faster product design
> cycles, which allows you to more aggressively target and take
> advantage of improving process technology.  I think the time may
> come when we want to dump huge caches and multiway superscalar
> processing, since they take up lots of die space and pay diminishing
> returns.  Perhaps in the future we would be better off with 20 or 50
> simple first-generation MIPS-type cores on a chip.  In a large
> multi-user system with a high availability of jobs you might be able
> to leverage the design of the single core into truly high aggregate
> performance.  It would, of course, do nothing for the single-user
> workstation where you are only surfing or word processing, but in a
> large commercial setting with lots of independent jobs you might see
> better utilization of all that silicon by running more processes
> slower.
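To make the dataflow picture above a bit more concrete, here is a toy
C sketch of the idea -- my own illustration only, not the actual
programming model of the chip Christopher describes.  Each "ALU" is
configured with one fixed function, the ALUs are "wired" output to
input, and data is then streamed through the resulting pipeline; the
function names and wiring are invented for the example.

    /*
     * Toy software model of a dataflow pipeline: each "ALU" applies
     * one fixed operation, and its output feeds the next ALU's input.
     * Purely illustrative; the functions and wiring are made up.
     */
    #include <stdio.h>

    #define NALU 4                  /* a real part might have ~128 */

    typedef int (*alu_fn)(int);

    static int add3(int x) { return (x + 3); }
    static int mul2(int x) { return (x * 2); }
    static int sub1(int x) { return (x - 1); }
    static int xor5(int x) { return (x ^ 5); }

    int
    main(void)
    {
            /*
             * "Configure" the chip: assign each ALU a function; the
             * order of the array is the wiring (ALU i feeds ALU i+1).
             */
            alu_fn alu[NALU] = { add3, mul2, sub1, xor5 };
            int i, v, t;

            /* Stream tokens through the pipeline. */
            for (v = 0; v < 8; v++) {
                    t = v;
                    for (i = 0; i < NALU; i++)
                            t = alu[i](t);
                    printf("in %d -> out %d\n", v, t);
            }

            /*
             * With the pipe full, every ALU does one op per cycle, so
             * peak ops/sec = clock rate * number of ALUs:
             * 100e6 Hz * 100 ALUs = 1e10 (10 billion ops/sec).
             */
            printf("peak = %.0f ops/sec\n", 100e6 * 100);
            return (0);
    }

Once such a pipeline is full, every ALU retires one operation per
cycle, which is where the "100 MHz x 100 ALUs = 10 billion
operations/second" arithmetic comes from.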
In my view (this is 100% personal opinion), there may be little to be
gained by optimizing the logic portion of the CPU beyond what has
already been done.  In general, I think you start with a simple
architecture, such as the MIPS mentioned above, optimize it with
superscalar features (as has been done), and use all the remaining
chip real estate for L2 and L3 cache -- you can't have too much L3
cache.  Then, once you go off-chip, the next step is to replicate the
whole thing up to the single shared-memory system limit (which some
reckon to be somewhere between 64 and 1024 processors), and then to
start clustering such systems together.

In a *BSD Hackers context, that means supporting:

- efficient thread scheduling for lots of threads per process -- for
  example, 64, each running on a different processor of a 64+
  processor system [ rfork(2), rfork_thread(3); a minimal sketch
  appears in the postscript below ]

- efficient, scalable SMP support up to the processor hardware
  communication limit, which could be O(2^10) processors:

  http://www.sgi.com/newsroom/3rd_party/071901_nasa.html

At the same time, whatever the economic single-system size limit turns
out to be -- and whether that limit comes from contention for shared
data structures in the OS or from shared-memory hardware limits --
systems will need to be clustered at some point.  Lots of clustering
work has gone on:

  http://www.beowulf.org/
  http://stonesoup.esd.ornl.gov/
  http://www.scientificamerican.com/2001/0801issue/0801hargrove.html
  http://www.globus.org/

Clusters of 512 systems have already been built, with 1250 on the
drawing board.  So, potentially, one could have 2^20+ CPUs in a single
cluster.  Then, with "Grid" software, such a cluster could be
connected to other clusters over the net with certain common services:

  http://www.globus.org/

For some reason, much more work on SMP, cluster software, and Grid
software seems to have been done on Linux (and on many commercial
operating systems) than on BSD-based systems -- I'm not sure why.

--
 Hugh LaMaster, M/S 233-21,      Email: lamaster@nas.nasa.gov
 NASA Ames Research Center       Or:    lamaster@nren.nasa.gov
 Moffett Field, CA 94035-1000    Or:    lamaster@kinkajou.arc.nasa.gov
 Phone: 650/604-1056             Disc:  Unofficial, personal *opinion*.
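P.S.  For anyone who wants to experiment with the rfork(2) /
rfork_thread(3) approach mentioned above, here is a minimal,
illustrative sketch of my own (not a recommended threads
implementation): it creates one extra kernel-visible thread of
execution that shares the parent's address space via RFMEM and runs
on its own stack.  Error handling and synchronization are deliberately
minimal.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define STACKSIZE (64 * 1024)

    static volatile int counter;    /* visible to both, via RFMEM */

    static int
    thread_main(void *arg)
    {
            int id = *(int *)arg;

            counter += id;          /* touches the shared address space */
            printf("thread %d done, counter now %d\n", id, counter);
            return (0);             /* the new thread exits with this status */
    }

    int
    main(void)
    {
            char *stack;
            int id = 1;
            pid_t pid;

            if ((stack = malloc(STACKSIZE)) == NULL)
                    err(1, "malloc");

            /* The stack grows down, so hand the new thread its top. */
            pid = rfork_thread(RFPROC | RFMEM, stack + STACKSIZE,
                thread_main, &id);
            if (pid == -1)
                    err(1, "rfork_thread");

            if (waitpid(pid, NULL, 0) == -1)
                    err(1, "waitpid");
            printf("parent sees counter = %d\n", counter);
            return (0);
    }

Scheduling 64 of these efficiently across 64 processors is, of course,
exactly the part the kernel has to get right.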