Date: Thu, 26 Jul 2001 12:42:09 -0700 (PDT)
From: Hugh LaMaster
To: freebsd-hackers@FreeBSD.ORG
Subject: Re: MPP and new processor designs.

On Wed, 25 Jul 2001, Christopher R. Bowman wrote:

> "Leo Bicknell" wrote:
> >
> > A number of new chips have been released lately, along with some
> > enhancements to existing processors, that all fall into the same
> > logic of parallelizing some operations.  Why, just today I ran
> > across an article (http://www.theregister.co.uk/content/3/20576.html)
> > about a chip which boasts 128 ALUs on a single chip.
> >
> > This got me to thinking about an interesting way of using these
> > chips.  Rather than letting the hardware parallelize instructions
> > from a single stream, what about feeding it multiple streams of
> > instructions?  That is, treat it like multiple CPUs running two
> > (or more) processes at once.
> >
> > I'm sure the hardware isn't quite designed for this at the moment,
> > so it couldn't "just be done", but if you had, say, 128 ALUs,
> > most single-user systems could dedicate one ALU to a process and
> > never context switch in the traditional sense.  For systems that
> > run lots of processes, the rate limiting on a single process
> > wouldn't be a big issue, and you could gain a lot of efficiency
> > globally by not context switching in the traditional sense.
> >
> > Does anyone know of something like this being tried?  Traditional
> > 2-8 way SMP systems probably don't have enough processors (I'm
> > thinking 64 is the minimum to make this interesting) and require
> > other glue to make multiple independent processors work together.
> > Has anyone tried this with them all in one package, all clocked
> > together, etc.?

To answer the question, "Does anyone know of something like this being
tried?" -- yes, all of this has been done before.  There were some
"data flow" features in one of the IBM supercomputers of the 1960's,
and the Denelcor HEP of the 1980's sounds similar to what is described
above, although without more detail I can't be sure.

Anyone interested in parallel architectures might want to read the
archives of the Usenet newsgroup comp.arch, which carried news of, and
debates about, these and many other architectural ideas.  IEEE Micro
also carried many good articles on architectural issues during the
1990's.

To summarize 40 years of debate about these issues [ ;-) ;-) ;-) ],
a few widely accepted comp.arch principles would be:

- Other technologies will play minor roles until CMOS runs out of gas,
  which should happen any day now ;-) (a prediction that has been made
  continuously for the last decade).
- Since the mid-70's (that is, 25 years now), logic/gates/real estate
  are no longer (economically) scarce.

- Therefore, the key to the value/efficiency of any computer
  architecture is how well it uses memory.

- There are two key components of memory hierarchy performance:
  latency and bandwidth.

- Different applications have different requirements w.r.t. latency
  and bandwidth.  Some require the fastest possible effective latency
  ("traditional" jobs); some can benefit from greatly increased
  bandwidth at the expense of increased latency (traditional
  supercomputer jobs, including large numerical simulations, image
  processing, and other "vectorizable" jobs); and some jobs are
  amenable to large numbers of threads working on the parts of a
  decomposed problem ("parallelizable" jobs).

The short answer is that a tremendous amount of time and energy has
gone into various approaches to these problems; many books have been
written, many papers published, and a large knowledge base has been
built up over the years.

> As I work for the above-mentioned processor company, I thought I
> might jump in here rather quickly and dispel any notion that you
> will be running any type of Linux or Unix on these processors any
> time soon.
>
> This chip is a reconfigurable dataflow architecture with support for
> control flow.  You really need to think about this chip in the
> dataflow paradigm.  In addition, you have to examine what the
> reporter said.  While it is true that there are 128 ALUs on the chip
> and that it can perform in the neighborhood of 15 BOPS, these are
> only ALUs; they are not full processors.  They don't run a program
> as you would on a typical von Neumann processor.  The ALUs don't
> even have a program counter (not to mention MMUs).  Instead, to
> program one of these chips you tell each ALU what function to
> perform and tell the ALUs how their input and output ports are
> connected.  Then you sit back and watch as the data streams through
> in a pipelined fashion.  Because all the ALUs operate in parallel,
> you can get some spectacular operations/second counts even at low
> frequencies.  Think of it: even at only 100 MHz, 100 ALUs operating
> in parallel give you 10 billion operations per second.
>
> Finally, I do think that perhaps we have hit the point of
> diminishing returns with the current complexity of processors.  Part
> of the Hennessy/Patterson approach to architecture that led to RISC
> was not reduction of instruction sets because that is good as a goal
> in its own right, but rather a reduction of complexity as an
> engineering design goal, since this leads to faster product design
> cycles, which allows you to more aggressively target and take
> advantage of improving process technology.  I think the time may
> come when we want to dump huge caches and multiway superscalar
> processing, since they take up lots of die space and pay diminishing
> returns.  Perhaps in the future we would be better off with 20 or 50
> simple first-generation MIPS-type cores on a chip.  In a large
> multi-user system with a high availability of jobs you might be able
> to leverage the design of the single core into truly high aggregate
> performance.  It would, of course, do nothing for the single-user
> workstation where you are only surfing or word processing, but in a
> large commercial setting with lots of independent jobs you might see
> better utilization of all that silicon by running more processes
> slower.
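To make the dataflow picture above a bit more concrete, here is a toy
C sketch of the idea -- my own illustration only, not the actual
programming model of the chip Christopher describes.  Each "ALU" is
configured with one fixed function, the ALUs are "wired" output to
input, and data is then streamed through the resulting pipeline; the
function names and wiring are invented for the example.

    /*
     * Toy software model of a dataflow pipeline: each "ALU" applies
     * one fixed operation, and its output feeds the next ALU's input.
     * Purely illustrative; the functions and wiring are made up.
     */
    #include <stdio.h>

    #define NALU 4                  /* a real part might have ~128 */

    typedef int (*alu_fn)(int);

    static int add3(int x) { return (x + 3); }
    static int mul2(int x) { return (x * 2); }
    static int sub1(int x) { return (x - 1); }
    static int xor5(int x) { return (x ^ 5); }

    int
    main(void)
    {
            /*
             * "Configure" the chip: assign each ALU a function; the
             * order of the array is the wiring (ALU i feeds ALU i+1).
             */
            alu_fn alu[NALU] = { add3, mul2, sub1, xor5 };
            int i, v, t;

            /* Stream tokens through the pipeline. */
            for (v = 0; v < 8; v++) {
                    t = v;
                    for (i = 0; i < NALU; i++)
                            t = alu[i](t);
                    printf("in %d -> out %d\n", v, t);
            }

            /*
             * With the pipe full, every ALU does one op per cycle, so
             * peak ops/sec = clock rate * number of ALUs:
             * 100e6 Hz * 100 ALUs = 1e10 (10 billion ops/sec).
             */
            printf("peak = %.0f ops/sec\n", 100e6 * 100);
            return (0);
    }

Once such a pipeline is full, every ALU retires one operation per
cycle, which is where the "100 MHz x 100 ALUs = 10 billion
operations/second" arithmetic comes from.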
In my view (this is 100% personal opinion), there may be little to be
gained by optimizing the logic portion of the CPU beyond what has
already been done.  In general, I think you start with a simple
architecture, such as the MIPS mentioned above, optimize it with
superscalar features (as has been done), and use all the remaining
chip real estate for L2 and L3 cache -- you can't have too much L3
cache.  Then, once you go off-chip, the next step is to replicate the
whole thing up to the single shared-memory system limit (which some
reckon to be somewhere between 64 and 1024 processors), and then to
start clustering such systems together.

In a *BSD Hackers context, that means supporting:

- efficient thread scheduling for lots of threads per process -- for
  example, 64, each running on a different processor of a 64+
  processor system [ rfork(2), rfork_thread(3); a minimal sketch
  appears in the postscript below ]

- efficient, scalable SMP support up to the processor hardware
  communication limit, which could be O(2^10) processors:

  http://www.sgi.com/newsroom/3rd_party/071901_nasa.html

At the same time, whatever the economic single-system size limit turns
out to be -- and whether that limit comes from contention for shared
data structures in the OS or from shared-memory hardware limits --
systems will need to be clustered at some point.  Lots of clustering
work has gone on:

  http://www.beowulf.org/
  http://stonesoup.esd.ornl.gov/
  http://www.scientificamerican.com/2001/0801issue/0801hargrove.html
  http://www.globus.org/

Clusters of 512 systems have already been built, with 1250 on the
drawing board.  So, potentially, one could have 2^20+ CPUs in a single
cluster.  Then, with "Grid" software, such a cluster could be
connected to other clusters over the net with certain common services:

  http://www.globus.org/

For some reason, much more work on SMP, cluster software, and Grid
software seems to have been done on Linux (and on many commercial
operating systems) than on BSD-based systems -- I'm not sure why.

--
 Hugh LaMaster, M/S 233-21,      Email: lamaster@nas.nasa.gov
 NASA Ames Research Center       Or:    lamaster@nren.nasa.gov
 Moffett Field, CA 94035-1000    Or:    lamaster@kinkajou.arc.nasa.gov
 Phone: 650/604-1056             Disc:  Unofficial, personal *opinion*.
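P.S.  For anyone who wants to experiment with the rfork(2) /
rfork_thread(3) approach mentioned above, here is a minimal,
illustrative sketch of my own (not a recommended threads
implementation): it creates one extra kernel-visible thread of
execution that shares the parent's address space via RFMEM and runs
on its own stack.  Error handling and synchronization are deliberately
minimal.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define STACKSIZE (64 * 1024)

    static volatile int counter;    /* visible to both, via RFMEM */

    static int
    thread_main(void *arg)
    {
            int id = *(int *)arg;

            counter += id;          /* touches the shared address space */
            printf("thread %d done, counter now %d\n", id, counter);
            return (0);             /* the new thread exits with this status */
    }

    int
    main(void)
    {
            char *stack;
            int id = 1;
            pid_t pid;

            if ((stack = malloc(STACKSIZE)) == NULL)
                    err(1, "malloc");

            /* The stack grows down, so hand the new thread its top. */
            pid = rfork_thread(RFPROC | RFMEM, stack + STACKSIZE,
                thread_main, &id);
            if (pid == -1)
                    err(1, "rfork_thread");

            if (waitpid(pid, NULL, 0) == -1)
                    err(1, "waitpid");
            printf("parent sees counter = %d\n", counter);
            return (0);
    }

Scheduling 64 of these efficiently across 64 processors is, of course,
exactly the part the kernel has to get right.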