From owner-freebsd-smp  Sat Jun 22 14:33:15 1996
Return-Path: owner-smp
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3)
	id OAA10884 for smp-outgoing; Sat, 22 Jun 1996 14:33:15 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
	by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id OAA10865
	for ; Sat, 22 Jun 1996 14:33:10 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
	id OAA22764; Sat, 22 Jun 1996 14:27:35 -0700
From: Terry Lambert
Message-Id: <199606222127.OAA22764@phaeton.artisoft.com>
Subject: Re: SMP version?
To: jed@webstart.com (James E. [Jed] Donnelley)
Date: Sat, 22 Jun 1996 14:27:35 -0700 (MST)
Cc: rminnich@sarnoff.com, thomaspf@microsoft.com, davidg@Root.COM,
	smp@freebsd.org, jed@llnl.gov, mail@ppgsoft.com
In-Reply-To: <199606220714.AAA27881@aimnet.com> from "James E. [Jed] Donnelley" at Jun 22, 96 00:14:28 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

This is *VERY* exciting stuff!

> As you can learn in a bit greater detail below, we at LLNL are
> considering using an open version of Unix (e.g. FreeBSD or Linux)
> for an SMP running on multiple Intel processors connected
> via Scalable Coherent Interface:
>
> http://www.cmpcmm.com/cc/standards.html#SCI

Ah.  I didn't realize that LAMP used SCI (I've heard of LAMP).

It looks like what you are talking about is cache-miss-driven page
fetching over the transport... otherwise known as a distributed cache
coherency system.  8-).  The difference is that the memory sharing is
implemented by cache miss instead of by explicit reference, right, so
it's transparent?

> I was referred to your work as above.  I did look at your
> Web page as noted.  The focus there seems to be on message passing
> (e.g. MPI?).  Did I read that incorrectly?  We are currently
> focusing on shared memory.  As you will read below, our
> project will only be worthwhile if we can run a multiprocessing
> application with multiple processors sharing a common memory
> image (different register sets - e.g. as the Unicos model).
> We are not interested in pursuing "virtual" shared memory
> at this time (though I would be interested to hear of any
> work you have done in this area - particularly performance
> studies).

The Sarnoff work is in cluster computing; this is, indeed, different,
since it implies some scheduling and other asymmetry which (from the
WWW reference I could find) SCI would not have.

There are two implementations of DSM (distributed shared memory, for
the list archive readers) for FreeBSD.  Probably the best known is the
modified NFS with distributed cache coherency: a miss from the vnode
pager on the remote NFS-mounted vnode causes a page replacement via
the net.  This is a lot less ambitious than an SCI implementation, and
probably performs at a lower level -- though not significantly lower,
since you can argue transport latency.

> We are trying to determine how much work it will be to get
> "there" from "here" using SCI (over our in-house developed
> optical network).  I have previously developed such an
> operating system from scratch, but would naturally hope
> to be able to get such a system running from a FreeBSD
> or Linux base with much (!) less effort.  Any thoughts
> from your experience that you would be willing to share
> would be greatly appreciated.
I guess I'm still a little confused where "there" ends up being... are
you interested in providing SCI interconnect between SMP boxes, or are
you interested in SCI interconnect of uniprocessor systems in order to
*build* SMP boxes... or are you trying to build *large* SMP boxes from
multiple small SMP boxes, etc.?  Arguably, from the descriptions of
SCI, it looks like you could build a large scale distributed dataflow
architecture... is this your intent, or are you working on LAMP, etc.?

I think FreeBSD would be a good choice here for a number of reasons,
since all of these are possible directions from the existing code
base.

Actually, someone (probably John Dyson) needs to write up a VM
architecture description; here are some high points, however:

o	Unified VM/buffer cache

	Lack of cache unification on a system would be, I think, a
	primary obstacle to implementing SCI coherency.  You would
	need to implement local coherency as well, so that a buffer
	page miss did the right thing.  One of the biggest benefits
	is the avoidance of a bmap() for each kernel reference of
	user pages.

o	Memory pages are referenced from files by vnode/offset

	This reference model has advantages for cache-based
	distributed reference; the SCI interconnect could conceivably
	be implemented as a file system layer using the vnode pager.
	This would not be the most efficient implementation, but it
	would be an easy-to-approach prototype interface to let you
	hit the ground running.

	In addition, though the vnode/offset mapping model has a
	number of drawbacks relative to premature page discarding
	(which are solvable, given some work on /sys/kern/vfs_subr.c
	to kill vclean), it would be relatively easy to add Sun-style
	VOP_GETPAGE/VOP_PUTPAGE operations to the FS for
	reference-based cache miss detection (based on the SCI
	transport indication of a stale page); a rough sketch of what
	such entry points might look like follows below.

FreeBSD uses a modified zone allocation policy for kernel memory
allocation.  Each call to the kernel "malloc" routine takes a zone
designator, similar to that used by the Mach VM system.  The zone
allocation takes place in what are, effectively, SLAB page-based
allocations (using kmem_malloc).  It isn't a real full SLAB allocation
because of bitmap embedding, but it's close enough that conversion
would be pretty simple.

The use of a zone-based SLAB allocator is actually a significant win
over a standard SLAB allocator, because object-type memory persistence
is relatively equivalent anywhere in a zone.  It could be improved by
providing allocator persistence hints, or by segmenting the address
space based on, for instance, a one byte segment identification
decode, or simple short/medium/long tagging; but as it is, the zoning
provides significant protection from kernel memory fragmentation on
non-page boundaries (which you might see with a standard SLAB
allocator, such as those used by Solaris and SVR4).  FreeBSD,
admittedly, could use some work on SLAB management, but that's
trivial code, on the order of hash management (ie: transcription of
Knuth into code, like everyone else does it).

In addition, the separation into zones allows you to flag the zone
identifiers (which has not been done in the current code) to determine
whether the allocated resource is local to a processor or should be
allocated globally.  This is a potentially significant win for
scalability.
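To make that last point concrete, here is a minimal sketch of what a
locality flag on a zone might look like; none of these names exist in
the current code, they just illustrate the idea of steering a zone's
allocations to a per-processor arena instead of the global one:

	#include <stddef.h>

	#define ZONE_CPULOCAL	0x01	/* prefer pages local to this CPU */

	struct kzone {
		const char	*kz_name;	/* e.g. "mbuf", "vnode"      */
		size_t		 kz_size;	/* fixed object size         */
		int		 kz_flags;	/* ZONE_CPULOCAL or 0        */
	};

	/* Stand-ins for the page-backed arenas (kmem_malloc() underneath). */
	void	*arena_alloc_global(size_t size);
	void	*arena_alloc_percpu(int cpu, size_t size);
	int	 cur_cpu(void);

	/*
	 * Zone-aware allocation: objects of one type always come from
	 * the same band of pages, so fragmentation stays on page
	 * boundaries, and a CPU-local zone never touches the global
	 * arena (or its lock) on the fast path.
	 */
	void *
	zone_alloc(struct kzone *kz)
	{
		if (kz->kz_flags & ZONE_CPULOCAL)
			return (arena_alloc_percpu(cur_cpu(), kz->kz_size));
		return (arena_alloc_global(kz->kz_size));
	}

The win is all in the policy: the allocation interface doesn't have to
change, only which arena ends up backing a given zone.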
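And going back to the vnode/offset point above, here is the kind of
Sun-style per-FS paging entry point I mean; the vop name and the
sci_*() calls are purely illustrative stand-ins (nothing like them is
in the tree), they just show where the SCI "stale page" indication
would plug into a cache miss:

	struct vnode;
	struct vm_page;

	/*
	 * Hypothetical transport hooks: "is our copy of (vnode, offset)
	 * still the owner's copy, and if not, refetch it over the
	 * interconnect".
	 */
	int	sci_page_is_stale(struct vnode *vp, long offset);
	int	sci_fetch_page(struct vnode *vp, long offset, struct vm_page *m);
	int	local_getpage(struct vnode *vp, long offset, struct vm_page *m);

	int
	sci_vop_getpage(struct vnode *vp, long offset, struct vm_page *m)
	{
		/* Still coherent: satisfy the fault from the local cache. */
		if (!sci_page_is_stale(vp, offset))
			return (local_getpage(vp, offset, m));

		/* Coherency miss: the fault becomes a transport fetch. */
		return (sci_fetch_page(vp, offset, m));
	}

A VOP_PUTPAGE twin would do the reverse on writeback, pushing the
modified page (or an invalidate) back out over the transport.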
The loss in scalability of Intel processors which led to the 5-bit
APIC ID limitation was the standard "diminishing returns" argument
about bus contention; however, Sequent was able to overcome this
limitation with a clever design, which I don't think gets sufficient
credit for the MP case in the Vahalia book.

What Sequent did was establish a per processor page pool with high and
low water marks, from which pages are preferentially taken for a
processor's page allocation requests.  The page pool is refilled at
the low water mark, or emptied at the high water mark.  This page pool
banding means that THE PROCESSOR DOES NOT NEED TO HOLD THE GLOBAL
MUTEX TO GET PAGES.  This allowed Sequent to hit the full 32 processor
APIC limit without significantly damaging their scalability at the
traditionally predicted 8 processor limit.  (There's a rough sketch of
the idea in the P.S. at the end of this message.)

Finally, there is work under way (by John Dyson) to support shared
process address space; this is similar to the Unicos model, which you
reference -- though, obviously, you would need to deal with the hard
page table entries on multiple processors to trigger the SCI-based
page-level cache coherency.  This started with a Sequent-style "sfork"
implementation.  John is in possession of some kernel threading code
(from another engineer) which operates on a partial sharing model,
which he is converting to a full sharing model: he said that he thinks
our cost per thread will be the cost of a process in the kernel (proc,
upages, minor, etc.), saving the per process page table pages using VM
space sharing.

So I think no matter what direction you are actually going in, FreeBSD
is pretty much poised to help you out.

(John, David, Poul, folks -- correct me if I've mangled something)


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
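P.S.: since the page pool banding is the key trick, here is a rough
sketch of how it might look; the names and the water mark numbers are
mine for illustration only, not Sequent's code and not anything
currently in FreeBSD:

	#include <stddef.h>

	#define POOL_LOW	32	/* refill the band below this     */
	#define POOL_HIGH	128	/* drain the band above this      */

	struct vm_page;

	struct cpu_page_pool {
		struct vm_page	*pp_free;	/* private per-CPU free list */
		int		 pp_count;	/* pages banded here         */
	};

	/*
	 * Stand-ins: the global grab takes the global free-list mutex
	 * internally; pool_take_one() just unlinks a page from the
	 * local band.
	 */
	int		 global_pages_grab(struct cpu_page_pool *pool, int npages);
	struct vm_page	*pool_take_one(struct cpu_page_pool *pool);

	struct vm_page *
	cpu_page_alloc(struct cpu_page_pool *pool)
	{
		/* Slow path: the only place the global mutex is touched. */
		if (pool->pp_count <= POOL_LOW)
			pool->pp_count +=
			    global_pages_grab(pool, POOL_HIGH - pool->pp_count);

		if (pool->pp_count == 0)
			return (NULL);	/* the system really is out of pages */

		/* Fast path: satisfied from the local band, no mutex. */
		pool->pp_count--;
		return (pool_take_one(pool));
	}

The free side would be symmetric: pages go back onto the local band,
and the band gives pages back to the global list (again under the
mutex) only when it climbs past the high water mark.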