From owner-freebsd-smp  Sat Jun 22 14:33:15 1996
Return-Path: owner-smp
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3)
	id OAA10884 for smp-outgoing; Sat, 22 Jun 1996 14:33:15 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
	by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id OAA10865
	for ; Sat, 22 Jun 1996 14:33:10 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
	id OAA22764; Sat, 22 Jun 1996 14:27:35 -0700
From: Terry Lambert
Message-Id: <199606222127.OAA22764@phaeton.artisoft.com>
Subject: Re: SMP version?
To: jed@webstart.com (James E. [Jed] Donnelley)
Date: Sat, 22 Jun 1996 14:27:35 -0700 (MST)
Cc: rminnich@sarnoff.com, thomaspf@microsoft.com, davidg@Root.COM,
	smp@freebsd.org, jed@llnl.gov, mail@ppgsoft.com
In-Reply-To: <199606220714.AAA27881@aimnet.com> from "James E. [Jed] Donnelley" at Jun 22, 96 00:14:28 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

This is *VERY* exciting stuff!

> As you can learn in a bit greater detail below, we at LLNL are
> considering using an open version of Unix (e.g. FreeBSD or Linux)
> for an SMP running on multiple Intel processors connected
> via Scalable Coherent Interface:
>
> http://www.cmpcmm.com/cc/standards.html#SCI

Ah.  I didn't realize that LAMP used SCI (I've heard of LAMP).

It looks like what you are talking about is cache-miss-driven page
fetching over the transport... otherwise known as a distributed cache
coherency system.  8-).  The difference is that the memory sharing is
implemented by cache miss instead of by explicit reference, right, so
it's transparent?

> I was referred to your work as above.  I did look at your
> Web page as noted.  The focus there seems to be on message passing
> (e.g. MPI?).  Did I read that incorrectly?  We are currently
> focusing on shared memory.  As you will read below, our
> project will only be worthwhile if we can run a multiprocessing
> application with multiple processors sharing a common memory
> image (different register sets - e.g. as the Unicos model).
> We are not interested in pursuing "virtual" shared memory
> at this time (though I would be interested to hear of any
> work you have done in this area - particularly performance
> studies).

The Sarnoff work is in cluster computing; this is, indeed, different,
since it implies some scheduling and other asymmetry which (from the
WWW reference I could find) SCI would not have.

There are two implementations of DSM (distributed shared memory, for
the list archive readers) for FreeBSD.  Probably the best known is the
modified NFS with distributed cache coherency: a miss from the vnode
pager on the remote NFS-mounted vnode causes a page replacement via
the net.  This is a lot less ambitious than an SCI implementation, and
probably performs at a lower level -- though not significantly lower,
since you can argue transport latency.

> We are trying to determine how much work it will be to get
> "there" from "here" using SCI (over our in-house developed
> optical network).  I have previously developed such an
> operating system from scratch, but would naturally hope
> to be able to get such a system running from a FreeBSD
> or Linux base with much (!) less effort.  Any thoughts
> from your experience that you would be willing to share
> would be greatly appreciated.
I guess I'm still a little confused where "there" ends up being... are
you interested in providing SCI interconnect between SMP boxes, or are
you interested in SCI interconnect of uniprocessor systems in order to
*build* SMP boxes... or are you trying to build *large* SMP boxes from
multiple small SMP boxes, etc.?  Arguably, from the descriptions of
SCI, it looks like you could build a large scale distributed dataflow
architecture... is this your intent, or are you working on LAMP, etc.?

I think FreeBSD would be a good choice here for a number of reasons,
since all of these are possible directions from the existing code
base.

Actually, someone (probably John Dyson) needs to write up a VM
architecture description; here are some high points, however:

o	Unified VM/buffer cache

	Lack of cache unification on a system would be, I think, a
	primary obstacle to implementing SCI coherency.  You would
	need to implement local coherency as well, so that a buffer
	page miss did the right thing.  One of the biggest benefits
	is the avoidance of a bmap() for each kernel reference of
	user pages.

o	Memory pages are referenced from files by vnode/offset

	This reference model has advantages for cache-based
	distributed reference; the SCI interconnect could conceivably
	be implemented as a file system layer using the vnode pager.
	This would not be the most efficient implementation, but it
	would be an easy-to-approach prototype interface to let you
	hit the ground running.

	In addition, though the vnode/offset mapping model has a
	number of drawbacks relative to premature page discarding
	(which are solvable, given some work on /sys/kern/vfs_subr.c
	to kill vclean), it would be relatively easy to add Sun-style
	VOP_GETPAGE/VOP_PUTPAGE operations to the FS for
	reference-based cache miss detection (based on the SCI
	transport indication of a stale page); a rough sketch of what
	such entry points might look like follows below.

FreeBSD uses a modified zone allocation policy for kernel memory
allocation.  Each call to the kernel "malloc" routine takes a zone
designator, similar to that used by the Mach VM system.  The zone
allocation takes place in what are, effectively, SLAB page-based
allocations (using kmem_malloc).  It isn't a real full SLAB allocation
because of bitmap embedding, but it's close enough that conversion
would be pretty simple.

The use of a zone-based SLAB allocator is actually a significant win
over a standard SLAB allocator, because object-type memory persistence
is relatively equivalent anywhere in a zone.  It could be improved by
providing allocator persistence hints, or by segmenting the address
space based on, for instance, a one byte segment identification
decode, or simple short/medium/long tagging; but as it is, the zoning
provides significant protection from kernel memory fragmentation on
non-page boundaries (which you might see with a standard SLAB
allocator, such as those used by Solaris and SVR4).  FreeBSD,
admittedly, could use some work on SLAB management, but that's
trivial code, on the order of hash management (ie: transcription of
Knuth into code, like everyone else does it).

In addition, the separation into zones allows you to flag the zone
identifiers (which has not been done in the current code) to determine
whether the allocated resource is local to a processor or should be
allocated globally.  This is a potentially significant win for
scalability.
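To make that last point concrete, here is a minimal sketch of what a
locality flag on a zone might look like; none of these names exist in
the current code, they just illustrate the idea of steering a zone's
allocations to a per-processor arena instead of the global one:

	#include <stddef.h>

	#define ZONE_CPULOCAL	0x01	/* prefer pages local to this CPU */

	struct kzone {
		const char	*kz_name;	/* e.g. "mbuf", "vnode"      */
		size_t		 kz_size;	/* fixed object size         */
		int		 kz_flags;	/* ZONE_CPULOCAL or 0        */
	};

	/* Stand-ins for the page-backed arenas (kmem_malloc() underneath). */
	void	*arena_alloc_global(size_t size);
	void	*arena_alloc_percpu(int cpu, size_t size);
	int	 cur_cpu(void);

	/*
	 * Zone-aware allocation: objects of one type always come from
	 * the same band of pages, so fragmentation stays on page
	 * boundaries, and a CPU-local zone never touches the global
	 * arena (or its lock) on the fast path.
	 */
	void *
	zone_alloc(struct kzone *kz)
	{
		if (kz->kz_flags & ZONE_CPULOCAL)
			return (arena_alloc_percpu(cur_cpu(), kz->kz_size));
		return (arena_alloc_global(kz->kz_size));
	}

The win is all in the policy: the allocation interface doesn't have to
change, only which arena ends up backing a given zone.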
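And going back to the vnode/offset point above, here is the kind of
Sun-style per-FS paging entry point I mean; the vop name and the
sci_*() calls are purely illustrative stand-ins (nothing like them is
in the tree), they just show where the SCI "stale page" indication
would plug into a cache miss:

	struct vnode;
	struct vm_page;

	/*
	 * Hypothetical transport hooks: "is our copy of (vnode, offset)
	 * still the owner's copy, and if not, refetch it over the
	 * interconnect".
	 */
	int	sci_page_is_stale(struct vnode *vp, long offset);
	int	sci_fetch_page(struct vnode *vp, long offset, struct vm_page *m);
	int	local_getpage(struct vnode *vp, long offset, struct vm_page *m);

	int
	sci_vop_getpage(struct vnode *vp, long offset, struct vm_page *m)
	{
		/* Still coherent: satisfy the fault from the local cache. */
		if (!sci_page_is_stale(vp, offset))
			return (local_getpage(vp, offset, m));

		/* Coherency miss: the fault becomes a transport fetch. */
		return (sci_fetch_page(vp, offset, m));
	}

A VOP_PUTPAGE twin would do the reverse on writeback, pushing the
modified page (or an invalidate) back out over the transport.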
The loss in scalability of Intel processors which led to the 5-bit
APIC ID limitation was the standard "diminishing returns" argument
about bus contention; however, Sequent was able to overcome this
limitation with a clever design, which I don't think gets sufficient
credit for the MP case in the Vahalia book.

What Sequent did was establish a per processor page pool with high and
low water marks, from which pages are preferentially taken for a
processor's page allocation requests.  The page pool is refilled at
the low water mark, or emptied at the high water mark.  This page pool
banding means that THE PROCESSOR DOES NOT NEED TO HOLD THE GLOBAL
MUTEX TO GET PAGES.  This allowed Sequent to hit the full 32 processor
APIC limit without significantly damaging their scalability at the
traditionally predicted 8 processor limit.  (There's a rough sketch of
the idea in the P.S. at the end of this message.)

Finally, there is work under way (by John Dyson) to support shared
process address space; this is similar to the Unicos model, which you
reference -- though, obviously, you would need to deal with the hard
page table entries on multiple processors to trigger the SCI-based
page-level cache coherency.  This started with a Sequent-style "sfork"
implementation.  John is in possession of some kernel threading code
(from another engineer) which operates on a partial sharing model,
which he is converting to a full sharing model: he said that he thinks
our cost per thread will be the cost of a process in the kernel (proc,
upages, minor, etc.), saving the per process page table pages using VM
space sharing.

So I think no matter what direction you are actually going in, FreeBSD
is pretty much poised to help you out.

(John, David, Poul, folks -- correct me if I've mangled something)


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
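P.S.: since the page pool banding is the key trick, here is a rough
sketch of how it might look; the names and the water mark numbers are
mine for illustration only, not Sequent's code and not anything
currently in FreeBSD:

	#include <stddef.h>

	#define POOL_LOW	32	/* refill the band below this     */
	#define POOL_HIGH	128	/* drain the band above this      */

	struct vm_page;

	struct cpu_page_pool {
		struct vm_page	*pp_free;	/* private per-CPU free list */
		int		 pp_count;	/* pages banded here         */
	};

	/*
	 * Stand-ins: the global grab takes the global free-list mutex
	 * internally; pool_take_one() just unlinks a page from the
	 * local band.
	 */
	int		 global_pages_grab(struct cpu_page_pool *pool, int npages);
	struct vm_page	*pool_take_one(struct cpu_page_pool *pool);

	struct vm_page *
	cpu_page_alloc(struct cpu_page_pool *pool)
	{
		/* Slow path: the only place the global mutex is touched. */
		if (pool->pp_count <= POOL_LOW)
			pool->pp_count +=
			    global_pages_grab(pool, POOL_HIGH - pool->pp_count);

		if (pool->pp_count == 0)
			return (NULL);	/* the system really is out of pages */

		/* Fast path: satisfied from the local band, no mutex. */
		pool->pp_count--;
		return (pool_take_one(pool));
	}

The free side would be symmetric: pages go back onto the local band,
and the band gives pages back to the global list (again under the
mutex) only when it climbs past the high water mark.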