Date:      Tue, 4 Jun 1996 17:01:14 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        phk@freebsd.org (Poul-Henning Kamp)
Cc:        sef@kithrup.com, smp@freebsd.org
Subject:   Re: Unix/NT synchronization model (was: SMP progress?)
Message-ID:  <199606050001.RAA27740@phaeton.artisoft.com>
In-Reply-To: <1926.833927708@critter.tfs.com> from "Poul-Henning Kamp" at Jun 4, 96 03:35:08 pm

> >Okay, so that's an extremely short-ranged goal ;).  But I don't expect true
> >symmetric MP to be happening for quite some time yet -- there's just too
> >much that would have to be changed.  (Locks around nearly every structure
> >reference in the kernel, for example.)
> 
> And it may not actually >improve< performance to do so...

Actually, I disagree with this.  I think that the higher the grain of
the parallelism, the better the scalability.  As far as UP performance
is concerned, the synchronization points should be macrotized, but
I'd expect them to go down to spinlocks, at most, in the UP case,
not compile out altogether.  The reason I say this is that there
are significant performance wins to kernel multithreading.

When we (Novell/USG) implemented kernel multithreading and UFS
reentrancy on top of kernel multithreading, there was a 160%
performance increase in the UP case, even with the addition of
heavy-weight (cache-flushing for interprocessor synchronization)
mutexes instead of semaphores or spinlocks.  In other words, the
multithreading *more* than paid for the high-grained synchronization
overhead.


I would like to see a modified version of a combined Sequent/SVR4
model for high grain parallelism.  The SVR4/Solaris model is the
model described in "UNIX for Modern Architectures" whenever a
specific implementation is referenced.

The problem with the vanilla SVR4/Solaris model is the use of the
SLAB allocator.  Vahalia loves this allocator ("UNIX Internals:
The New Frontiers"), but it is adequate for scaling only to the
point of bus-loading with synchronization traffic.  A SLAB allocator
is a modified zone allocator (like that used in MACH) that
preallocates pages into zones so that like objects are stored in
like areas... if you use a sufficient granularity for your zones,
then this resolves the high/medium/low persistence object issues
that a simple zone allocator can't address (i.e., minimizing kernel
heap fragmentation -- a problem all of the BSD implementations
currently suffer under -- Linux, too, for that matter).
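
To make the zone idea concrete, here is a rough userland sketch of the
in-place freelist trick a zone allocator uses; the names, and malloc()
standing in for the kernel page allocator, are purely illustrative and
not from any real source tree:

#include <stdlib.h>

#define ZONE_PAGE_SIZE	4096

struct zone {
	size_t	obj_size;	/* fixed object size; must be >= sizeof(void *) */
	void	*freelist;	/* free objects, linked through their own storage */
};

/* Carve a fresh page into objects and push them onto the freelist.
 * malloc() stands in for the kernel page allocator in this sketch. */
static void
zone_grow(struct zone *z)
{
	char *page = malloc(ZONE_PAGE_SIZE);
	size_t i, n = ZONE_PAGE_SIZE / z->obj_size;

	if (page == NULL)
		return;
	for (i = 0; i < n; i++) {
		void *obj = page + i * z->obj_size;
		*(void **)obj = z->freelist;
		z->freelist = obj;
	}
}

static void *
zone_alloc(struct zone *z)
{
	void *obj;

	if (z->freelist == NULL)
		zone_grow(z);
	if (z->freelist == NULL)
		return (NULL);
	obj = z->freelist;
	z->freelist = *(void **)obj;	/* pop the head of the freelist */
	return (obj);
}

static void
zone_free(struct zone *z, void *obj)
{
	*(void **)obj = z->freelist;	/* push back; no per-object header needed */
	z->freelist = obj;
}

Because every object on a page is the same size and type, freeing and
reallocating never splinters the kernel heap the way a general-purpose
allocator does; the slab refinement adds per-type zones on top of this
so frequently reused objects stay grouped together.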

The Sequent model, by contrast, uses a per-processor page pool and
pulls allocations from those pools (using zone allocation... yes, I
agree that this is not efficient, but that's not relevant to the benefits
of a per-processor page pool).  The win on a per-processor page pool
is that a hierarchical lock management scheme allows allocation on
a per-processor basis without having to grab the global mutex (and
therefore having to synchronize cache contents for the mutex data
area between processors).
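
Very roughly, the per-processor pool path looks something like this
(ppool, NCPU, the lock calls, and so on are names made up for the
sketch, not an existing API):

#define NCPU		8	/* illustrative */

struct page {
	struct page	*next;
};

struct ppool {
	struct page	*pages;		/* CPU-private free pages (cacheable) */
	int		count;
	int		low_water;	/* refill threshold */
};

extern struct ppool	ppool[NCPU];
extern void		global_page_lock(void);		/* system-wide mutex */
extern void		global_page_unlock(void);
extern void		ppool_refill(struct ppool *);	/* pulls pages off the
							   global free list */

static struct page *
ppool_getpage(int cpu)
{
	struct ppool *p = &ppool[cpu];
	struct page *pg;

	if (p->count <= p->low_water) {
		global_page_lock();	/* rare: the only cross-CPU contention */
		ppool_refill(p);
		global_page_unlock();
	}
	pg = p->pages;			/* common case: CPU-local, no locking */
	p->pages = pg->next;
	p->count--;
	return (pg);
}

The point is that in the steady state the only memory a processor
touches for an allocation is its own pool, so the other processors'
caches are never disturbed.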

For mutex pages, this means that there is significant benefit to simply
marking the pages as non-cacheable, since the reload frequency is
dropped.  The downside is that this requires a hierarchical lock
manager which implements intention modes: you need to implement
six modes (R W X IR IW IX) instead of three (R W X).  The benefit
here is that it is now possible to lock IX the per-processor mutex,
implemented as a counting semaphore in a non-cached page, and compute
deadlock without reference to other processors.  The internal nodes
under the per-processor lock node can be in processor-local (cacheable)
pages, so locality is not sacrificed (as it would be in the SVR4/Solaris
model by the need to flush *all* the pages from the cache, or to mark
them non-cacheable).
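
As a sketch, the compatibility table for the six modes would look
something like the following; this is just the conventional
intention-locking table (assuming R is shared, and W and X are both
exclusive at their own level), not code lifted from anywhere:

enum lkmode { LK_IR, LK_IW, LK_IX, LK_R, LK_W, LK_X };

/*
 * compat[held][wanted]: 1 if a new request in mode 'wanted' can be
 * granted on a node already held in mode 'held'.  The intention modes
 * (IR/IW/IX) are taken on ancestor nodes -- e.g. IX on the per-processor
 * node -- so deadlock detection never has to look outside that subtree.
 */
static const int compat[6][6] = {
	/*            IR IW IX  R  W  X */
	/* IR */    {  1, 1, 1, 1, 0, 0 },
	/* IW */    {  1, 1, 1, 0, 0, 0 },
	/* IX */    {  1, 1, 1, 0, 0, 0 },
	/* R  */    {  1, 0, 0, 1, 0, 0 },
	/* W  */    {  0, 0, 0, 0, 0, 0 },
	/* X  */    {  0, 0, 0, 0, 0, 0 },
};

static int
lock_compatible(enum lkmode held, enum lkmode wanted)
{
	return (compat[held][wanted]);
}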

The whole point of this exercise by Sequent was to be able to scale
to more than the default 8 processors (I believe the APIC ID is a
4 bit value -- placing an upper limit of 16 processors on Intel MP
spec compliant MP implementations), which is where the SVR4/Solaris
cost/benefit ratio breaks down because of the immense amount of
bus activity needed to perform interprocessor synchronization through
a global VM lock.

The Sequent model fails because they are medium-granularity: in point
of fact, the FS was not reentrant; this can be proven by starting
multiple finds, which will run on different processors, and run separately
and sequentially to completion because of the non-reentrancy of the
FS on a per-processor basis.  This is utterly bogus (the 160% number
from SVR4 should prove that).  The Sequent model also fails because
it does not allocate zones in terms of slabs.

In terms of practical effect, we will have a global lock for the system,
a non-cacheable sub-hierarchy for system-wide resources which can be
allocated to a given processor (no contention unless two processors
try to refill their pagepools at the same time, etc.), and a cacheable
sub-hierarchy per processor.
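
In data-structure terms, the layering might be sketched like this;
only the three levels are implied by the description above, the names
and fields are invented:

struct lock_node {
	int			mode;		/* R/W/X or IR/IW/IX, as above */
	struct lock_node	*parent;	/* toward the global root */
};

struct cpu_hierarchy {
	struct lock_node	intent;		/* per-CPU node in a non-cacheable
						   page; locked IX before touching
						   anything in the subtree */
	struct lock_node	*subtree;	/* per-CPU resources, in cacheable,
						   processor-local pages */
};

extern struct lock_node		global_root;		/* system-wide lock */
extern struct cpu_hierarchy	cpu_hier[8];		/* one subtree per CPU */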

If a process references a resource owned by a processor other than the
one it is on, it's no problem, unless it is an IPC area (which must
be non-cacheable or support MESI, not MEI, updating in the MMU hardware).
If, on the other hand, it tries to discard the resource, then the
discard must either go through the processor that the resource was
allocated from, or there must be a synchronization update to tell
the owning processor that the reference has been discarded by another
processor (unlikely, assuming a scheduler implementation for processor
affinity).

In any case, the page allocation map can preferentially be updated while
the global mutex is held to fill the per processor page pool (or to
empty it, if it hits the high-water mark), so a processor can know,
on a given discard, who owns the page where the allocation exists.
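
A sketch of that lookup: the owner map is written only with the global
mutex held (at pool refill/drain time), so any processor can read it
without locking when it needs to know where a discard should go.  The
names here are made up:

#define PAGE_SHIFT	12		/* 4k pages */

extern signed char	page_owner[];	/* owning CPU per page, -1 if none;
					   written only under the global mutex */

/* Called while refilling (or draining) a per-processor pool. */
static void
page_set_owner(unsigned long pfn, int cpu)
{
	page_owner[pfn] = (signed char)cpu;
}

/* Lock-free read: map an allocation back to the CPU that owns its page. */
static int
page_find_owner(void *obj, unsigned long kva_base)
{
	unsigned long pfn = ((unsigned long)obj - kva_base) >> PAGE_SHIFT;

	return (page_owner[pfn]);
}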

Obviously, since the allocation is to the process (/kernel thread), only
the valid owner can discard it.  What is mutexed is the allocation bitmap
for the slab, and that can be done via messaging, to be run in the idle
loop of the processor... processor A releasing an allocation for a
process that was allocated for the process by processor B need not
cause a synchronization event for any other processor in the system
(this implies an IPC mechanism using non-cacheable pages, one page
per processor per processor, and a third lock subhierarchy in a
non-cached page, to be used when processors A and B are contending
for the IPC area -- only in the idle loop: total 260k for 8 processors:
8 * 8 * 4k + 4k -- 1M for 16 processors, 4M for 32 processors).  If
we have to worry about more than that, we can establish communications
clusters... would that we had someone beating down our door with 64
or more processor hardware.  8-).
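
Here's roughly what that messaging path could look like: one
non-cacheable mailbox per (sender, owner) processor pair, drained from
the idle loop.  The layout and names are invented for the sketch, and
a real version would need whatever memory barriers the target
processor requires around the head/tail updates:

#define NCPU		8
#define MBOX_SLOTS	512	/* roughly one 4k page per (sender, owner) pair */

struct free_mbox {			/* lives in a non-cacheable page */
	volatile unsigned int	head;	/* written only by the sending CPU */
	volatile unsigned int	tail;	/* written only by the owning CPU */
	void * volatile		obj[MBOX_SLOTS];
};

extern struct free_mbox	mbox[NCPU][NCPU];	/* [sender][owner] */
extern void		local_slab_free(int cpu, void *obj);	/* updates the
					   owner's (cacheable) slab bitmap */

/* Sender side: CPU 'me' defers a free to the owning CPU. */
static int
remote_free(int me, int owner, void *obj)
{
	struct free_mbox *m = &mbox[me][owner];
	unsigned int h = m->head;

	if (h - m->tail == MBOX_SLOTS)
		return (0);		/* mailbox full; fall back to a locked path */
	m->obj[h % MBOX_SLOTS] = obj;
	m->head = h + 1;
	return (1);
}

/* Owner side: called from the idle loop; no other processor is disturbed. */
static void
drain_remote_frees(int me)
{
	int from;

	for (from = 0; from < NCPU; from++) {
		struct free_mbox *m = &mbox[from][me];

		while (m->tail != m->head) {
			local_slab_free(me, m->obj[m->tail % MBOX_SLOTS]);
			m->tail++;
		}
	}
}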

I think it is ultimately unreasonable to place a granularity limit, and
therefore a 4-8 processor limit, on the architecture.  I also think
that any granularity limit inherently increases bus contention, and
even with Active Server technology coming on line quickly, the limit
in the end is still going to be I/O binding, not compute binding.
Anything we can do to reduce bus contention *must* be a win.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


