Date: Tue, 4 Jun 1996 17:01:14 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: phk@freebsd.org (Poul-Henning Kamp)
Cc: sef@kithrup.com, smp@freebsd.org
Subject: Re: Unix/NT synchronization model (was: SMP progress?)
Message-ID: <199606050001.RAA27740@phaeton.artisoft.com>
In-Reply-To: <1926.833927708@critter.tfs.com> from "Poul-Henning Kamp" at Jun 4, 96 03:35:08 pm
> >Okay, so that's an extremely short-ranged goal ;).  But I don't expect
> >true symmetric MP to be happening for quite some time yet -- there's
> >just too much that would have to be changed.  (Locks around nearly
> >every structure reference in the kernel, for example.)
>
> And it may not actually >improve< performance to do so...

Actually, I disagree with this.  I think that the finer the granularity
of the parallelism, the better the scalability.  As far as UP
performance is concerned, the synchronization points should be wrapped
in macros, but I'd expect them to compile down to spinlocks, at most,
in the UP case, not compile out altogether.

The reason I say this is that there are significant performance wins
from kernel multithreading.  When we (Novell/USG) implemented kernel
multithreading, and UFS reentrancy on top of it, there was a 160%
performance increase in the UP case, even with the addition of
heavy-weight mutexes (cache flushing for interprocessor
synchronization) instead of semaphores or spinlocks.  In other words,
the multithreading *more* than paid for the fine-grained
synchronization overhead.

I would like to see a modified version of a combined Sequent/SVR4
model for fine-grained parallelism.  The SVR4/Solaris model is the one
described in "UNIX Systems for Modern Architectures" whenever a
specific implementation is referenced.

The problem with the vanilla SVR4/Solaris model is its use of the SLAB
allocator.  Vahalia loves this allocator ("UNIX Internals: The New
Frontiers"), but it is adequate for scaling only up to the point where
the bus is loaded down with synchronization traffic.  A SLAB allocator
is a modified zone allocator (like that used in Mach) that preallocates
pages into zones so that like objects are stored in like areas... if
you use a sufficient granularity for your zones, this resolves the
high/medium/low persistence object issues that a simple zone allocator
can't address (i.e., minimizing kernel heap fragmentation -- a problem
all of the BSD implementations currently suffer under; Linux, too, for
that matter).

The Sequent model, by contrast, uses a per-processor page pool and
pulls allocations from those pools (using zone allocation... yes, I
agree that this is not efficient, but it's not relevant to the benefits
of a per-processor page pool).  The win with a per-processor page pool
is that a hierarchical lock management scheme allows allocation on a
per-processor basis without having to grab the global mutex (and
therefore without having to synchronize cache contents for the mutex
data area between processors).  For mutex pages, this means there is
significant benefit to simply marking the pages as non-cacheable, since
the reload frequency is reduced.

The downside is that this requires a hierarchical lock manager which
implements intention modes: you need to implement six modes
(R W X IR IW IX) instead of three (R W X).  The benefit is that it is
now possible to take an IX lock on the per-processor mutex (implemented
as a counting semaphore in a non-cached page) and compute deadlock
without reference to other processors.  The internal nodes under the
per-processor lock node can be in processor-local (cacheable) pages, so
locality is not sacrificed (as it would be in the SVR4/Solaris model by
the need to flush *all* the pages from the cache, or to mark them
non-cacheable).
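To make the per-processor page pool idea concrete, here is a minimal
user-level sketch in C.  It is only an illustration: pthread mutexes
stand in for the kernel primitives, and every name and number in it is
made up -- none of this is code from Sequent, SVR4, or any real kernel.
The point is that the common allocation path takes only the local lock,
and the global VM lock (with the interprocessor cache traffic its mutex
data implies) is touched only when a local pool has to be refilled.

	/*
	 * Sketch only: the per-processor lock models a spinlock living in a
	 * cacheable, processor-local page; the global lock models the global
	 * VM mutex living in a non-cacheable page.
	 */
	#include <pthread.h>
	#include <stdlib.h>

	#define NCPU     8
	#define REFILL  16		/* pages pulled per trip to the global pool */

	struct page { struct page *next; };

	struct cpu_pool {
		pthread_mutex_t	 lock;	/* per-processor lock: no bus traffic to take it */
		struct page	*free;	/* private free list */
		int		 count;
	};

	static struct cpu_pool	pool[NCPU];
	static pthread_mutex_t	global_vm_lock = PTHREAD_MUTEX_INITIALIZER;

	void
	pool_init(void)
	{
		int i;

		for (i = 0; i < NCPU; i++)
			pthread_mutex_init(&pool[i].lock, NULL);
	}

	/*
	 * Slow path, called with the per-processor lock held: take the global
	 * lock, pull a batch of pages into the local pool, and (in a real
	 * kernel) update the allocation map so any processor can later tell
	 * which processor owns a given page.
	 */
	static void
	pool_refill(struct cpu_pool *pp)
	{
		int i;

		pthread_mutex_lock(&global_vm_lock);
		for (i = 0; i < REFILL; i++) {
			struct page *pg = malloc(sizeof(*pg));	/* stands in for a real page */

			pg->next = pp->free;
			pp->free = pg;
			pp->count++;
		}
		pthread_mutex_unlock(&global_vm_lock);
	}

	/*
	 * Fast path: only the local lock is taken, so the common case causes
	 * no synchronization traffic with the other processors.
	 */
	struct page *
	page_alloc(int cpu)
	{
		struct cpu_pool *pp = &pool[cpu];
		struct page	*pg;

		pthread_mutex_lock(&pp->lock);
		if (pp->count == 0)
			pool_refill(pp);
		pg = pp->free;
		pp->free = pg->next;
		pp->count--;
		pthread_mutex_unlock(&pp->lock);
		return (pg);
	}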
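The discard side (described further down) can be sketched the same way,
reusing the types above and with the same caveats: when processor A
frees a page that processor B owns, it never touches B's pool directly;
it posts the page to a small per-processor-pair mailbox (modeling the
non-cacheable IPC pages in the sizing arithmetic below) and B reclaims
it from its idle loop, so no third processor sees any synchronization
event.

	/*
	 * Companion sketch, reusing struct page, struct cpu_pool and pool[]
	 * from above.  mbox[owner][sender] models "one page per processor per
	 * processor" of IPC area; a pthread mutex stands in for the lock
	 * sub-hierarchy used when the two processors contend for it.  The
	 * mailbox locks would be initialized alongside pool_init().
	 */
	struct mailbox {
		pthread_mutex_t	 lock;
		struct page	*head;
	};

	static struct mailbox mbox[NCPU][NCPU];

	/* Free from any processor: local frees go straight back to the local
	 * pool; remote frees are queued for the owning processor. */
	void
	page_free(int cpu, int owner, struct page *pg)
	{
		if (owner == cpu) {
			struct cpu_pool *pp = &pool[cpu];

			pthread_mutex_lock(&pp->lock);
			pg->next = pp->free;
			pp->free = pg;
			pp->count++;
			pthread_mutex_unlock(&pp->lock);
			return;
		}
		pthread_mutex_lock(&mbox[owner][cpu].lock);
		pg->next = mbox[owner][cpu].head;
		mbox[owner][cpu].head = pg;
		pthread_mutex_unlock(&mbox[owner][cpu].lock);
	}

	/* Run from the owning processor's idle loop: reclaim whatever the
	 * other processors have handed back since the last pass. */
	void
	mailbox_drain(int cpu)
	{
		int from;

		for (from = 0; from < NCPU; from++) {
			struct page *pg, *next;

			pthread_mutex_lock(&mbox[cpu][from].lock);
			pg = mbox[cpu][from].head;
			mbox[cpu][from].head = NULL;
			pthread_mutex_unlock(&mbox[cpu][from].lock);

			for (; pg != NULL; pg = next) {
				next = pg->next;
				page_free(cpu, cpu, pg);
			}
		}
	}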
The whole point of this exercise by Sequent was to be able to scale to
more than the default 8 processors (I believe the APIC ID is a 4 bit
value -- placing an upper limit of 16 processors on Intel MP spec
compliant implementations), which is where the SVR4/Solaris
cost/benefit ratio breaks down because of the immense amount of bus
activity needed to perform interprocessor synchronization through a
global VM lock.

The Sequent model fails because it is medium-granularity: in point of
fact, the FS was not reentrant; this can be proven by starting multiple
finds, which will run on different processors, yet run separately and
sequentially to completion because of the non-reentrancy of the FS on a
per-processor basis.  This is utterly bogus (the 160% number from SVR4
should prove that).  The Sequent model also fails because it does not
allocate zones in terms of slabs.

In terms of practical effect, we will have a global lock for the
system, a non-cacheable sub-hierarchy for system-wide resources which
can be allocated to a given processor (no contention unless two
processors try to refill their page pools at the same time, etc.), and
a cacheable sub-hierarchy per processor.

If a process references a resource owned by a processor other than the
one it is on, it's no problem, unless it is an IPC area (which must
either be non-cacheable or have MESI, not MEI, update support in the
cache hardware).  If, on the other hand, it tries to discard the
resource, then the discard must either go through the processor that
the resource was allocated from, or there must be a synchronization
update to tell the owning processor that the reference has been
discarded by another processor (unlikely, assuming a scheduler
implementation with processor affinity).

In any case, the page allocation map can preferentially be updated
while the global mutex is held to fill the per-processor page pool (or
to empty it, if it hits the high-water mark), so a processor can know,
on a given discard, who owns the page where the allocation lives.
Obviously, since the allocation is to the process (/kernel thread),
only the valid owner can discard it.  What is mutexed is the allocation
bitmap for the slab, and that can be done via messaging, to be run in
the idle loop of the processor... processor A releasing an allocation
that was made for the process by processor B need not cause a
synchronization event for any other processor in the system.  (This
implies an IPC mechanism using non-cacheable pages, one page per
processor per processor, and a third lock sub-hierarchy in a non-cached
page, to be used when processors A and B are contending for the IPC
area -- only in the idle loop: a total of 260k for 8 processors
(8 * 8 * 4k + 4k), roughly 1M for 16 processors, and 4M for 32
processors.)  If we have to worry about more than that, we can
establish communications clusters... would that we had someone beating
down our door with 64 or more processor hardware.  8-).

I think it is ultimately unreasonable to place a granularity limit, and
therefore a 4-8 processor limit, on the architecture.  I also think
that any granularity limit inherently increases bus contention, and
even with Active Server technology coming on line quickly, the limit in
the end is still going to be I/O binding, not compute binding.
Anything we can do to reduce bus contention *must* be a win.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.