From owner-freebsd-smp Sun Feb 2 11:48:28 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.5/8.8.5)
	id LAA07402 for smp-outgoing; Sun, 2 Feb 1997 11:48:28 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
	by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id LAA07387
	for ; Sun, 2 Feb 1997 11:48:24 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
	id MAA08273; Sun, 2 Feb 1997 12:44:50 -0700
From: Terry Lambert
Message-Id: <199702021944.MAA08273@phaeton.artisoft.com>
Subject: Re: SMP
To: davem@jenolan.rutgers.edu (David S. Miller)
Date: Sun, 2 Feb 1997 12:44:50 -0700 (MST)
Cc: michaelh@cet.co.jp, netdev@roxanne.nuclecu.unam.mx, roque@di.fc.ul.pt,
	freebsd-smp@freebsd.org
In-Reply-To: <199702021202.HAA09281@jenolan.caipgeneral>
	from "David S. Miller" at Feb 2, 97 07:02:25 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> It almost sounds like there are cases where "short holds" and "less
> contention" are hard to achieve.  Can you give us an example?  Or
> are you saying that spending time on contention minimization is not
> very fruitful?
>
> It is hard to achieve in certain circumstances, yes, but it is worth
> putting some effort towards, just not "too much" if things begin to
> look a bit abysmal.

This is why I suggested a data-flow abstraction be the first step,
instead of an object abstraction.

What you really want to do is lock "dangerous" areas: the places
where you do shared resource manipulations within a subsystem within
a context.  This is an important distinction: not all objects are
shared, so it is not necessary to have all objects locked; indeed, an
object may be shared in one context and unshared in another.  One
good example is probably directory entry manipulation in an FS
subsystem in the VFS system in a kernel.

This leads to thinking of contexts in terms of "work to do", and
scheduling the CPU as a shared resource against the "work to do".  If
you further abstract this to scheduling quantum against "work to do",
you "magically" take into account kernel as well as user execution
contexts.

Typically, you would have the kernel stack, registers, and program
counter tracked via a "kernel thread"... really, a kernel schedulable
entity.  Cache coherency is tracked against CPUs... it's important to
note at this point the existence of MEI PPC SMP implementations
without an L2 cache.

User processes are a consumer of kernel schedulable entities.

For instance, say you wanted minimal context switch overhead, so you
design a threading system in which all system calls can be
asynchronous.  You provide user space with a "wait for completion"
system call as the only blocking system call for an otherwise
asynchronous call gate.  You implement the asynchronous call gate by
handing off the blocking work to a kernel schedulable entity, and let
*it* block on your behalf.

Quantum is now counted in terms of time in user space, and once you
are given a quantum in your multithreaded application, you consume as
much of the quantum as you have work to do.  Another way of looking
at it is: "once the system *GIVES* me a quantum, by *GOD* it's *MY*
quantum, and I should not be forced to give it away simply because I
have made a system call... it is an unacceptable *PUNISHMENT* for
making system calls".
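Purely to make the shape of that concrete, here is a minimal sketch
in C of what the user-space side of such a call gate might look like.
Every name in it (async_call(), wait_for_completion(), aio_ticket_t,
run_other_user_threads()) is invented for illustration; none of this
is an existing interface.

	/*
	 * Sketch only: async_call(), wait_for_completion(), and
	 * aio_ticket_t are invented names for the model described
	 * above, not any real API.
	 */
	typedef int aio_ticket_t;

	/* Hand the blocking work off to a kernel schedulable entity;
	 * return at once with a ticket for the in-flight call. */
	extern aio_ticket_t async_call(int call_no, void *args);

	/* The one blocking system call: sleep until some queued call
	 * completes; report its ticket and its result. */
	extern aio_ticket_t wait_for_completion(int *resultp);

	/* User-space cooperative scheduler; runs any ready thread. */
	extern void run_other_user_threads(void);

	/*
	 * Issue a call without surrendering the quantum: queue it,
	 * keep running user threads, and block only when there is no
	 * other work left to do.
	 */
	int
	call_without_yielding(int call_no, void *args)
	{
		aio_ticket_t mine = async_call(call_no, args);
		int result;

		/* Simplification: a real dispatcher would hand other
		 * threads' completions back to their owners. */
		while (wait_for_completion(&result) != mine)
			run_other_user_threads();
		return (result);
	}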
Clearly, to accommodate this, you would have to have the ability to
regulate the number of kernel schedulable entities which could be in
user space at one time (the number of "kernel threads" a "process"
has).

You would also need the ability to cooperatively schedule in user
space... you would implement this by giving "the execution context"
(really, a kernel schedulable entity which has migrated to user space
through the call gate) to a user thread that could run.  Then, when a
thread makes a blocking call, you convert it to a non-blocking call
plus a context switch (on SPARC, you also flush register windows,
etc., to implement the context switch).

Each kernel schedulable entity permitted in user space for a
"process" competes for a process quantum just as a historically
implemented process would.  SMP scaling comes from the kernel
schedulable entities in a "process" potentially having a separate CPU
resource scheduled against them.  Clearly, locking occurs only in
interprocessor synchronization cases, as in migration of a kernel
schedulable entity from one processor to another, and/or for global
pool access.

A user process exists as an address space mapping, an async
completion queue, and a K.S.E. -> user space quota, which can be
referenced by any number of K.S.E.'s in the kernel, and some
quota-number of K.S.E.'s in user space.

So I don't lock when I enter an FS lookup, and I don't lock when I
traverse a directory for read (a directory block from a read is an
idempotent snapshot of the directory state), and I *do* lock when I
allocate a vnode reference instance, so that I can kick the reference
count.  And yes, I can make the object "delete itself" on a 1->0
reference count transition, if I want to, or I can lock the vnode
pool pointer in the vrele case instead, and get a minimum hold time
by having it handle the 1->0 transition.

> As Terry Lambert pointed out, even if
> you have per-processor pools you run into many cache pulls between
> processors when you are on one cpu, allocate a resource, get scheduled
> on another and then free it from there.  (the TLB invalidation cases
> he described are not an issue at all for us on the kernel side, brain
> damaged traditional unix kernel memory designs eat this overhead, we
> won't, other than vmalloc() chunks we eat no mapping setup and
> destruction overhead at all and thus nothing to keep "consistent"
> with IPI's since it is not changing)

???  The problem I was describing was an adjacency-of-allocation
problem, not a mapping setup issue.  Unless you are marking your
allocation areas non-pageable (a really big lose, IMO), you *will*
have this problem in one form or another.

For instance, I get a page of memory, and I allocate two 50-byte
objects out of it.  I modify both of the objects, then I give the
second object to another processor.  The other processor modifies the
second object, and the first processor modifies the first object
again.

In theory, there will be a cache overlap, where the cache line on
CPU 1 contains stale data for object two, and the cache line on CPU 2
contains stale data for object one.  When either cache line is
written through, the other object will be damaged, right?  Not
immediately, but in the case of a cache line reload; in other words,
in the general case of a process context switch with a rather full
ready-to-run queue, with the resource held such that it goes out of
scope in the cache.

How do you resolve this?
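To make the adjacency concrete, here is a hedged sketch in C.  The
64-byte line size and the 4096-byte page are illustrative assumptions
only, and line_aligned_alloc() below is just the conventional padding
answer, not a claim about how any particular kernel does it.

	#include <stdlib.h>
	#include <stdint.h>

	#define CACHE_LINE	64	/* assumed line size */

	struct obj {
		char data[50];
	};

	/*
	 * The hazard: two 50-byte objects carved from one page
	 * straddle a cache line.  'a' spans bytes 0-49 and 'b'
	 * spans bytes 50-99, so bytes 50-63 of the first line
	 * belong to 'b' while the rest of that line belongs to
	 * 'a'; CPU 1 dirtying 'a' and CPU 2 dirtying 'b' now write
	 * back conflicting copies of the shared line.
	 */
	void
	show_adjacency_hazard(void)
	{
		char *page = malloc(4096);
		struct obj *a, *b;

		if (page == NULL)
			return;
		a = (struct obj *)page;
		b = (struct obj *)(page + sizeof(*a));
		(void)a;	/* modified on CPU 1 */
		(void)b;	/* handed to, and modified on, CPU 2 */
	}

	/*
	 * The padding resolution: round every object up to a line
	 * boundary so no two CPUs ever dirty the same line.  (The
	 * raw pointer is dropped here for brevity; a real allocator
	 * would keep it so the object could be freed later.)
	 */
	void *
	line_aligned_alloc(size_t size)
	{
		char *raw = malloc(size + CACHE_LINE);

		if (raw == NULL)
			return (NULL);
		return ((void *)(((uintptr_t)raw + CACHE_LINE - 1) &
		    ~(uintptr_t)(CACHE_LINE - 1)));
	}

Padding trades memory for isolation, of course, which is why the pool
questions below still matter: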
Do you allocate only in terms of an execution context, so each
execution context has a kernel page pool, and then assume that if the
execution context migrates, the page pool will migrate as well?  How
do you handle shared objects, kernel threads, and resource-based IPC?


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.