From owner-freebsd-smp Mon Dec 16 10:29:34 1996
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.4/8.8.4)
	id KAA23052 for smp-outgoing; Mon, 16 Dec 1996 10:29:34 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
	by freefall.freebsd.org (8.8.4/8.8.4) with SMTP id KAA23047;
	Mon, 16 Dec 1996 10:29:26 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
	id LAA01626; Mon, 16 Dec 1996 11:27:15 -0700
From: Terry Lambert
Message-Id: <199612161827.LAA01626@phaeton.artisoft.com>
Subject: Re: some questions concerning TLB shootdowns in FreeBSD
To: toor@dyson.iquest.net (John S. Dyson)
Date: Mon, 16 Dec 1996 11:27:14 -0700 (MST)
Cc: terry@lambert.org, toor@dyson.iquest.net, phk@critter.tfs.com,
	peter@spinner.dialix.com, dyson@freebsd.org, smp@freebsd.org,
	haertel@ichips.intel.com
In-Reply-To: <199612160058.TAA05793@dyson.iquest.net> from "John S. Dyson"
	at Dec 15, 96 07:58:05 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> Terry, I am NOT discounting your suggestions -- but I am bringing up
> challenges associated with your suggestions.

OK...  8-).

> > Some potential optimizations:
> >
> > 1)	This only applies to written pages not marked copy-on-write;
> >	not to read-only pages or to pages that will be copied on
> >	write (like those in your note about "address spaces are
> >	already shared").
>
> You still need to be able to globally invalidate pages that are
> mapped read-only (e.g. paging out, where things can happen anytime,
> or object reclamation -- which of course doesn't happen unless an
> object is done with).

OK, I buy this one.  Not only would the object have to be done with,
it would also have to be at the end of the LRU to get forced out,
right?  In that particular case, it might be worthwhile to invalidate
down to a low water mark of unreferenced pages when a reclaim is
necessary.

> > 2)	Flushing can be "lazy" in most cases.  That is, the page could
> >	be marked invalid for a particular CPU, and only flushed if
> >	that CPU needs to use it.  For a generic first-time
> >	implementation, a single unsigned long with CPU invalidity
> >	bits could be added to the page attributes (first-time because
> >	it imposes a 32-processor limit, which I feel is an
> >	unacceptable limitation -- I want to run on Connection
> >	Machines some day).  For the most part, it is important to
> >	realize that this is a negative validity indicator.  This
> >	dictates who has to do the work: the CPU that wants to access
> >	the page.  The higher the process's CPU affinity, the less
> >	often this will happen.
>
> There is no special indication from the page tables that a processor
> has loaded a TLB entry from them.  So, once a page table entry is
> created, we have no indication of when a processor grabs it (I seem
> to remember that there are ways of coercing P6's, and perhaps P5's,
> into doing it, though).  The only way I would try to do it is to get
> information from Intel saying that the method is "blessed."  It could
> break things for other (non-Intel) CPU's, though.

By default, you mark all CPU's but the one doing the initial mapping
as "invalid".  If the execution context needing the page stays on that
CPU, you never have to update anyone.
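As a rough sketch (made-up names, not the real pmap code, with the TLB
and atomic primitives stubbed out), the per-page bookkeeping for the
negative validity bits is about this much:

	/*
	 * Sketch only.  One "stale" bit per CPU, kept with the page
	 * attributes -- the negative validity indicator from point 2
	 * above.  The CPU creating or changing a mapping marks every
	 * other CPU stale; each CPU pays for its own flush lazily,
	 * the first time it actually needs the page.
	 */
	#include <sys/types.h>

	#define MAXCPU		32	/* first-cut limit, as noted above */

	struct pg_attr_sketch {
		volatile u_int32_t pga_stale;	/* bit N set => CPU N must flush */
		/* ... the rest of the normal page attributes ... */
	};

	/* stubs standing in for whatever primitives the port provides */
	extern int	my_cpu(void);		/* index of the current CPU */
	extern void	invlpg_local(void *va);	/* flush one local TLB entry */

	/* CPU creating (or changing) the mapping: everyone else is now stale. */
	static void
	pga_mark_others_stale(struct pg_attr_sketch *pga)
	{
		u_int32_t me = 1U << my_cpu();

		pga->pga_stale |= ~me;		/* atomic op in real life */
	}

	/*
	 * A CPU about to rely on the mapping (fault path, or when a
	 * context migrates onto it): flush only if we were marked.
	 */
	static void
	pga_lazy_flush(struct pg_attr_sketch *pga, void *va)
	{
		u_int32_t me = 1U << my_cpu();

		if (pga->pga_stale & me) {
			invlpg_local(va);
			pga->pga_stale &= ~me;	/* atomic clear in real life */
		}
	}

There is no IPI anywhere in that path; the CPU that wants to use the
page is the one that does the work.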
Only when you move an execution context between CPU's do you have to
update anything, and then only on the target CPU (and you can
potentially preinvalidate the entries on the source CPU).  Shared
pages are more problematic, but at least only the CPU's whose
invalidity bits are clear (i.e., which may still hold a valid mapping)
need to participate in the IPI hold.

This presumes some (minor) scheduler hooks and per-CPU run queues,
which I admit is slightly different from what there is now.  The
scheduler would only choose to move a process between queues based on
thread non-affinity (ie: multiple kernel threads trying for SMP
scalability) or to load-balance the overall system.  In both cases, it
knows about the mappings on the source CPU and so can tell the target
CPU about them.  The point is that the mappings on all CPU's need not
be identical, only valid for the execution contexts in that CPU's run
queue.  You consider the kernel itself a single shared execution
context, but the kernel mapping only has to be done once, up front, at
load time, or again at module load time.

As CPU's destroy page references, they must hold a mutex (or the
reference counts, which are only examined on create/destroy/change_CPU,
may be placed in non-cacheable memory).  If the bitmap goes to zero
references, the CPU doing the decrement returns the page to the global
pool.  Is there a reason that the CPU destroying its last reference
would not be able to invalidate for itself at that time?

Are you concerned that what is needed is "non-L2-cacheable"?  If so, I
don't think it's necessary.  If it is necessary, then you're right, it
would cause problems on other CPU's... the BeBox, a dual PPC603
machine, adds the second CPU in place of an L2 cache.  As a result, it
only supports the MEI coherency protocol, *not* MESI.  This would be a
problem.

On the other hand, it's like soft updates: you implement the calls at
the synchronization nodes in the event graph, and for a trivial
implementation you synchronously complete the call at the time it is
made, instead of when the first cycle would occur in the graph element
list.  For processors that support the TLB update, you encapsulate the
"blessed" code; for those that don't, you do the update immediately.
The code making the call doesn't know what the macro will make it do
on a given processor; the macro is only a synchronization point, and
the act of resolving the synchronization itself is CPU specific (and
therefore opaque).

Personally, I don't think TLB update is an issue for multiple CPU's as
long as there are per-CPU reference counts and a global bitmap that
indicates whether a given CPU has any reference counts on it (the
point being to avoid locking and to hide the non-cacheable references
in the expensive operations you already have to do -- even the per-CPU
count can be cacheable, since it's private to that CPU).


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
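A minimal sketch of the per-CPU reference counts and global bitmap
described above -- the names are made up rather than taken from the
FreeBSD source, the atomic and TLB primitives are stubs, and the
bitmap is drawn here with "bit set" meaning the CPU still holds
references:

	/*
	 * Sketch only.  Per-CPU reference counts stay private to each
	 * CPU (so they can remain cacheable); the only shared state is
	 * one bitmap of which CPU's still hold references.  A CPU
	 * dropping its last reference invalidates its own TLB entry
	 * and clears its bit; whoever clears the last bit returns the
	 * page to the global pool.
	 */
	#include <sys/types.h>

	#define MAXCPU	32

	struct shared_pg_sketch {
		volatile u_int32_t spg_cpumask;		/* bit N set => CPU N holds refs */
		u_int32_t	   spg_refs[MAXCPU];	/* CPU-private counts */
	};

	/* stubs for the primitives the port would really provide */
	extern int	my_cpu(void);
	extern void	invlpg_local(void *va);
	extern u_int32_t atomic_clear_mask(volatile u_int32_t *w, u_int32_t m);
						/* clears m, returns new value */
	extern void	page_release(struct shared_pg_sketch *spg);

	/* assumes the context can't migrate in the middle of the update */
	static void
	spg_ref(struct shared_pg_sketch *spg)
	{
		int me = my_cpu();

		if (spg->spg_refs[me]++ == 0)
			spg->spg_cpumask |= 1U << me;	/* atomic set in real life */
	}

	static void
	spg_unref(struct shared_pg_sketch *spg, void *va)
	{
		int me = my_cpu();

		if (--spg->spg_refs[me] == 0) {
			invlpg_local(va);	/* invalidate for ourselves, now */
			if (atomic_clear_mask(&spg->spg_cpumask, 1U << me) == 0)
				page_release(spg);	/* last one out: back to the pool */
		}
	}

The common path touches only the CPU-private counter, so it stays
cacheable; the shared bitmap is touched only on a CPU's first
reference and last dereference, which are the expensive operations
where a non-cacheable or atomic access can be hidden.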