From owner-freebsd-smp Sun Dec 15 12:56:34 1996
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.4/8.8.4) id MAA29285 for smp-outgoing; Sun, 15 Dec 1996 12:56:34 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.8.4/8.8.4) with SMTP id MAA29253; Sun, 15 Dec 1996 12:56:23 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id NAA23823; Sun, 15 Dec 1996 13:32:03 -0700
From: Terry Lambert
Message-Id: <199612152032.NAA23823@phaeton.artisoft.com>
Subject: Re: some questions concerning TLB shootdowns in FreeBSD
To: toor@dyson.iquest.net (John S. Dyson)
Date: Sun, 15 Dec 1996 13:32:03 -0700 (MST)
Cc: toor@dyson.iquest.net, phk@critter.tfs.com, peter@spinner.dialix.com, dyson@freebsd.org, smp@freebsd.org, haertel@ichips.intel.com
In-Reply-To: <199612151654.LAA05078@dyson.iquest.net> from "John S. Dyson" at Dec 15, 96 11:54:22 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> > This won't work because processes seldom have the entire address space
> > shared (vm_refcnt.)  I am sure that when we get true kernel
> > multithreading that will not be true though.  In order to test if a
> > section of an address space is shared, you have to do something like
> > this (and this can take LOTS of time.)  (I might have levels of
> > indirection off here; I am also not taking into account submaps --
> > which complicate the code further, by entailing recursively calling
> > the map/object traversal again -- but recursion is a major no-no in
> > the kernel, as we have found.)
>
> Note that I do see that you were talking about shared address spaces,
> but address spaces are already partially shared.  To do the thing
> completely requires traversing a lot of the VM data structures.
> I would suggest that a coarser grained scheme for pmap_update (invtlb)
> be considered in the case of SMP.  Also (Peter's?) suggestion that we
> have individual alternate page tables (and temporary mapping pages)
> for each CPU has merit.
>
> It is likely that large numbers of TLB flushes could be eliminated
> if the above were implemented.  Since global TLB flushes are going to
> be fairly expensive, let's minimize them -- but scanning the VM
> data structures is going to be expensive no matter how we do it.
>
> Note that I have put individual page invalidates into pmap -- we
> usually need to remove those in the SMP code.  (There are some
> special mapping pages where we should probably continue doing
> the page invalidates -- but those should also be per-CPU.)

Some potential optimizations:

1)	This only applies to written pages not marked copy-on-write;
	read-only pages and pages that will be copied on write (like
	those in your note about "address spaces are already shared")
	don't need a shootdown.

2)	Flushing can be "lazy" in most cases.  That is, the page could
	be marked invalid for a particular CPU, and only flushed if
	that CPU needs to use it.  For a generic first-time
	implementation, a single unsigned long with per-CPU invalidity
	bits could be added to the page attributes (first-time because
	it imposes a 32 processor limit, which I feel is an
	unacceptable limitation -- I want to run on Connection
	Machines some day).  For the most part, it is important to
	realize that this is a negative validity indicator.  This
	dictates who has to do the work: the CPU that wants to access
	the page.  The higher the process's CPU affinity, the less
	often this will happen.

3)	For processes with shared address space, the common data area
	for all CPUs should be grown.  Yes, I realize this means a
	separate virtual address space for CPU-private, CPU-shared,
	and per-process user space addressing.
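The lazy scheme in (2) might look roughly like the following.  This is
only a sketch of the idea, not real pmap code; the names (vm_page,
pmap_lazy_invalidate, pmap_lazy_check) are made up, and a real kernel
would need atomic operations and the actual invlpg on the reader side.

```c
#include <assert.h>

#define MAXCPU	32	/* one bit per CPU -- the 32 processor limit noted above */

/* Hypothetical page attribute: a negative validity indicator. */
struct vm_page {
	unsigned long	invalid_mask;	/* bit N set => CPU N's TLB entry is stale */
};

/*
 * Writer side: instead of IPI'ing every CPU to shoot the entry down,
 * mark the page stale for every CPU except the one doing the write.
 */
static void
pmap_lazy_invalidate(struct vm_page *pg, int curcpu)
{
	pg->invalid_mask = ~0UL & ~(1UL << curcpu);
}

/*
 * Reader side: the CPU that wants to access the page does the work.
 * Flush locally only if this CPU's invalidity bit is set.
 */
static int
pmap_lazy_check(struct vm_page *pg, int curcpu)
{
	if (pg->invalid_mask & (1UL << curcpu)) {
		/* a real implementation would invlpg the VA here */
		pg->invalid_mask &= ~(1UL << curcpu);
		return 1;	/* a local flush was needed */
	}
	return 0;		/* TLB entry still valid on this CPU */
}
```

Note that the writer touches one word and no IPIs are sent; the cost
moves entirely to CPUs that actually touch the page, which is exactly
why higher CPU affinity makes the flush happen less often.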
	This is less of a burden than you might think, if you divorce
	the kernel stack from the idea of a process and place it
	squarely on the head of kernel threads.  For a blocking call,
	a per-CPU thread pool can be used as a context container.
	This would require another bitmap on sleep events so that the
	CPUs affected can be notified without blocking everyone.

4)	One obvious consequence of a per-CPU thread pool approach
	(which I also think is acceptable) is that the wakeup must be
	processed on the CPU on which it went to sleep.
	Theoretically, this should mean very little, since the CPU
	returning a kernel thread is not necessarily bound to the
	process.  Practically, it probably means that rebinding the
	CPU for a process can only occur on a blocking system call
	entry (by choosing which processor acquires the kernel thread
	to handle the blocking call) or on an involuntary context
	switch.

	This is acceptable, since a process making a blocking call or
	for which the CPU has been involuntarily relinquished will not
	feel the calculation overhead of the decision; it will be
	buried in the latency before it is next scheduled to run.  For
	an involuntary context switch, this takes the form of picking
	which CPU's run queue to insert the process on.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
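P.S.  A sketch of the rebinding decision in (4), i.e. picking which
CPU's run queue to insert the process on.  All names here (runq_len,
sched_pick_cpu, the migration threshold) are hypothetical, just to show
the shape of the decision: prefer the last CPU for affinity (so the
lazy TLB and cache state stay useful), and migrate only when another
queue is strictly shorter even after accounting for the move.

```c
#include <assert.h>

#define MAXCPU	32

/* Hypothetical per-CPU run queue depths. */
static int runq_len[MAXCPU];

/*
 * Called on a blocking system call entry or an involuntary context
 * switch -- the two points where rebinding can occur, so the cost of
 * the scan is buried in the latency before the next run.
 */
static int
sched_pick_cpu(int lastcpu, int ncpus)
{
	int cpu, best = lastcpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		/* "+ 1" biases toward staying put: migrate only for a clear win */
		if (runq_len[cpu] + 1 < runq_len[best])
			best = cpu;
	}
	return best;
}
```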