From owner-freebsd-smp Mon Dec 16 10:29:34 1996
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.4/8.8.4)
	id KAA23052 for smp-outgoing; Mon, 16 Dec 1996 10:29:34 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
	by freefall.freebsd.org (8.8.4/8.8.4) with SMTP id KAA23047;
	Mon, 16 Dec 1996 10:29:26 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
	id LAA01626; Mon, 16 Dec 1996 11:27:15 -0700
From: Terry Lambert
Message-Id: <199612161827.LAA01626@phaeton.artisoft.com>
Subject: Re: some questions concerning TLB shootdowns in FreeBSD
To: toor@dyson.iquest.net (John S. Dyson)
Date: Mon, 16 Dec 1996 11:27:14 -0700 (MST)
Cc: terry@lambert.org, toor@dyson.iquest.net, phk@critter.tfs.com,
	peter@spinner.dialix.com, dyson@freebsd.org, smp@freebsd.org,
	haertel@ichips.intel.com
In-Reply-To: <199612160058.TAA05793@dyson.iquest.net> from "John S. Dyson"
	at Dec 15, 96 07:58:05 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> Terry, I am NOT discounting your suggestions -- but I am bringing up
> challenges associated with your suggestions.

OK...  8-).

> > Some potential optimizations:
> >
> > 1)	This only applies to written pages not marked copy-on-write;
> >	not to read-only pages or to pages that will be copied on
> >	write (like those in your note about "address spaces are
> >	already shared").
>
> You still need to be able to globally invalidate pages that are
> mapped read-only (e.g. paging out, where things can happen anytime,
> or object reclamation -- which of course doesn't happen unless an
> object is done with).

OK, I buy this one.  Not only would the object have to be done with,
it would also have to be at the end of the LRU to get forced out,
right?  In that particular case, it might be worthwhile to invalidate
down to a low water mark of unreferenced pages when a reclaim is
necessary.

> > 2)	Flushing can be "lazy" in most cases.  That is, the page could
> >	be marked invalid for a particular CPU, and only flushed if
> >	that CPU needs to use it.  For a generic first-time
> >	implementation, a single unsigned long with CPU invalidity
> >	bits could be added to the page attributes (first-time because
> >	it imposes a 32-processor limit, which I feel is an
> >	unacceptable limitation -- I want to run on Connection
> >	Machines some day).  For the most part, it is important to
> >	realize that this is a negative validity indicator.  This
> >	dictates who has to do the work: the CPU that wants to access
> >	the page.  The higher the process's CPU affinity, the less
> >	often this will happen.
>
> There is no special indication from the page tables that a processor
> has loaded a TLB entry from them.  So, once a page table entry is
> created, we have no indication of when a processor grabs it (I seem
> to remember that there are ways of coercing P6's, and perhaps P5's,
> into doing it, though).  The only way I would try to do it is to get
> information from Intel saying that the method is "blessed."  It could
> break things for other (non-Intel) CPU's, though.

By default, you mark all CPU's but the one doing the initial mapping
as "invalid".  If the execution context needing the page stays on that
CPU, you never have to update anyone.
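As a rough sketch (made-up names, not the real pmap code, with the TLB
and atomic primitives stubbed out), the per-page bookkeeping for the
negative validity bits is about this much:

	/*
	 * Sketch only.  One "stale" bit per CPU, kept with the page
	 * attributes -- the negative validity indicator from point 2
	 * above.  The CPU creating or changing a mapping marks every
	 * other CPU stale; each CPU pays for its own flush lazily,
	 * the first time it actually needs the page.
	 */
	#include <sys/types.h>

	#define MAXCPU		32	/* first-cut limit, as noted above */

	struct pg_attr_sketch {
		volatile u_int32_t pga_stale;	/* bit N set => CPU N must flush */
		/* ... the rest of the normal page attributes ... */
	};

	/* stubs standing in for whatever primitives the port provides */
	extern int	my_cpu(void);		/* index of the current CPU */
	extern void	invlpg_local(void *va);	/* flush one local TLB entry */

	/* CPU creating (or changing) the mapping: everyone else is now stale. */
	static void
	pga_mark_others_stale(struct pg_attr_sketch *pga)
	{
		u_int32_t me = 1U << my_cpu();

		pga->pga_stale |= ~me;		/* atomic op in real life */
	}

	/*
	 * A CPU about to rely on the mapping (fault path, or when a
	 * context migrates onto it): flush only if we were marked.
	 */
	static void
	pga_lazy_flush(struct pg_attr_sketch *pga, void *va)
	{
		u_int32_t me = 1U << my_cpu();

		if (pga->pga_stale & me) {
			invlpg_local(va);
			pga->pga_stale &= ~me;	/* atomic clear in real life */
		}
	}

There is no IPI anywhere in that path; the CPU that wants to use the
page is the one that does the work.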
Only when you move an execution context between CPU's do you have to
update anything, and then only on the target CPU (and you can
potentially preinvalidate the entries on the source CPU).  Shared
pages are more problematic, but at least only the CPU's whose
invalidity bits are clear (i.e., which may still hold a valid mapping)
need to participate in the IPI hold.

This presumes some (minor) scheduler hooks and per-CPU run queues,
which I admit is slightly different from what there is now.  The
scheduler would only choose to move a process between queues based on
thread non-affinity (ie: multiple kernel threads trying for SMP
scalability) or to load-balance the overall system.  In both cases, it
knows about the mappings on the source CPU and so can tell the target
CPU about them.  The point is that the mappings on all CPU's need not
be identical, only valid for the execution contexts in that CPU's run
queue.  You consider the kernel itself a single shared execution
context, but the kernel mapping only has to be done once, up front, at
load time, or again at module load time.

As CPU's destroy page references, they must hold a mutex (or the
reference counts, which are only examined on create/destroy/change_CPU,
may be placed in non-cacheable memory).  If the bitmap goes to zero
references, the CPU doing the decrement returns the page to the global
pool.  Is there a reason that the CPU destroying its last reference
would not be able to invalidate for itself at that time?

Are you concerned that what is needed is "non-L2-cacheable"?  If so, I
don't think it's necessary.  If it is necessary, then you're right, it
would cause problems on other CPU's... the BeBox, a dual PPC603
machine, adds the second CPU in place of an L2 cache.  As a result, it
only supports the MEI coherency protocol, *not* MESI.  This would be a
problem.

On the other hand, it's like soft updates: you implement the calls at
the synchronization nodes in the event graph, and for a trivial
implementation you synchronously complete the call at the time it is
made, instead of when the first cycle would occur in the graph element
list.  For processors that support the TLB update, you encapsulate the
"blessed" code; for those that don't, you do the update immediately.
The code making the call doesn't know what the macro will make it do
on a given processor; the macro is only a synchronization point, and
the act of resolving the synchronization itself is CPU specific (and
therefore opaque).

Personally, I don't think TLB update is an issue for multiple CPU's as
long as there are per-CPU reference counts and a global bitmap that
indicates whether a given CPU has any reference counts on it (the
point being to avoid locking and to hide the non-cacheable references
in the expensive operations you already have to do -- even the per-CPU
count can be cacheable, since it's private to that CPU).


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
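A minimal sketch of the per-CPU reference counts and global bitmap
described above -- the names are made up rather than taken from the
FreeBSD source, the atomic and TLB primitives are stubs, and the
bitmap is drawn here with "bit set" meaning the CPU still holds
references:

	/*
	 * Sketch only.  Per-CPU reference counts stay private to each
	 * CPU (so they can remain cacheable); the only shared state is
	 * one bitmap of which CPU's still hold references.  A CPU
	 * dropping its last reference invalidates its own TLB entry
	 * and clears its bit; whoever clears the last bit returns the
	 * page to the global pool.
	 */
	#include <sys/types.h>

	#define MAXCPU	32

	struct shared_pg_sketch {
		volatile u_int32_t spg_cpumask;		/* bit N set => CPU N holds refs */
		u_int32_t	   spg_refs[MAXCPU];	/* CPU-private counts */
	};

	/* stubs for the primitives the port would really provide */
	extern int	my_cpu(void);
	extern void	invlpg_local(void *va);
	extern u_int32_t atomic_clear_mask(volatile u_int32_t *w, u_int32_t m);
						/* clears m, returns new value */
	extern void	page_release(struct shared_pg_sketch *spg);

	/* assumes the context can't migrate in the middle of the update */
	static void
	spg_ref(struct shared_pg_sketch *spg)
	{
		int me = my_cpu();

		if (spg->spg_refs[me]++ == 0)
			spg->spg_cpumask |= 1U << me;	/* atomic set in real life */
	}

	static void
	spg_unref(struct shared_pg_sketch *spg, void *va)
	{
		int me = my_cpu();

		if (--spg->spg_refs[me] == 0) {
			invlpg_local(va);	/* invalidate for ourselves, now */
			if (atomic_clear_mask(&spg->spg_cpumask, 1U << me) == 0)
				page_release(spg);	/* last one out: back to the pool */
		}
	}

The common path touches only the CPU-private counter, so it stays
cacheable; the shared bitmap is touched only on a CPU's first
reference and last dereference, which are the expensive operations
where a non-cacheable or atomic access can be hidden.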