From owner-freebsd-smp Sun Feb 2 11:48:28 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.5/8.8.5)
	id LAA07402 for smp-outgoing; Sun, 2 Feb 1997 11:48:28 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
	by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id LAA07387
	for ; Sun, 2 Feb 1997 11:48:24 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
	id MAA08273; Sun, 2 Feb 1997 12:44:50 -0700
From: Terry Lambert
Message-Id: <199702021944.MAA08273@phaeton.artisoft.com>
Subject: Re: SMP
To: davem@jenolan.rutgers.edu (David S. Miller)
Date: Sun, 2 Feb 1997 12:44:50 -0700 (MST)
Cc: michaelh@cet.co.jp, netdev@roxanne.nuclecu.unam.mx, roque@di.fc.ul.pt,
	freebsd-smp@freebsd.org
In-Reply-To: <199702021202.HAA09281@jenolan.caipgeneral>
	from "David S. Miller" at Feb 2, 97 07:02:25 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> It almost sounds like there are cases where "short holds" and "less
> contention" are hard to achieve.  Can you give us an example?  Or
> are you saying that spending time on contention minimization is not
> very fruitful?
>
> It is hard to achieve in certain circumstances, yes, but it is worth
> putting some effort towards, just not "too much" if things begin to
> look a bit abysmal.

This is why I suggested a data-flow abstraction be the first step,
instead of an object abstraction.

What you really want to do is lock "dangerous" areas: the places
where you do shared resource manipulations within a subsystem within
a context.  This is an important distinction: not all objects are
shared, so it is not necessary to have all objects locked; indeed, an
object may be shared in one context and unshared in another.  One
good example is probably directory entry manipulation in an FS
subsystem in the VFS system in a kernel.

This leads to thinking of contexts in terms of "work to do", and
scheduling the CPU as a shared resource against the "work to do".  If
you further abstract this to scheduling quantum against "work to do",
you "magically" take into account kernel as well as user execution
contexts.

Typically, you would have the kernel stack, registers, and program
counter tracked via a "kernel thread"... really, a kernel schedulable
entity.  Cache coherency is tracked against CPUs... it's important to
note at this point the existence of MEI PPC SMP implementations
without an L2 cache.

User processes are a consumer of kernel schedulable entities.

For instance, say you wanted minimal context switch overhead, so you
design a threading system in which all system calls can be
asynchronous.  You provide user space with a "wait for completion"
system call as the only blocking system call for an otherwise
asynchronous call gate.  You implement the asynchronous call gate by
handing off the blocking work to a kernel schedulable entity, and let
*it* block on your behalf.

Quantum is now counted in terms of time in user space, and once you
are given a quantum in your multithreaded application, you consume as
much of the quantum as you have work to do.  Another way of looking
at it is: "once the system *GIVES* me a quantum, by *GOD* it's *MY*
quantum, and I should not be forced to give it away simply because I
have made a system call... it is an unacceptable *PUNISHMENT* for
making system calls".
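Purely to make the shape of that concrete, here is a minimal sketch
in C of what the user-space side of such a call gate might look like.
Every name in it (async_call(), wait_for_completion(), aio_ticket_t,
run_other_user_threads()) is invented for illustration; none of this
is an existing interface.

	/*
	 * Sketch only: async_call(), wait_for_completion(), and
	 * aio_ticket_t are invented names for the model described
	 * above, not any real API.
	 */
	typedef int aio_ticket_t;

	/* Hand the blocking work off to a kernel schedulable entity;
	 * return at once with a ticket for the in-flight call. */
	extern aio_ticket_t async_call(int call_no, void *args);

	/* The one blocking system call: sleep until some queued call
	 * completes; report its ticket and its result. */
	extern aio_ticket_t wait_for_completion(int *resultp);

	/* User-space cooperative scheduler; runs any ready thread. */
	extern void run_other_user_threads(void);

	/*
	 * Issue a call without surrendering the quantum: queue it,
	 * keep running user threads, and block only when there is no
	 * other work left to do.
	 */
	int
	call_without_yielding(int call_no, void *args)
	{
		aio_ticket_t mine = async_call(call_no, args);
		int result;

		/* Simplification: a real dispatcher would hand other
		 * threads' completions back to their owners. */
		while (wait_for_completion(&result) != mine)
			run_other_user_threads();
		return (result);
	}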
Clearly, to accommodate this, you would have to have the ability to
regulate the number of kernel schedulable entities which could be in
user space at one time (the number of "kernel threads" a "process"
has).

You would also need the ability to cooperatively schedule in user
space... you would implement this by giving "the execution context"
(really, a kernel schedulable entity which has migrated to user space
through the call gate) to a user thread that could run.  Then, when a
thread makes a blocking call, you convert it to a non-blocking call
plus a context switch (on SPARC, you also flush register windows,
etc., to implement the context switch).

Each kernel schedulable entity permitted in user space for a
"process" competes for a process quantum just as a historically
implemented process would.  SMP scaling comes from the kernel
schedulable entities in a "process" potentially having a separate CPU
resource scheduled against them.  Clearly, locking occurs only in
interprocessor synchronization cases, as in migration of a kernel
schedulable entity from one processor to another, and/or for global
pool access.

A user process exists as an address space mapping, an async
completion queue, and a K.S.E. -> user space quota, which can be
referenced by any number of K.S.E.'s in the kernel, and some
quota-number of K.S.E.'s in user space.

So I don't lock when I enter an FS lookup, and I don't lock when I
traverse a directory for read (a directory block from a read is an
idempotent snapshot of the directory state), and I *do* lock when I
allocate a vnode reference instance, so that I can kick the reference
count.  And yes, I can make the object "delete itself" on a 1->0
reference count transition, if I want to, or I can lock the vnode
pool pointer in the vrele case instead, and get a minimum hold time
by having it handle the 1->0 transition.

> As Terry Lambert pointed out, even if
> you have per-processor pools you run into many cache pulls between
> processors when you are on one cpu, allocate a resource, get scheduled
> on another and then free it from there.  (the TLB invalidation cases
> he described are not an issue at all for us on the kernel side, brain
> damaged traditional unix kernel memory designs eat this overhead, we
> won't, other than vmalloc() chunks we eat no mapping setup and
> destruction overhead at all and thus nothing to keep "consistent"
> with IPI's since it is not changing)

???  The problem I was describing was an adjacency-of-allocation
problem, not a mapping setup issue.  Unless you are marking your
allocation areas non-pageable (a really big lose, IMO), you *will*
have this problem in one form or another.

For instance, I get a page of memory, and I allocate two 50-byte
objects out of it.  I modify both of the objects, then I give the
second object to another processor.  The other processor modifies the
second object, and the first processor modifies the first object
again.

In theory, there will be a cache overlap, where the cache line on
CPU 1 contains stale data for object two, and the cache line on CPU 2
contains stale data for object one.  When either cache line is
written through, the other object will be damaged, right?  Not
immediately, but in the case of a cache line reload; in other words,
in the general case of a process context switch with a rather full
ready-to-run queue, with the resource held such that it goes out of
scope in the cache.

How do you resolve this?
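To make the adjacency concrete, here is a hedged sketch in C.  The
64-byte line size and the 4096-byte page are illustrative assumptions
only, and line_aligned_alloc() below is just the conventional padding
answer, not a claim about how any particular kernel does it.

	#include <stdlib.h>
	#include <stdint.h>

	#define CACHE_LINE	64	/* assumed line size */

	struct obj {
		char data[50];
	};

	/*
	 * The hazard: two 50-byte objects carved from one page
	 * straddle a cache line.  'a' spans bytes 0-49 and 'b'
	 * spans bytes 50-99, so bytes 50-63 of the first line
	 * belong to 'b' while the rest of that line belongs to
	 * 'a'; CPU 1 dirtying 'a' and CPU 2 dirtying 'b' now write
	 * back conflicting copies of the shared line.
	 */
	void
	show_adjacency_hazard(void)
	{
		char *page = malloc(4096);
		struct obj *a, *b;

		if (page == NULL)
			return;
		a = (struct obj *)page;
		b = (struct obj *)(page + sizeof(*a));
		(void)a;	/* modified on CPU 1 */
		(void)b;	/* handed to, and modified on, CPU 2 */
	}

	/*
	 * The padding resolution: round every object up to a line
	 * boundary so no two CPUs ever dirty the same line.  (The
	 * raw pointer is dropped here for brevity; a real allocator
	 * would keep it so the object could be freed later.)
	 */
	void *
	line_aligned_alloc(size_t size)
	{
		char *raw = malloc(size + CACHE_LINE);

		if (raw == NULL)
			return (NULL);
		return ((void *)(((uintptr_t)raw + CACHE_LINE - 1) &
		    ~(uintptr_t)(CACHE_LINE - 1)));
	}

Padding trades memory for isolation, of course, which is why the pool
questions below still matter: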
Do you allocate only in terms of an execution context, so each
execution context has a kernel page pool, and then assume that if the
execution context migrates, the page pool will migrate as well?  How
do you handle shared objects, kernel threads, and resource-based IPC?


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.