From: Terry Lambert
To: bright@hotjobs.com (Alfred Perlstein)
Cc: hackers@FreeBSD.ORG
Date: Tue, 5 Jan 1999 03:15:38 +0000 (GMT)
Subject: Re: question about re-entrancy.
Message-Id: <199901050315.UAA02641@usr05.primenet.com>
In-Reply-To: from "Alfred Perlstein" at Dec 20, 98 04:36:51 pm

> I'm a bit confused.  I've been watching the discussions going on about
> SMP.  The issue of kernel re-entrancy has been brought up many times.
>
> So here's the question:
>
> I thought the kernel was re-entrant: when a process makes a syscall, its
> thread of execution enters the kernel, where its arguments are validated
> and careful checks are done to ensure security, and then the syscall is
> done.
>
> If during the system call the process is put to sleep or rescheduled,
> another process may enter the kernel.  As long as the splxxx() calls do
> not interfere with each other, you have 2 processes executing in the
> kernel at that point.  (More can enter as well.)
>
> Or... what I'm now thinking is that once a process enters the kernel it
> must run until it exits the system call.  I have a problem with this,
> though: processes such as nfsiod/nfsd enter the kernel through a syscall
> and never exit (according to Design and Implementation), so how then are
> other processes allowed into the kernel?
>
> Can anyone explain this a bit better?

The kernel is process reentrant, but not processor reentrant.  It helps
if you think of CPU cycles as a consumable resource, instead of thinking
of processes as work to do.

There are three ways to enter the kernel:

1)	System call
2)	Interrupt
3)	Fault

Only one CPU can be in the kernel at a time.  If a second CPU attempts
to enter the kernel while another CPU is already in the kernel, it spins
on The Big Giant Lock (tm) ("We didn't make The Big Giant Lock, we only
made it Big and Giant").

In general, there are two cases to consider:

1)	A CPU is already in the kernel
2)	No CPU is in the kernel (all CPUs are in user space)

In the first case, a fault or interrupt is given to the CPU that is
already in the kernel (interrupts run in virtual wire mode on the APICs,
making interrupt processing truly symmetric, unlike other OSes...); in
the second case, it is given to whichever CPU happens to take it.
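To make the "spins on The Big Giant Lock" part concrete, here is a
minimal sketch of that kind of entry gate.  This is not the actual
FreeBSD mplock code (which lives in assembly and also handles recursive
entry); the giant_lock variable and function names are invented for the
example.

/*
 * Illustrative only: a single lock word guards the whole kernel.
 * A CPU that finds it held simply burns cycles until the holder leaves.
 */
#include <stdatomic.h>

static atomic_int giant_lock;	/* 0 = free, 1 = some CPU is in the kernel */

void
giant_enter(void)
{
	/* Spin until we atomically change the lock from free to held. */
	while (atomic_exchange_explicit(&giant_lock, 1, memory_order_acquire))
		;		/* another CPU is in the kernel; wait */
}

void
giant_exit(void)
{
	atomic_store_explicit(&giant_lock, 0, memory_order_release);
}

Every syscall, trap, and interrupt entry would call giant_enter() before
touching kernel data structures, and giant_exit() on the way back out;
that is what makes the kernel "processor non-reentrant".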
In general, it's possible to modify the interrupt processing to allow
both processors in at the same time.  To do this, you have to make sure
that the subset of the kernel runnable at interrupt time is reentrant.
While not entirely trivial, this is a pretty easy task, except for
entries via the system call interrupt (from an SMP perspective, a call
gate is much better than an interrupt for system calls in this case,
since otherwise there's still the need to grab The Big Giant Lock until
you discriminate the interrupt source).  This hasn't been done.

The tangent into spl is a red herring.  The spl mechanism is basically a
method of blocking operations that would "take too long" from
interrupting the current operation: in other words, interrupts and
faults, since those are the only things that can take the processor away
from its assigned task once it is in the kernel.

The reason that multiple processors can't be in the kernel at the same
time is a bit more complicated.  Consider a hard RT (RealTime) system.
In order to meet hard RT requirements, you have to be able to do kernel
preemption.  Kernel preemption is basically the same problem that you
face trying to run two processors in the kernel at the same time, or
trying to run kernel threads, especially when you are trying to
implement solutions for RT priority inversion or trying to IPC between
kernel threads.  The issue raised is how you protect kernel structures
that may be accessed by two processors at the same time... you have to
lock against the other processor.  Basically, the only preemption
mechanisms are faults (which can only occur from user space, except for
the Pentium lockup bug workaround) and interrupts (which occur as a
result of external events, like another machine sending a packet to your
Ethernet interface).

There are a number of serious architectural issues you have to address
to be able to get multiple processes in the kernel at the same time.
Fortunately, there are ways to cheat:

1)	Address interrupt handler reentrancy by doing the minimal amount
	of work at interrupt time, and queueing the interrupt and the
	context for the remaining work for later processing (by a
	highest-priority kernel thread).  Don't ACK (reenable) the
	interrupt to the controller (or APIC) until you have done the
	minimal processing and requeued.  For shared
	(PCI/EISA/MCA/SBUS/other) interrupts, run the minimal code for
	each device that's asserting the interrupt before you ACK it.
	This is the best granularity, since most of this is device bus
	work, and you aren't going to be able to access the device bus
	from more than one CPU at a time anyway (you will need a Device
	Bus Lock for INB/OUTB/INW/OUTW).

2)	Divorce the kernel stack from the proc struct.  This allows you
	to reenter using an "entrancy instance" -- allocating a kernel
	stack from a pool of kernel stacks not explicitly tied to a
	process.  It also allows you to preempt a processor running in
	the kernel and have it go do something else (you will need an
	Inactive Stack List Lock and an Active Stack List Lock so that
	you can grab an existing stack or, if you are all out, allocate
	a new stack and add it to the stack list).  Combine this with an
	async call gate (basically a VMS-style "event flag" or "AST"
	mechanism), and you don't need kernel threads to spread process
	load over multiple processors.

3)	Create per-CPU resource pools.  This is described in
	excruciating detail in the Dynix VM allocator paper (Dynix, if
	you will remember, scaled to 32 CPUs without serious
	degradation).  You allocate (and free) resources from the
	per-CPU pool based on low (and high) watermarks, instead of
	contending on a global pool (you will need a Global Resource
	Lock for the watermark transfers to and from the global pool,
	and you will need an IPI [Inter Processor Interrupt] mechanism
	so that low resource conditions can override the watermarks;
	typically, there will be little or no inter-CPU contention
	during normal operation).  A sketch of this allocation path
	follows below.
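As a rough illustration of the per-CPU pool idea: each CPU allocates
from a private free list and only touches the global pool (under the
Global Resource Lock) when it crosses its low or high watermark.  This
is a sketch of the shape of the idea, not Dynix or FreeBSD code; the
names (cpu_pool, global_pool, the watermark values) are made up, it
assumes the caller can't migrate CPUs mid-call, and it omits the IPI
path for genuinely low resource conditions.

#include <stddef.h>

#define NCPU		4
#define LOW_WATER	8	/* refill from the global pool below this */
#define HIGH_WATER	64	/* push surplus to the global pool above this */
#define BATCH		16	/* objects moved per global-pool visit */

struct object { struct object *next; };

struct pool {
	struct object	*freelist;
	int		 count;
};

static struct pool	cpu_pool[NCPU];	/* per-CPU; no lock needed */
static struct pool	global_pool;	/* shared; guarded by the global lock */

extern int	curcpu(void);		/* index of the executing CPU */
extern void	global_lock(void);	/* the "Global Resource Lock" */
extern void	global_unlock(void);

static struct object *
pool_get(struct pool *p)
{
	struct object *o = p->freelist;

	if (o != NULL) {
		p->freelist = o->next;
		p->count--;
	}
	return (o);
}

static void
pool_put(struct pool *p, struct object *o)
{
	o->next = p->freelist;
	p->freelist = o;
	p->count++;
}

struct object *
obj_alloc(void)
{
	struct pool *p = &cpu_pool[curcpu()];

	if (p->count < LOW_WATER) {
		/* Below the low watermark: pull a batch from the global pool. */
		global_lock();
		for (int i = 0; i < BATCH; i++) {
			struct object *o = pool_get(&global_pool);

			if (o == NULL)
				break;	/* real code would IPI other CPUs or sleep */
			pool_put(p, o);
		}
		global_unlock();
	}
	return (pool_get(p));		/* common case: CPU-private, no contention */
}

void
obj_free(struct object *o)
{
	struct pool *p = &cpu_pool[curcpu()];

	pool_put(p, o);
	if (p->count > HIGH_WATER) {
		/* Above the high watermark: push a batch back to the global pool. */
		global_lock();
		for (int i = 0; i < BATCH; i++)
			pool_put(&global_pool, pool_get(p));
		global_unlock();
	}
}

In the steady state neither obj_alloc() nor obj_free() ever takes the
global lock, which is the whole point: the common path touches only
CPU-private data, so the CPUs don't contend (or bounce cache lines)
against each other.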
This still leaves the need to have Big Giant Subsystem Locks (tm): you
push locks down through the system call (trap) mechanism to each
subsystem, until you eliminate the contention using object locks (I
don't like object locks) or, preferably, critical section locks.

Object locks are a necessary evil on lists of objects that must be
maintained, but lists of objects are themselves an evil, and contention
on these lists can be drastically reduced by implementing a hierarchical
intention-mode lock manager, breaking the lists into sections, and
pushing the actual objects to be locked down the hierarchy, so that
contention between execution contexts becomes unlikely (statistically,
the more subnodes, the less likely you are to run into contention
between any two objects taken at random).

Mostly, object lists should be done away with.  Cases where this isn't
possible (like the system open file table, the process list, the
directory name lookup cache, etc.) gain an obvious concurrency benefit
from intention-mode ("SIX") style locks.  For the objects themselves,
it's easy to make them an extension of one more level in the hierarchy
(for example, directory vnodes intended to be written).  You maintain an
active transitive closure calculation on the lock vectors from the root
as locks are added; this allows you to do deadlock detection by
traversing from the two nodes in the lock relationship back to the root
of the lock graph.  Otherwise, deadlock detection would be tantamount to
solving the travelling salesman problem.

In general, write locks are necessary, but read locks only need to be
held in cases where there are coherency issues: for example, a read lock
across a directory entry block while it's being copied to a user buffer
(for a "snapshot" of directory block contents, as is provided by the
getdents(2) system call).

The final issue is avoidance of cache-busting behaviour.  This pretty
much boils down to marking lock pages as non-cacheable, and then
assuring CPU affinity for processes (or threads -- in effect, whatever
you're calling your kernel execution contexts) to make sure that the L1
cache has value and that MESI (all Intel SMP is based on the MESI
architecture) stays away from the "I" (Invalidate) portion of the
acronym.

It's a lot of work, but it's not rocket science.  The most complicated
piece is the lock manager; to do it right, you need a five dimensional
vector relationship (an arc describing the intersection of three two
dimensional vectors: the lock list held by a locking entity, the
hierarchical list of lockable objects, and the relationship between
multiple locking entities within a single process -- threads within a
process).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.