From owner-freebsd-smp Sat Apr  5 09:05:27 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.5/8.8.5) id JAA16776 for smp-outgoing; Sat, 5 Apr 1997 09:05:27 -0800 (PST)
Received: from spinner.DIALix.COM (root@spinner.dialix.com [192.203.228.67]) by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id JAA16767 for ; Sat, 5 Apr 1997 09:05:17 -0800 (PST)
Received: from spinner.DIALix.COM (peter@localhost.DIALix.oz.au [127.0.0.1]) by spinner.DIALix.COM (8.8.5/8.8.5) with ESMTP id BAA18422; Sun, 6 Apr 1997 01:04:44 +0800 (WST)
Message-Id: <199704051704.BAA18422@spinner.DIALix.COM>
X-Mailer: exmh version 2.0gamma 1/27/96
To: cr@jcmax.com (Cyrus Rahman)
cc: smp@freebsd.org
Subject: Re: Questions about mp_lock
In-reply-to: Your message of "Sat, 05 Apr 1997 11:17:25 EST." <9704051617.AA05092@corona.jcmax.com>
Date: Sun, 06 Apr 1997 01:04:44 +0800
From: Peter Wemm
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Cyrus Rahman wrote:
> Could someone who had a hand in implementing the SMP kernel give me a hint
> about why the mp_lock count gets stored in the proc/user structure and
> switched out in cpu_switch()?
>
> Seems kind of weird, since I would expect that a process getting switched in
> or out would always possess exactly one lock, and that any others would be
> the result of interrupts.  But it does appear that something more complicated
> is going on, and I can't exactly figure out what it is.

The main problem is that the kernel can be recursively entered while the flow
of execution is still "in the kernel".  One interrupt can interrupt another's
handler, and a process can take a page fault while doing a copyin, causing the
kernel to be reentered via the trap handlers and end up in the vm system.  The
catch is that when the kernel takes a page fault on a process's behalf, the
odds are that the process is going to sleep while waiting for a block to be
read from disk, etc.

When we context switch, the kernel stack goes with the process.  If we
switched from a context that's three levels deep to one that's only two deep,
we'd return to user mode while still holding the kernel lock; if we switched
from a 2-deep to a 3-deep context, the last part of the unwind in the new
context would run in the kernel without the lock, and the other cpu could
enter the kernel.  So, we switch the nest count with the process.  It's far
from ideal, but it works reasonably well on two cpus.

However, there's plenty of scope for improvement.  Moving the kernel locking
up a layer and having a separate entry/exit lock in the trap/syscall/interrupt
area would be a major win without too much cost.  What we'd gain is that we
could then gradually move to a per-subsystem locking scheme, perhaps based
initially on the syscall or trap type.  It'd be quite possible to have one cpu
in the kernel doing IP checksumming on a packet, another in the vfs system
somewhere, another doing copy-on-write page copies in the vm system, and so
on.  Things like getpid() would need no locking whatsoever.  But that's for
later, once the basics are working.

> Cyrus

Cheers,
-Peter
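
A minimal C sketch of the nest-count handoff described above, assuming a
recursive spinlock as the single kernel lock.  The names here (mp_lock,
mp_nest, p_mp_nest, switch_out/switch_in) are illustrative only and do not
correspond to the actual cpu_switch() assembly:

    /*
     * Sketch: per-process MP-lock nest counting across a context switch.
     * Names are hypothetical; this is not the real FreeBSD-SMP code.
     */
    #include <stdatomic.h>
    #include <stdio.h>

    struct proc {
        int p_mp_nest;          /* saved nesting depth while not running */
    };

    static atomic_flag mp_lock = ATOMIC_FLAG_INIT; /* the one kernel lock */
    static int mp_nest;                            /* depth on this cpu */

    static void get_mplock(void)
    {
        if (mp_nest++ == 0)                    /* only spin on first entry */
            while (atomic_flag_test_and_set(&mp_lock))
                ;                              /* another cpu is in the kernel */
    }

    static void rel_mplock(void)
    {
        if (--mp_nest == 0)                    /* last exit releases the lock */
            atomic_flag_clear(&mp_lock);
    }

    /* What the context switch conceptually does with the count. */
    static void switch_out(struct proc *oldp)
    {
        oldp->p_mp_nest = mp_nest;             /* remember how deep we were */
        mp_nest = 0;
        atomic_flag_clear(&mp_lock);           /* let the other cpu in */
    }

    static void switch_in(struct proc *newp)
    {
        while (atomic_flag_test_and_set(&mp_lock))
            ;                                  /* take the kernel lock back */
        mp_nest = newp->p_mp_nest;             /* resume at the saved depth */
    }

    int main(void)
    {
        struct proc p = { 0 };

        get_mplock();           /* syscall entry */
        get_mplock();           /* nested trap, e.g. page fault in copyin */
        switch_out(&p);         /* sleep waiting for disk: depth 2 saved */
        switch_in(&p);          /* later, resume with depth 2 restored */
        rel_mplock();
        rel_mplock();
        printf("nest after unwind: %d\n", mp_nest);   /* prints 0 */
        return 0;
    }

Without the save/restore in switch_out()/switch_in(), the per-cpu depth would
belong to whichever process happens to be running, and the mismatch between a
2-deep and a 3-deep context is exactly what produces the two failure modes
described above.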