From owner-freebsd-smp Sat Nov 23 18:26:44 1996
Return-Path: owner-smp
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id SAA23669 for smp-outgoing; Sat, 23 Nov 1996 18:26:44 -0800 (PST)
Received: from spinner.DIALix.COM (root@spinner.DIALix.COM [192.203.228.67]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id SAA23659 for ; Sat, 23 Nov 1996 18:26:38 -0800 (PST)
Received: from spinner.DIALix.COM (peter@localhost.DIALix.oz.au [127.0.0.1]) by spinner.DIALix.COM (8.8.3/8.8.3) with ESMTP id KAA24610; Sun, 24 Nov 1996 10:26:09 +0800 (WST)
Message-Id: <199611240226.KAA24610@spinner.DIALix.COM>
X-Mailer: exmh version 1.6.9 8/22/96
To: Steve Passe
cc: dg@Root.COM, freebsd-smp@freefall.freebsd.org
Subject: Re: SMP -current merge
In-reply-to: Your message of "Sat, 23 Nov 1996 16:40:05 MST." <199611232340.QAA20297@clem.systemsix.com>
Date: Sun, 24 Nov 1996 10:26:08 +0800
From: Peter Wemm
Sender: owner-smp@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

Steve Passe wrote:
> Hi,
>
> >    "man idprio"
> >
> >    idprio is a complement to rtprio.  It was an idea I had back when the
> > rtprio code was first submitted.  I've always hated "nice" because it was
> > never possible to say "run this process _only_ if nothing else is runnable".
>
> Ok, so something does need to change.  As most of us probably never run
> anything at idleprio, the patch I submitted is better than nothing.
>
> Possibilities:
>
>  a 4th queue that is just for cpuidle[ NCPU ] procs.
>
>  ???

No, it needs to be done correctly.  This means that we need to have the
CPUs call mi_switch()/cpu_switch(), and if no runnable jobs are available
they need to halt or busy-spin.  Halting is preferable, but that needs
IPIs fully working, with handlers to deal with them, etc.  The idle
processes will not exist.

Presently, the idle "process" is implemented in the normal kernel via a
partial process context; its PTD is IdlePTD.  It doesn't have a struct
proc, but it does have its own user area and stack.  This is all
hard-coded into the VM system's bootup.  To do things "right", we need to
be able to make NCPU of these (or better still, one for each "found" and
"enabled" cpu, since it's 12KB of kernel memory per idle proc).

BTW, as a general goal, we want to have as few references to NCPU (or
MAXCPU, as it would more correctly be called) as possible.  It would be
just great to not need it at all, but I suspect there will be some cases
where it'd be required for the few pointers to the per-cpu space we'd
need.

To "fix" this, I see something like this being the solution...  Instead of
having globals replaced with arrays, e.g. "struct proc *SMPcurproc[NCPU];"
and "#define curproc (SMPcurproc[cpunumber()])", we need to move all these
per-cpu global variables into a separate VM page.  So, it'll become
"extern struct proc *curproc", and whichever cpu is running will see its
own version.

To pull this off, we need a private per-cpu entry in the top-level page
directory, pointing to a mostly empty 4K page table (have I got the
terminology the right way around?).  In the per-cpu 4K page map, there is
one mapping to the private "data" page, one mapping for the local APIC,
and one for the IO APIC(s).  And here's the cute part: we can store the
cpu number in the per-cpu data page, eliminating the need to read the
APIC_ID register over and over again, and all the shifting/masking and
table lookups.
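To make that a bit more concrete, here's roughly how the two approaches
compare.  This is only a hand-waving sketch: percpu_data, pd_curproc,
pd_cpuno and PERCPU_VA are names I've just made up for illustration, not
anything that exists in the tree.

    /* Sketch only; NCPU and PERCPU_VA are assumed to be defined elsewhere. */
    struct proc;                            /* normally from <sys/proc.h> */

    #ifdef ARRAY_WAY
    /* What we'd have to do everywhere today: each per-cpu global becomes
     * an array, indexed on every single reference. */
    extern struct proc *SMPcurproc[NCPU];
    #define curproc         (SMPcurproc[cpunumber()])  /* cpunumber() reads APIC_ID */
    #else
    /* Proposed: one private data page mapped at the same virtual address
     * (PERCPU_VA) on every cpu via the per-cpu page directory entry.  Each
     * cpu sees its own copy, so the old names keep working with no
     * indexing at all. */
    struct percpu_data {
            struct proc     *pd_curproc;    /* this cpu's current process */
            int              pd_cpuno;      /* cached when the cpu comes online */
            /* ... other formerly-global per-cpu variables ... */
    };
    #define percpu          ((struct percpu_data *)PERCPU_VA)
    #define curproc         (percpu->pd_curproc)
    #define cpunumber()     (percpu->pd_cpuno)
    #endif

The nice property is that something like "curproc->p_pid" compiles back
down to a plain memory reference, just as it does in the UP kernel.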
When cpu_switch() changes the process context, it simply copies the
per-cpu PTDE from the old process context into the new process's PTD, so
the private pages stay with the current cpu... hence there is never any
need to read the APIC_ID after going online.

I have other crazy ideas about certain problems too.. :-)

Re: the scheduling problems, and trying to avoid processes bouncing
between cpus, which is critically important for P6s with no shared L3
cache...

Presently, there is a routine called schedule() which is run 10 times/sec,
or sooner if it's needed.  This builds the 32 run queues, handles
priorities, niceness, etc.  The mi_switch()/cpu_switch() routines simply
pick the first runnable job from the highest run queue.  This is nice and
quick, but it means the priorities are adjusted every 1/10 second or so
rather than at every context switch.

We can use this to our advantage in the SMP case by having a separate set
of 32 RT run queues for each cpu.  A cpu can still call a global
reschedule to adjust the contents of the run queues, and they are redone
every hz/10 ticks as they are now.  This means that the schedule() routine
can take the "last cpu" of the process into consideration when assigning
jobs amongst the cpus.

I'm sure there are a zillion more effective ways of doing this (and I'll
bet Terry will point out at least one. :-), but this is "easy" and will
reduce a good deal of the problem for a small number of cpus, which is a
good short-term goal.  Once we have a reentrant kernel and can
realistically support more than 4 or 8 cpus without them all bogging down
in spin locks, then it would be worth doing something better.

Cheers,
-Peter
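PS: here is very roughly what I mean by a per-cpu set of run queues.  All
of the names (runq_cpu[], p_lastcpu, p_procq, and both helper functions)
are invented for the example, the queue count ignores the separate rt/idle
classes, and I'm glossing over locking entirely.

    #include <sys/param.h>
    #include <sys/queue.h>
    #include <sys/proc.h>   /* assumes struct proc grows p_lastcpu and a p_procq TAILQ_ENTRY */

    #define NQS     32                              /* 32 run queues, as now */

    struct runq {
            TAILQ_HEAD(, proc) rq_queues[NQS];      /* one set of 32 queues... */
            u_int              rq_bits;             /* ...plus a "non-empty" bitmask */
    } runq_cpu[NCPU];                               /* ...per cpu */

    /*
     * schedule() side: when (re)queueing a runnable process every hz/10
     * ticks, prefer the queues of the cpu it last ran on so it tends to
     * stay there.
     */
    void
    setrunqueue_pcpu(struct proc *p)
    {
            struct runq *rq = &runq_cpu[p->p_lastcpu];
            int q = p->p_priority >> 2;             /* same priority->queue mapping as now */

            TAILQ_INSERT_TAIL(&rq->rq_queues[q], p, p_procq);
            rq->rq_bits |= 1 << q;
    }

    /*
     * cpu_switch() side: take the first process off this cpu's highest
     * non-empty queue (lowest set bit == highest priority).  If there is
     * nothing local, return NULL and halt (or spin) until an IPI wakes us.
     */
    struct proc *
    chooseproc_pcpu(int mycpu)
    {
            struct runq *rq = &runq_cpu[mycpu];
            struct proc *p;
            int q;

            if (rq->rq_bits == 0)
                    return (NULL);
            q = ffs(rq->rq_bits) - 1;
            p = TAILQ_FIRST(&rq->rq_queues[q]);
            TAILQ_REMOVE(&rq->rq_queues[q], p, p_procq);
            if (TAILQ_EMPTY(&rq->rq_queues[q]))
                    rq->rq_bits &= ~(1 << q);
            return (p);
    }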