From owner-freebsd-smp Sat Nov 23 18:26:44 1996
Return-Path: owner-smp
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id SAA23669 for smp-outgoing; Sat, 23 Nov 1996 18:26:44 -0800 (PST)
Received: from spinner.DIALix.COM (root@spinner.DIALix.COM [192.203.228.67]) by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id SAA23659 for ; Sat, 23 Nov 1996 18:26:38 -0800 (PST)
Received: from spinner.DIALix.COM (peter@localhost.DIALix.oz.au [127.0.0.1]) by spinner.DIALix.COM (8.8.3/8.8.3) with ESMTP id KAA24610; Sun, 24 Nov 1996 10:26:09 +0800 (WST)
Message-Id: <199611240226.KAA24610@spinner.DIALix.COM>
X-Mailer: exmh version 1.6.9 8/22/96
To: Steve Passe
cc: dg@Root.COM, freebsd-smp@freefall.freebsd.org
Subject: Re: SMP -current merge
In-reply-to: Your message of "Sat, 23 Nov 1996 16:40:05 MST." <199611232340.QAA20297@clem.systemsix.com>
Date: Sun, 24 Nov 1996 10:26:08 +0800
From: Peter Wemm
Sender: owner-smp@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

Steve Passe wrote:
> Hi,
>
> >    "man idprio"
> >
> >    idprio is a complement to rtprio.  It was an idea I had back when the
> > rtprio code was first submitted.  I've always hated "nice" because it was
> > never possible to say "run this process _only_ if nothing else is runnable".
>
> Ok, so something does need to change.  As most of us probably never run
> anything at idleprio, the patch I submitted is better than nothing.
>
> Possibilities:
>
>  a 4th queue that is just for cpuidle[ NCPU ] procs.
>
>  ???

No, it needs to be done correctly.  This means that we need to have the
CPUs call mi_switch()/cpu_switch(), and if no runnable jobs are available
they need to halt or busy-spin.  Halting is preferable, but that needs
IPIs fully working, with handlers to deal with them, etc.  The idle
processes will not exist.

Presently, the idle "process" is implemented in the normal kernel via a
partial process context; its PTD is IdlePTD.  It doesn't have a struct
proc, but it does have its own user area and stack.  This is all
hard-coded into the VM system's bootup.  To do things "right", we need to
be able to make NCPU of these (or better still, one for each "found" and
"enabled" cpu, since it's 12KB of kernel memory per idle proc).

BTW, as a general goal, we want to have as few references to NCPU (or
MAXCPU, as it would more correctly be called) as possible.  It would be
just great to not need it at all, but I suspect there will be some cases
where it'd be required for the few pointers to the per-cpu space we'd
need.

To "fix" this, I see something like this being the solution...  Instead of
having globals replaced with arrays, e.g. "struct proc *SMPcurproc[NCPU];"
and "#define curproc (SMPcurproc[cpunumber()])", we need to move all these
per-cpu global variables into a separate VM page.  So, it'll become
"extern struct proc *curproc", and whichever cpu is running will see its
own version.

To pull this off, we need a private per-cpu entry in the top-level page
directory, pointing to a mostly empty 4K page table (have I got the
terminology the right way around?).  In the per-cpu 4K page map, there is
one mapping to the private "data" page, one mapping for the local APIC,
and one for the IO APIC(s).  And here's the cute part: we can store the
cpu number in the per-cpu data page, eliminating the need to read the
APIC_ID register over and over again, and all the shifting/masking and
table lookups.
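To make that a bit more concrete, here's roughly how the two approaches
compare.  This is only a hand-waving sketch: percpu_data, pd_curproc,
pd_cpuno and PERCPU_VA are names I've just made up for illustration, not
anything that exists in the tree.

    /* Sketch only; NCPU and PERCPU_VA are assumed to be defined elsewhere. */
    struct proc;                            /* normally from <sys/proc.h> */

    #ifdef ARRAY_WAY
    /* What we'd have to do everywhere today: each per-cpu global becomes
     * an array, indexed on every single reference. */
    extern struct proc *SMPcurproc[NCPU];
    #define curproc         (SMPcurproc[cpunumber()])  /* cpunumber() reads APIC_ID */
    #else
    /* Proposed: one private data page mapped at the same virtual address
     * (PERCPU_VA) on every cpu via the per-cpu page directory entry.  Each
     * cpu sees its own copy, so the old names keep working with no
     * indexing at all. */
    struct percpu_data {
            struct proc     *pd_curproc;    /* this cpu's current process */
            int              pd_cpuno;      /* cached when the cpu comes online */
            /* ... other formerly-global per-cpu variables ... */
    };
    #define percpu          ((struct percpu_data *)PERCPU_VA)
    #define curproc         (percpu->pd_curproc)
    #define cpunumber()     (percpu->pd_cpuno)
    #endif

The nice property is that something like "curproc->p_pid" compiles back
down to a plain memory reference, just as it does in the UP kernel.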
When cpu_switch() changes the process context, it simply copies the
per-cpu PTDE from the old process context into the new process's PTD, so
the private pages stay with the current cpu... hence there is never any
need to read the APIC_ID after going online.

I have other crazy ideas about certain problems too.. :-)

Re: the scheduling problems, and trying to avoid processes bouncing
between cpus, which is critically important for P6s with no shared L3
cache...

Presently, there is a routine called schedule() which is run 10 times/sec,
or sooner if it's needed.  This builds the 32 run queues, handles
priorities, niceness, etc.  The mi_switch()/cpu_switch() routines simply
pick the first runnable job from the highest run queue.  This is nice and
quick, but it means the priorities are adjusted every 1/10 second or so
rather than at every context switch.

We can use this to our advantage in the SMP case by having a separate set
of 32 RT run queues for each cpu.  A cpu can still call a global
reschedule to adjust the contents of the run queues, and they are redone
every hz/10 ticks as they are now.  This means that the schedule() routine
can take the "last cpu" of the process into consideration when assigning
jobs amongst the cpus.

I'm sure there are a zillion more effective ways of doing this (and I'll
bet Terry will point out at least one. :-), but this is "easy" and will
reduce a good deal of the problem for a small number of cpus, which is a
good short-term goal.  Once we have a reentrant kernel and can
realistically support more than 4 or 8 cpus without them all bogging down
in spin locks, then it would be worth doing something better.

Cheers,
-Peter
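PS: here is very roughly what I mean by a per-cpu set of run queues.  All
of the names (runq_cpu[], p_lastcpu, p_procq, and both helper functions)
are invented for the example, the queue count ignores the separate rt/idle
classes, and I'm glossing over locking entirely.

    #include <sys/param.h>
    #include <sys/queue.h>
    #include <sys/proc.h>   /* assumes struct proc grows p_lastcpu and a p_procq TAILQ_ENTRY */

    #define NQS     32                              /* 32 run queues, as now */

    struct runq {
            TAILQ_HEAD(, proc) rq_queues[NQS];      /* one set of 32 queues... */
            u_int              rq_bits;             /* ...plus a "non-empty" bitmask */
    } runq_cpu[NCPU];                               /* ...per cpu */

    /*
     * schedule() side: when (re)queueing a runnable process every hz/10
     * ticks, prefer the queues of the cpu it last ran on so it tends to
     * stay there.
     */
    void
    setrunqueue_pcpu(struct proc *p)
    {
            struct runq *rq = &runq_cpu[p->p_lastcpu];
            int q = p->p_priority >> 2;             /* same priority->queue mapping as now */

            TAILQ_INSERT_TAIL(&rq->rq_queues[q], p, p_procq);
            rq->rq_bits |= 1 << q;
    }

    /*
     * cpu_switch() side: take the first process off this cpu's highest
     * non-empty queue (lowest set bit == highest priority).  If there is
     * nothing local, return NULL and halt (or spin) until an IPI wakes us.
     */
    struct proc *
    chooseproc_pcpu(int mycpu)
    {
            struct runq *rq = &runq_cpu[mycpu];
            struct proc *p;
            int q;

            if (rq->rq_bits == 0)
                    return (NULL);
            q = ffs(rq->rq_bits) - 1;
            p = TAILQ_FIRST(&rq->rq_queues[q]);
            TAILQ_REMOVE(&rq->rq_queues[q], p, p_procq);
            if (TAILQ_EMPTY(&rq->rq_queues[q]))
                    rq->rq_bits &= ~(1 << q);
            return (p);
    }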