From owner-freebsd-smp  Mon Sep  9 03:54:50 1996
Return-Path: owner-smp
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id DAA14851
          for smp-outgoing; Mon, 9 Sep 1996 03:54:50 -0700 (PDT)
Received: from spinner.DIALix.COM (spinner.DIALix.COM [192.203.228.67])
          by freefall.freebsd.org (8.7.5/8.7.3) with ESMTP id DAA14846
          for <freebsd-smp@freebsd.org>; Mon, 9 Sep 1996 03:54:44 -0700 (PDT)
Received: from spinner.DIALix.COM (localhost.DIALix.oz.au [127.0.0.1])
          by spinner.DIALix.COM (8.7.5/8.7.3) with ESMTP id SAA08111;
          Mon, 9 Sep 1996 18:43:58 +0800 (WST)
Message-Id: <199609091043.SAA08111@spinner.DIALix.COM>
X-Mailer: exmh version 1.6.7 5/3/96
To: rv@groa.uct.ac.za (Russell Vincent)
cc: freebsd-smp@freebsd.org
Subject: Re: Intel XXpress - some SMP benchmarks 
In-reply-to: Your message of "Sat, 09 Sep 1996 11:25:31 +0200."
             <m0v02ax-0004vqC@groa.uct.ac.za> 
Date: Mon, 09 Sep 1996 18:43:58 +0800
From: Peter Wemm <peter@spinner.dialix.com>
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Russell Vincent wrote:
> 'lmbench 1.0' results for:

Ahem.. Enough said. :-)  But regardless of the accuracy issue, it 
certainly gives an indication of the various bottlenecks.

>  o Option (3), although not that good in the benchmarks, certainly
>    appears faster in interactive use. That could just be my imagination,
>    though.  :-)

Several things to consider:
 - the second cpu is never pre-empted while running.  This is bad 
(obviously :-) since a process that does a while(1); till run on the cpu 
forever unless it gets killed or paged.  And on that note, we don't make 
any allowance for the page tables being changed while one cpu is in user 
mode.  (we flush during the context switch, but that doesn't help if a 
page is stolen).  I've been trying to decipher some of the more obscure 
parts of the apic docs, and it appears that we can sort-of simulate a 
round-robin approach on certain interrupts without too much reliability, 
but it's better than nothing I think.  (I have in mind setting all the cpu 
"priorities" the same, and let the apic's use their internal tie-breaking 
weighting.  I've not read enough on it yet, but I think it's possible...)

 - the smp_idleloop is currently killing the performance when one process 
is running, because the idleloop is constantly bouncing back and forwards 
between the two idle procs.  ie: _whichidqs is always true, so it's 
constantly locking, and unlocking causing extreme congestion on that lock. 
 There has got to be a better way to do the locking (I have ideas).  When 
one process leaves kernel mode, it's got a fight on it's hands to get back 
in. It's got to try and get the MESI cache line in a favourable state so 
that it can try a lock.  I'm suprised this hasn't turned up before now 
that I think about it.  I would expect the system would not do too well 
under heavy paging load... :-(

 - several major subsystems run a fair bit of code without spl protection 
(I'm thinking of VFS and VM).  If we could ever figure out how to clean 
the trap/exception/interrupt handling up enough to cleanly enter and exit 
a "locked" state, we could probably do wonders like having some parts of 
the kernel reentrant on both cpus.  Unfortunately, the trap code is 
extremely optimised for the single-processor case (and I do mean extreme.. 
:-), and is quite difficult to follow.  We had to introduce reference 
counting on the kernel mutex lock some time ago simply because parts of 
the kernel are reentered via the trap code from within the kernel.  A 
rethink needs to happen here to figure out how we can cut downt he locking 
overheads without penalising the uniprocessor case much.  That may mean 
having a seperate lock for the trap layer and the kernel, where only one 
cpu can be within the trap layer (with a simple, non-stacking lock), and 
the "kernel proper" lock is reference counted.  The "kernel proper" lock 
could probably then have the vfs and perhaps vm split off into seperate 
locks or locking strategies.  (and if somebody starts spouting jargon from 
his graph-theory book, that I for one don't understand a word of, I'll 
scream. :-)

- "less debug code"..  Have you looked very closely at the implications of 
your chipset bios settings?  Is it possible that some of the speedups are 
deferring cpu cache writebacks too long and one cpu is getting data from 
RAM that has just been entered into the other chipset's "write buffer"?  
(ie: cache thinks it's been written back, but it's not in RAM yet, so the 
MESI protocol is defeated?  I have no idea if this is possible or not.. 
just a wild guess.  if "lock cmpxchg" is truely atomic, then the problem 
you see should not be happening...  I presume you have tried the 
motherboard on "maximum pessimitic settings"?

Anyway, I've got a deadline in a few hours, I've already spent way too 
long on this.. :-]

Cheers,
-Peter