From owner-freebsd-smp  Mon Dec 23 22:39:54 1996
Return-Path: <owner-smp>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.4/8.8.4) id WAA11434
          for smp-outgoing; Mon, 23 Dec 1996 22:39:54 -0800 (PST)
Received: from spinner.DIALix.COM (root@spinner.DIALix.COM [192.203.228.67])
          by freefall.freebsd.org (8.8.4/8.8.4) with ESMTP id WAA11429
          for <smp@freebsd.org>; Mon, 23 Dec 1996 22:39:50 -0800 (PST)
Received: from spinner.DIALix.COM (peter@localhost.DIALix.oz.au [127.0.0.1])
          by spinner.DIALix.COM (8.8.4/8.8.4) with ESMTP id OAA19697;
          Tue, 24 Dec 1996 14:00:47 +0800 (WST)
Message-Id: <199612240600.OAA19697@spinner.DIALix.COM>
X-Mailer: exmh version 1.6.9 8/22/96
To: Erich Boleyn <erich@uruk.org>
cc: smp@freebsd.org, haertel@ichips.intel.com, wscott@ichips.intel.com
Subject: Re: I think we have the culprit!! (was -> Re: Eureka (maybe...) 
 (was -> Re: P6 problem idea ) )
In-reply-to: Your message of "Mon, 23 Dec 1996 22:21:49 PST."
             <E0vcQFJ-0003Y6-00@uruk.org> 
Date: Tue, 24 Dec 1996 14:00:46 +0800
From: Peter Wemm <peter@spinner.dialix.com>
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Erich Boleyn wrote:
> Erich Boleyn <erich@uruk.org> writes:
> > I tried shutting off the page global stuff, and while I don't have
> > a definitively long run yet, it has run through 3 full kernel compiles
> > with no crash yet.  I'll run it for the next 1 1/2 hours and see if
> > it lives through that.  If so, I think we have our main culprit (I'll
> > also post the (small) code change which synchronizes the CPUs on TLB
> > shootdown before letting the sender continue).
> 
> Well, after 2 hours of kernel builds, and now a few sets of 4 parallel
> kernel builds later, the system is still running great.
> 
> I think we have our culprit...  the Page Global stuff (plus adding the
> TLB shootdown synchronization may be helping a little with stability, but
> its absence doesn't appear to be the major cause).

Hmm.. Interesting...

I have a theory.

On the standard kernel when on a cpu_class >= PPro (not Pentium) we set 
the PG_G bits.  We also have an invltlb() function call as well as the 
page level invlpg() and invl2pg() calls.  (invl2pg just does two invlpg's 
in a single function call to lower the function call overheads).

On the SMP kernel, all three of these functions cause a "global 
invalidate" broadcast.  If the initiating cpu is actually trying to modify 
a PG_G page, this will screw up: the per-page invalidate gets converted 
to a global invalidate on the other cpus, and since a global invalidate 
(a %cr3 reload) skips PG_G entries, they never flush their stale PG_G page.

Does that sound like a plausible explanation?
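
To make the theory concrete, here is a toy model of one cpu's TLB.  It is 
only a sketch: toy_tlb, tlb_fill(), toy_invlpg(), toy_invltlb() and 
tlb_lookup() are made-up names for illustration, not kernel code.  The 
assumption it encodes is that a %cr3 reload leaves global entries in place 
while invlpg drops the named page whether or not it is global.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TLB_SLOTS 8

/* One cached translation in a toy per-CPU TLB (illustration only). */
struct tlb_entry {
    bool          valid;
    bool          global;   /* models the PG_G bit */
    unsigned long vpage;    /* virtual page number */
    unsigned long frame;    /* physical frame the entry points at */
};

struct toy_tlb {
    struct tlb_entry slot[TLB_SLOTS];
};

/* Cache a translation in the first free slot. */
static void tlb_fill(struct toy_tlb *t, unsigned long vpage,
                     unsigned long frame, bool global)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (!t->slot[i].valid) {
            t->slot[i] = (struct tlb_entry){ true, global, vpage, frame };
            return;
        }
    }
    assert(!"toy TLB is full");
}

/* Models invlpg: drops the entry for one page, global or not. */
static void toy_invlpg(struct toy_tlb *t, unsigned long vpage)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (t->slot[i].valid && t->slot[i].vpage == vpage)
            t->slot[i].valid = false;
    }
}

/* Models invltlb via a %cr3 reload: PG_G entries survive. */
static void toy_invltlb(struct toy_tlb *t)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (!t->slot[i].global)
            t->slot[i].valid = false;
    }
}

/* Returns true and stores the cached frame if vpage is still in the TLB. */
static bool tlb_lookup(const struct toy_tlb *t, unsigned long vpage,
                       unsigned long *frame)
{
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (t->slot[i].valid && t->slot[i].vpage == vpage) {
            *frame = t->slot[i].frame;
            return true;
        }
    }
    return false;
}
```

If the remote cpu holds a global entry and only ever receives toy_invltlb() 
in response to a per-page request, tlb_lookup() still returns the old frame 
afterwards, which is the failure the theory predicts.  Only toy_invlpg() 
on the remote cpu removes it.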

If so, we need to refine the implementation of TLB shootdowns more so that 
we can initiate a per-page flush as well as a global flush..  This will 
require synchronisation, so if you can send your code you can save some 
reinvention.. :-)
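
Roughly the shape I have in mind, as a sketch only.  This is not Erich's 
code and not what is in the tree; shootdown_req, toy_cpu and the three 
functions are hypothetical names, and the IPI itself is left out.  The 
request carries either a single address or "flush everything", and the 
sender waits until every other cpu has acknowledged.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum flush_kind { FLUSH_PAGE, FLUSH_ALL };

/* A shootdown request shared between the sender and the other cpus. */
struct shootdown_req {
    enum flush_kind kind;
    unsigned long   vaddr;    /* only meaningful for FLUSH_PAGE */
    atomic_int      pending;  /* cpus that have not yet acknowledged */
};

/* Stand-in for a remote cpu: it just records what it was asked to do. */
struct toy_cpu {
    int           page_flushes;
    int           full_flushes;
    unsigned long last_vaddr;
};

/* Sender: publish the request; real code would send the IPI after this. */
static void shootdown_post(struct shootdown_req *r, enum flush_kind kind,
                           unsigned long vaddr, int other_cpus)
{
    r->kind = kind;
    r->vaddr = vaddr;
    atomic_store(&r->pending, other_cpus);
}

/* Receiver: do the requested flush, then acknowledge. */
static void shootdown_service(struct shootdown_req *r, struct toy_cpu *cpu)
{
    if (r->kind == FLUSH_PAGE) {
        cpu->page_flushes++;          /* real code: invlpg(r->vaddr) */
        cpu->last_vaddr = r->vaddr;
    } else {
        cpu->full_flushes++;          /* real code: invltlb() */
    }
    atomic_fetch_sub(&r->pending, 1);
}

/* Sender: true once every other cpu has acknowledged the request. */
static bool shootdown_complete(struct shootdown_req *r)
{
    return atomic_load(&r->pending) == 0;
}
```

The sender would spin on shootdown_complete() before continuing, which is 
the synchronisation part.  Keeping the kind in the request is what stops a 
per-page flush from being widened into a global one on the receiving side.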

> > All that said, I'm very surprised that this *isn't* also a serious
> > problem on the Pentium (the Pentium has the Page Global stuff as
> > well...  I didn't look to see if it is used for the Pentium as well
> > as the Pentium Pro).
> 
> I might be confused here, but as mentioned in the above comment, I
> thought this was implemented in the Pentium as well.  Can someone
> who remembers better (or has the "Appendix H" equivalent released
> documentation) comment?

Don't know about the Pentium, but we definitely don't enable it on 
anything smaller than a PPro.

Cheers,
-Peter