From: Peter Wemm
Date: Tue, 24 Dec 1996 14:00:46 +0800
To: Erich Boleyn
cc: smp@freebsd.org, haertel@ichips.intel.com, wscott@ichips.intel.com
Subject: Re: I think we have the culprit!! (was -> Re: Eureka (maybe...) (was -> Re: P6 problem idea ) )
Message-Id: <199612240600.OAA19697@spinner.DIALix.COM>
In-reply-to: Your message of "Mon, 23 Dec 1996 22:21:49 PST."
Sender: owner-smp@freebsd.org

Erich Boleyn wrote:
> Erich Boleyn writes:
> > I tried shutting off the page global stuff, and while I don't have
> > a definitively long run yet, it has run through 3 full kernel compiles
> > with no crash yet.  I'll run it for the next 1 1/2 hours and see if
> > it lives through that.  If so, I think we have our main culprit (I'll
> > also post the (small) code change which synchronizes the CPUs on TLB
> > shootdown before letting the sender continue).
>
> Well, after 2 hours of kernel builds, and now a few sets of 4 parallel
> kernel builds later, the system is still running great.
>
> I think we have our culprit... the Page Global stuff (plus adding the
> TLB shootdown synchronization may be helping a little with stability, but
> its absence doesn't appear to be the major cause).

Hmm.. Interesting...  I have a theory.

On the standard kernel, when cpu_class >= PPro (not Pentium), we set the
PG_G bits.
We also have an invltlb() function call, as well as the page-level
invlpg() and invl2pg() calls.  (invl2pg() just does two invlpg's in a
single function call to lower the function call overhead.)

On the SMP kernel, all three of these functions cause a "global
invalidate" broadcast.  If the initiating cpu is actually trying to
modify a PG_G page, this will screw up, since the per-page invalidate
gets converted to a global invalidate on the other cpus, and hence they
don't flush their PG_G page.

Does that sound like a plausible explanation?

If so, we need to refine the implementation of TLB shootdowns so that
we can initiate a per-page flush as well as a global flush.  This will
require synchronisation, so if you can send your code you can save
some reinvention.. :-)

> > All that said, I'm very surprised that this *isn't* also a serious
> > problem on the Pentium (the Pentium has the Page Global stuff as
> > well... I didn't look to see if it is used for the Pentium as well
> > as the Pentium Pro).
>
> I might be confused here, but as mentioned in the above comment, I
> thought this was implemented in the Pentium as well.  Can someone
> who remembers better (or has the "Appendix H" equivalent released
> documentation) comment?

Don't know about the Pentium, but we definitely don't enable it on
anything smaller than a PPro.

Cheers,
-Peter
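
The theory above can be illustrated with a small userland model.  This
is only a sketch under stated assumptions, not the FreeBSD SMP code: the
TLB is a plain array, there are no real IPIs or synchronisation, and
every name in it (tlb_load, model_invltlb, model_invlpg,
shootdown_as_global, shootdown_per_page, stale_cpus, NCPU, TLB_SLOTS) is
hypothetical.  It assumes the usual P6 behaviour that a full flush via
%cr3 reload leaves PG_G entries in place, while invlpg drops the named
page regardless of PG_G.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NCPU      2   /* hypothetical CPU count for the model */
#define TLB_SLOTS 8   /* hypothetical TLB size per CPU */

struct tlb_entry {
    uintptr_t va;      /* virtual address of the cached page */
    bool      valid;
    bool      global;  /* models the PG_G bit */
};

static struct tlb_entry tlb[NCPU][TLB_SLOTS];

static void tlb_reset(void)
{
    memset(tlb, 0, sizeof(tlb));
}

/* Cache a translation on one CPU; reuses slot 0 if the TLB is full. */
static void tlb_load(int cpu, uintptr_t va, bool global)
{
    assert(cpu >= 0 && cpu < NCPU);
    int slot = 0;
    for (int i = 0; i < TLB_SLOTS; i++) {
        if (!tlb[cpu][i].valid) { slot = i; break; }
    }
    tlb[cpu][slot] = (struct tlb_entry){ .va = va, .valid = true,
                                         .global = global };
}

/* Models a full flush (%cr3 reload): global entries survive. */
static void model_invltlb(int cpu)
{
    for (int i = 0; i < TLB_SLOTS; i++)
        if (!tlb[cpu][i].global)
            tlb[cpu][i].valid = false;
}

/* Models invlpg: drops the one page, whether global or not. */
static void model_invlpg(int cpu, uintptr_t va)
{
    for (int i = 0; i < TLB_SLOTS; i++)
        if (tlb[cpu][i].valid && tlb[cpu][i].va == va)
            tlb[cpu][i].valid = false;
}

/* Behaviour described in the theory: the initiator flushes the page
 * locally, but every other CPU only receives a "global invalidate". */
static void shootdown_as_global(int initiator, uintptr_t va)
{
    model_invlpg(initiator, va);
    for (int cpu = 0; cpu < NCPU; cpu++)
        if (cpu != initiator)
            model_invltlb(cpu);
}

/* Proposed refinement: the per-page request reaches every CPU. */
static void shootdown_per_page(int initiator, uintptr_t va)
{
    (void)initiator;  /* every CPU, initiator included, does invlpg */
    for (int cpu = 0; cpu < NCPU; cpu++)
        model_invlpg(cpu, va);
}

/* Number of CPUs that still hold a translation for va. */
static int stale_cpus(uintptr_t va)
{
    int n = 0;
    for (int cpu = 0; cpu < NCPU; cpu++) {
        for (int i = 0; i < TLB_SLOTS; i++) {
            if (tlb[cpu][i].valid && tlb[cpu][i].va == va) {
                n++;
                break;
            }
        }
    }
    return n;
}
```

In this model, if both CPUs have a PG_G page loaded and CPU 0 calls
shootdown_as_global() for it, stale_cpus() still reports 1, because the
other CPU's full flush skipped the global entry.  A following
shootdown_per_page() brings it to 0.  A non-global page is cleared
everywhere by either variant, which would fit the crashes going away
once the page global stuff is turned off.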