From owner-freebsd-smp Tue May 6 07:35:36 1997
Return-Path:
Received: (from root@localhost) by hub.freebsd.org (8.8.5/8.8.5) id HAA13785 for smp-outgoing; Tue, 6 May 1997 07:35:36 -0700 (PDT)
Received: from pluto.plutotech.com (root@pluto100.plutotech.com [206.168.67.137]) by hub.freebsd.org (8.8.5/8.8.5) with ESMTP id HAA13780 for ; Tue, 6 May 1997 07:35:34 -0700 (PDT)
Received: from narnia.plutotech.com (narnia.plutotech.com [206.168.67.130]) by pluto.plutotech.com (8.8.5/8.8.3) with ESMTP id IAA27572 for ; Tue, 6 May 1997 08:35:34 -0600 (MDT)
Message-Id: <199705061435.IAA27572@pluto.plutotech.com>
To: smp@FreeBSD.org
Subject: FYI
Date: Tue, 06 May 1997 09:34:06 -0600
From: "Justin T. Gibbs"
Sender: owner-smp@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

------- Forwarded Message

To: tech-kern@NetBSD.ORG
Reply-To: tech-kern@NetBSD.ORG
Subject: Fwd: Pentium Pro architecture (synchronization/speculative loads) bits
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 06 May 1997 01:24:24 -0700
From: Greg Earle
Sender: tech-kern-owner@NetBSD.ORG
Precedence: list
Delivered-To: tech-kern@NetBSD.ORG
X-UIDL: 39d6891e0396119075dd0fcb748174e9

Hi folks,

Not sure who would be most interested in this, but I saw this fly by on the Plan 9 mailing list. I suspect that most of the info that could possibly be relevant to us would be of interest to anyone doing compiler work (do we do any customization/knob-twiddling to gcc ourselves for our platforms?) but at any rate, it might be worth filing away for future use nonetheless.

- Greg

P.S.
You can find these bits in DejaNews using a search string of "Haertel & ~g (comp.os.plan9)"

- ------- Forwarded Message

From: presotto@plan9.bell-labs.com
Message-Id: <199704221935.PAA02595@cse.psu.edu>
To: 9fans@cse.psu.edu
Date: Tue, 22 Apr 1997 15:35:29 -0400
Subject: How the Pentium Pro really works
Sender: owner-9fans@cse.psu.edu
Reply-To: 9fans@cse.psu.edu
Precedence: bulk

Here's 3 messages from Mike Haertel describing how the Pentium Pro works vis a vis synchronization. As he forcefully points out, the problem isn't speculative loads, it's just queued stores. As expected, surrounding shared accesses with spin locks is sufficient. Only iffy operations like our current version of sleep/wakeup have to be more carefully handled.

An interesting point is that the same model exists on the Pentium. However, the shorter pipelines and buffers in the Pentium are less likely to exacerbate the problem. We were just lucky.

===================================

To: research.bell-labs.com!presotto
Subject: Pentium Pro and coherence
Date: Tue, 22 Apr 1997 01:15:56 -0700
From: Mike Haertel

In article <199704211614.MAA02731@cse.psu.edu>, you wrote:
> The Pro people have remained silent
> on the subject (we've sent email).

Hi, I am an architect at Intel. Who did you send email to? I'm surprised you got no response. In any case, perhaps I can clarify things a little.

> Of course, I could be totally wrong about the speculative reads and
> it may be the interlock instruction on the writer and not the
> reader that causes the processors to become coherent.

The caches are always coherent using an MESI protocol. The real problem is that not all written data in the system is in the cache(s).

The Pentium Pro's memory ordering model is called "processor ordering" and is a formalization of the 486's semantics. The 486 had a write-through cache with write queue to memory which was not snooped by loads on other processors.
Loosely speaking, this means the ordering of events originating from any one processor in the system, as observed by other processors, is always the same. However, different observers are allowed to disagree on the interleaving of events from two or more processors.

The PPro does speculative and out-of-order loads. However, it has a mechanism called the "memory order buffer" to ensure that the above memory ordering model is not violated. Load and store instructions do not get retired until the processor can prove there are no memory ordering violations in the actual order of execution that was used. Stores do not get sent to memory until they are ready to be retired. If the processor detects a memory ordering violation, it discards all unretired operations (including the offending memory operation) and restarts execution at the oldest unretired instruction.

i.e. when a violation is detected the MOB whacks the machine ... :-)

For example, consider the following sequence:

    P1: load (1000) -> reg        P2: store 10 -> (1000)
        load (1000) -> reg            store 20 -> (1000)

Suppose on P1, the 2nd load speculatively executes first (for whatever reason), and picks up 10 (the result of the first store on P2). Later, P2 executes the 2nd store (causing the cached copy of location 1000 on P1 to be invalidated), and finally P1 executes the 1st load. At this point, P1 discovers that a younger load has already read from the same location, and that the location was subsequently invalidated by P2. P1 says "a-ha! that violates the memory ordering model!", clobbers the speculative state of the machine from the offending instruction (the 1st load) onward, and resumes execution starting at the offending load.

Serializing instructions like CPUID force the machine to wait until all queued stores have been written out.
(Actually, serializing instructions force the machine to wait until they are retired, but they cannot retire until all older stores have retired, which has an effect equivalent to draining a store queue.) Note that serializing instructions do not serialize the other processors, only the local processor.

You should be able to reproduce your bug by manually working through the possible processor-ordering-consistent interleavings of events from multiple processors. Note that you should think of a processor as also observing itself.

Finally, since the caches are actually fully coherent, you should be able to do correct locking without too many serializing instructions, perhaps without any.

Future Intel processors will implement the same memory ordering model.

===================================

To: research.bell-labs.com!presotto
Subject: Re: Pentium Pro and coherence
Date: Tue, 22 Apr 1997 09:24:42 -0700
From: Mike Haertel

> 0,0 blows us away. If I understand correctly, putting a
> synchronizing instruction between the writes and subsequent read
>
>       P1:             P2:
>       x = 0           y = 0
>       x = 1           y = 1
>       cpuid           cpuid
>       read y          read x
>
> will cause the processor the instruction was executed on
> to wait until all processors have gotten out their
> queued stores and then blow away any inconsistencies
> caused by speculative loads.

The cpuid waits only until the *local* processor has gotten out its queued stores. It doesn't wait for any of the other processors.

However, in this example (where all processors do cpuid before any processor does a load) I think you're OK. The cpuid forces the local processor to wait until its queued writes have been globally observed. What this means is that you are effectively serializing access to "the bus" (really, the combination of the bus and the coherent caches--writes to M-state cache lines on the local processor count as "globally observed"). Some processor (say P2) is last to execute cpuid.
This means that P1 has already executed cpuid, therefore P1's "x=1" has been globally observed, so P2's load is guaranteed to see x=1.

Finally, I'd like to emphasize: The inconsistencies are NOT caused by speculative loads, they are caused by queued writes on other processors.

> What we need is that if the following sequence is executed
>
>       P1:             P2:
>       x = 0           y = 0
>       x = 1           y = 1
>       read y          read x
>
> the values read will be one of
>
>       1 0
>       0 1
>       1 1
>
> 0,0 blows us away.

You could get 0,0 even on the 486 or Pentium. The difference is that the PPro has such deep pipelines and buffers that it is more likely to expose such bugs.

===================================

To: research.bell-labs.com!presotto
Date: Tue, 22 Apr 1997 11:01:30 -0700
From: Mike Haertel

> Do you mind if I repost your mail to the 9fans list?

Sure, go ahead.

One other addendum I'd like to make: in your original post to 9fans, you mentioned some paranoia about similar problems possibly existing in other parts of the kernel. One bit of reassurance: any data structure protected by a spin lock is safe. Here's why:

    P1                          P2
    [already holding lock]      wait for lock->busy == 0
    store data->x               grab lock
    store data->y               use data->x and ->y
    lock->busy = 0

Because of processor ordering, when P2 observes lock->busy == 0, it also has observed all prior stores by P1. Hence P2 never gets an inconsistent view of P1's updates.

This would not be the case if the Pentium Pro allowed speculative loads to violate processor ordering semantics.

This is also probably not the case on other processors with weaker memory ordering semantics. Digital's Alpha may be one such processor, I'm not sure. On those processors, when releasing a spin lock you need a "lock release" synchronization instruction rather than a simple store.

- ------- End of Forwarded Message

------- End of Forwarded Message
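
Editor's sketch: Haertel suggests working through the possible interleavings by hand. The Python below does that mechanically for the x/y example and for the spin-lock hand-off. It is only a toy model, and everything in it is an assumption made for illustration: each processor is given a FIFO store buffer that drains into one shared memory, loads run in program order, and "fence" stands in for a serializing instruction such as cpuid. The names outcomes, xy_litmus and lock_handoff are hypothetical. This is not a description of the Pentium Pro's actual hardware, and the model is somewhat stricter than "processor ordering" as the messages describe it.

```python
def _put(seq, i, item):
    """Return a copy of tuple seq with position i replaced by item."""
    return seq[:i] + (item,) + seq[i + 1:]


def outcomes(programs, initial):
    """Explore every interleaving of a toy store-buffer machine.

    Returns a set; each element holds, per processor, the tuple of values
    its loads returned in program order.
    """
    n = len(programs)
    start = ((0,) * n,                          # program counters
             ((),) * n,                         # per-processor FIFO store buffers
             tuple(sorted(initial.items())),    # shared memory
             ((),) * n)                         # values loaded so far
    seen, results, stack = set(), set(), [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, loads = state
        if all(pcs[i] == len(programs[i]) for i in range(n)):
            results.add(loads)      # later drains cannot change past loads
            continue
        memory = dict(mem)
        for i in range(n):
            if bufs[i]:             # the oldest queued store reaches memory
                var, val = bufs[i][0]
                drained = dict(memory)
                drained[var] = val
                stack.append((pcs, _put(bufs, i, bufs[i][1:]),
                              tuple(sorted(drained.items())), loads))
            if pcs[i] == len(programs[i]):
                continue
            op = programs[i][pcs[i]]
            step = _put(pcs, i, pcs[i] + 1)
            if op[0] == "store":    # stores are queued, not yet visible
                stack.append((step, _put(bufs, i, bufs[i] + (op[1:],)),
                              mem, loads))
            elif op[0] == "load":   # newest own queued store wins, else memory
                own = [v for var, v in bufs[i] if var == op[1]]
                val = own[-1] if own else memory[op[1]]
                stack.append((step, bufs, mem,
                              _put(loads, i, loads[i] + (val,))))
            elif op[0] == "fence" and not bufs[i]:
                # a fence can only be passed once the local buffer is empty
                stack.append((step, bufs, mem, loads))
    return results


def xy_litmus(fenced):
    """The x/y example; returns the set of (y read by P1, x read by P2)."""
    fence = [("fence",)] if fenced else []
    p1 = [("store", "x", 0), ("store", "x", 1)] + fence + [("load", "y")]
    p2 = [("store", "y", 0), ("store", "y", 1)] + fence + [("load", "x")]
    return {(r[0][0], r[1][0]) for r in outcomes([p1, p2], {"x": 0, "y": 0})}


def lock_handoff():
    """Spin-lock release; returns the set of (busy, x, y) values P2 reads."""
    p1 = [("store", "x", 1), ("store", "y", 1), ("store", "busy", 0)]
    p2 = [("load", "busy"), ("load", "x"), ("load", "y")]
    start = {"x": 0, "y": 0, "busy": 1}
    return {r[1] for r in outcomes([p1, p2], start)}


if __name__ == "__main__":
    print("no fence:", sorted(xy_litmus(False)))
    print("fence:   ", sorted(xy_litmus(True)))
    print("handoff: ", sorted(lock_handoff()))
```

Under these assumptions the unfenced run includes the 0,0 result and the fenced run does not, and every hand-off outcome in which P2 reads busy == 0 also has x == 1 and y == 1. That matches the argument in the messages: queued stores alone are enough to produce 0,0, and a plain store suffices to release the lock as long as one processor's stores become visible in order.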