From owner-freebsd-hackers@FreeBSD.ORG Sat Jun 4 08:21:33 2005
From: Keir Fraser <keir.fraser@cl.cam.ac.uk>
Date: Sat, 4 Jun 2005 09:17:57 +0100
To: Kip Macy, dillon@apollo.backplane.com
Cc: freebsd-hackers@freebsd.org
Subject: Re: Possible instruction pipelining problem between HT's on the same die ? (fwd)

Hi,

I did a fair amount of lock-free programming during my PhD and for Xen,
so I may be able to shed some light on this situation. OTOH I may also
be confused: the x86 memory model is poorly specified, and the reference
manuals are often badly written and misleading. I'll address the points
and questions out of order....

> But I'm beginning to think that it isn't working as advertised.
> I've read the manuals over and over again and they seem to only
> guarentee write ordering between physical cpus, not between logical
> HT cpus, and even then it appears that a cpu can do a speculative
> read and thus get an old value for A even after getting a new value
> for B.

The ordering guarantees between HTs are identical to those between
physical cpus. I'm referring to Section 7.6.19 of the IARM (Intel IA-32
Reference Manual) Vol 3. It's slightly confusing that it says the model
"can further be defined as 'write-ordered with store buffer
forwarding'", but this forwarding occurs separately *within* each
logical cpu (the store buffer is statically partitioned between the two
HTs), and the phrase is identical to the one describing physical-cpu
behaviour in Section 7.2.2 (i.e. it is redundant to reiterate it in
this later section).

Reads can be speculatively executed out-of-order, but this property is
not unique to HTs: the same race could in theory happen across
physical cpus.

> Now I was depending on the presumed write ordering, so if a foreign
> cpu sees that B is updated it can assume that A has also been
> updated.

You *can* depend on write ordering. But this ordering is no help if
CPU#1 has already executed, and is retiring, the read from A by the
time it executes the read from B. It's CPU#1 that is screwing up, not
CPU#0.

> I looked at the various SFENCE/LFENCE/MFENCE instructions and they
> do not seem to guarentee ordering for speculative accesses at all.
> They all say that they do not protect against speculative reads.
> Bus-locked instructions don't seem to avoid speculative reads
> either.

I think the reference manual is being almost wilfully misleading here,
by referring to the speculative prefetch mechanism and its total
independence from the fence instructions: "data could be speculatively
loaded into the cache just before, during, or after the execution of
an MFENCE instruction".
It is important to realise that speculative execution of a
memory-reading instruction is quite different from speculative
prefetch into a cache. The latter should not matter to the programmer:
the cache coherency protocol hides it. Consider the code example in
the original email:

> cpu #0      write A
>             write B
>
> (HT)cpu #1  read B
>             if (B)
>                 read A   <---- gets OLD data in A, not new data

If CPU#1 prefetches A into its cache before it reads B, it may indeed
see the old value of A; *but* when CPU#0 writes A it will invalidate
that cacheline in all remote caches; *furthermore* CPU#0 cannot commit
its update of B until after it has committed its update of A (x86
guarantees write order). So, if CPU#1 reads the new value of B, then
any stale value of A in its cache has been invalidated by that point.
All you need to ensure is that CPU#1 hasn't speculatively executed the
read from A: precisely the purpose of MFENCE and LFENCE.

This is more complicated if both CPUs are sharing their memory
hierarchy. However, either cache lines are tagged with an HT
identifier, so the cache logically operates as two separate
variable-sized caches (in which case the normal cache coherency rules
apply, as described above), or there is true cacheline sharing (in
which case there is no stale data to worry about, as CPU#0 will
directly update the cache data that CPU#1 will read from). Either way,
there is no weakening of the memory model.

 -- Keir