Date: Fri, 3 Jun 2005 13:57:24 -0700 (PDT)
From: Matthew Dillon
Message-Id: <200506032057.j53KvOFw062012@apollo.backplane.com>
To: freebsd-hackers@freebsd.org
Subject: Possible instruction pipelining problem between HT's on the same die ?

I've been tracking down a crash one of our users gets occasionally.
He has a quad Intel(R) XEON(TM) CPU 2.00GHz (1996.61-MHz 686-class CPU)
system. After getting a few of these crashes he pulled three of the
four cpus out. But with just one physical cpu, with HTT turned on
(so two logical cpus), he is still getting these crashes.

This is the sequence that causes the bad data:

    cpu #0              (HT)cpu #1

    write A
    write B
                        read B
                        if (B)
                            read A   <---- gets OLD data in A, not new data

Now I was depending on the presumed write ordering, so if a foreign cpu
sees that B is updated it can assume that A has also been updated. But
I'm beginning to think that it isn't working as advertised.

I've read the manuals over and over again and they seem to only
guarantee write ordering between physical cpus, not between logical HT
cpus, and even then it appears that a cpu can do a speculative read and
thus get an old value for A even after getting a new value for B.

I looked at the various SFENCE/LFENCE/MFENCE instructions and they do
not seem to guarantee ordering for speculative accesses at all. They
all say that they do not protect against speculative reads. Bus-locked
instructions don't seem to avoid speculative reads either.

I'm even more confused because this bug is occurring between two
logical cpus on the same physical die. Is write ordering not guaranteed
with respect to the other logical cpu? Can one logical cpu prefetch
data early that then becomes obsolete by the time the instruction is
actually run? Or perhaps it's a pipeline bug... I just don't know. But
it's damn annoying.

The only solution I see is to use an actual serializing instruction
like cpuid. I really do not want to have to use cpuid :-(.

So, has anyone seen anything similar?

                                        -Matt
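
For reference, the "write A, write B / read B, read A" sequence above
can be sketched in portable C. This is only a minimal illustration and
not the kernel code under discussion: it assumes C11 atomics and POSIX
threads (C11 postdates this message), and the names a_data, b_flag,
writer, reader and run_once are hypothetical. The release store on B and
the acquire load on B express, at the language level, the ordering the
message relies on. On x86 these typically compile to ordinary stores and
loads, so the sketch states the intended contract and does not by itself
work around a hardware problem.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int a_data;          /* "A": the payload, an ordinary variable  */
static atomic_int b_flag;   /* "B": the flag that publishes the payload */

/* cpu #0 in the message: write A, then write B. */
static void *writer(void *arg)
{
    (void)arg;
    a_data = 42;                                              /* write A */
    /* Release: the write to A may not be reordered after this store. */
    atomic_store_explicit(&b_flag, 1, memory_order_release);  /* write B */
    return NULL;
}

/* cpu #1 in the message: read B, and if B is set, read A. */
static void *reader(void *arg)
{
    int *out = arg;
    /* Acquire: the read of A may not be reordered before this load. */
    while (!atomic_load_explicit(&b_flag, memory_order_acquire))
        ;                                                     /* read B */
    *out = a_data;                                            /* read A */
    return NULL;
}

/* Runs one writer/reader pair and returns the value the reader saw in A. */
int run_once(void)
{
    pthread_t w, r;
    int seen = -1;
    int rc;

    a_data = 0;
    atomic_store(&b_flag, 0);

    rc = pthread_create(&r, NULL, reader, &seen);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

With these orderings the reader must observe the new value of A once it
has observed B set, so run_once() returns 42. A run that returned the
old value (0) would correspond to the failure described above.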