From: Don Bowman
To: 'Uwe Doering', freebsd-gnats-submit@FreeBSD.org
cc: freebsd-bugs@freebsd.org
cc: freebsd-stable@freebsd.org
Date: Sat, 29 Nov 2003 11:33:58 -0500
Subject: RE: kern/59719 Re: 4.9 Stable Crashes on SuperMicro with SMP

From: Uwe Doering [mailto:gemini@geminix.org]
> Jonathan Gilpin wrote:
> > I've run memtest (memtest86.com) kindly provided by Don and it passed
> > all the tests. I've installed a kernel module to test for memory
> > errors and found that again no memory errors were found... So this
> > means it's either a problem with the CPUs or a genuine bug in the
> > kernel. (bugger!)
>
> No, that's unfortunately not what it means. If a memory test fails you
> can draw the conclusion that you have bad memory, but this doesn't work
> the other way round. If a memory test passes there is still a
> possibility that a memory chip is the culprit, since memory test
> software cannot find all errors.
>
> Also, there is the chip set on the mainboard that coordinates bus
> access etc. for the two CPUs. Mainboard and chip set developers are
> known to make errors, too. In this case you would have to swap the
> entire mainboard, possibly with one from a different manufacturer.
> I can tell you from my own experience that it is really hard to find
> reliable PC hardware these days, in light of ever shorter and faster
> product release cycles.

I have several hundred of the motherboards the poster is using, and
they work reliably in MP operation with 4.X.

The memtest86 that I sent him understands the ECC registers
on the E7501 MCH, so it should find all correctable and uncorrectable
errors.

--don