From: Markus Gebert
Date: Sat, 10 Jul 2010 01:53:39 +0200
To: John Baldwin
Cc: freebsd-stable@freebsd.org
Subject: Re: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?
Message-Id: <08562D52-02AA-46CF-BFCD-00D0A3C4DC34@hostpoint.ch>
In-Reply-To: <201007091603.31843.jhb@freebsd.org>
References: <6B57591F-9FA2-45EB-825F-1DB025C0635D@hostpoint.ch> <201007091603.31843.jhb@freebsd.org>

Hi John

On 09.07.2010 at 22:03, John Baldwin wrote:

> On Friday, July 09, 2010 11:26:00 am Markus Gebert wrote:
>> --
>> MCA: Bank 4, Status 0xb400004000030c2b
>> MCA: Global Cap 0x0000000000000105, Status 0x0000000000000007
>> MCA: Vendor "AuthenticAMD", ID 0x40f13, APIC ID 2
>> MCA: CPU 2 UNCOR BUSLG Observer WR I/O
>> MCA: Address 0xfd00000000
>
> Using my local port of mcelog this is what I get for this check:
>
> CPU 2 4 northbridge
> ADDR fd00000000
> Northbridge Master abort
> link number = 4
> bit61 = error uncorrected
> bus error 'local node observed, request didn't time out
> generic write mem transaction
> i/o access, level generic'
> STATUS b400004000030c2b MCGSTATUS 7
> MCGCAP 105 APICID 2 SOCKETID 0
> CPUID Vendor AMD Family 15 Model 65
>
> I don't know what to tell you off hand. Did you buy this hardware from Sun
> directly? If so, I would try bugging them about this, especially given the
> error that the BIOS is logging.

Yes, this hardware comes from Sun directly, but getting Sun (/Oracle) support for this issue is going to be tough. FreeBSD is unsupported, and in a short test we couldn't reproduce the problem with a Linux kernel. While I agree that a hardware issue has always been and still is a possibility to be considered, the fact remains that we tested this on two machines and that 6.x and 7.x do not show the behavior. Another possibility is, of course, that the X4100 is prone to such issues and that 6.x and 7.x either have workarounds we're not aware of or simply do something differently, so that the issue does not get triggered.
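
As a side note, the status word in the MCA report quoted above can be decoded by hand. The following is a minimal sketch in Python, assuming the architectural MCi_STATUS layout (flag bits 63..57, MCA error code in bits 15..0) and the standard bus-error code format. The function name and dictionary keys are made up for illustration, and the AMD-specific northbridge fields (such as the link number that mcelog prints) are deliberately not decoded.

```python
# Flag bits in the upper part of an MCi_STATUS register (architectural layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # MCi_MISC holds valid data
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context may be corrupt
}

# Sub-fields of a bus/interconnect error code: 0000 1PPT RRRR IILL
PARTICIPATION = ["SRC", "RES", "OBS", "generic"]
REQUEST = ["generic", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]
MEM_IO = ["MEM", "reserved", "I/O", "generic"]
LEVEL = ["L0", "L1", "L2", "LG"]


def decode_mci_status(status):
    """Decode the flag bits and, if present, the bus-error code.

    Hypothetical helper for illustration only; it is not part of
    FreeBSD or mcelog.
    """
    flags = [name
             for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}

    # Bit 11 set with bits 15..12 clear marks a bus/interconnect error.
    if (code & 0xF800) == 0x0800:
        rrrr = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "request": REQUEST[rrrr] if rrrr < len(REQUEST) else "reserved",
            "mem_io": MEM_IO[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    print(decode_mci_status(0xb400004000030c2b))
```

For 0xb400004000030c2b this yields the flags VAL, UC, EN and ADDRV and a bus error decoded as OBS, WR, I/O, LG with no timeout, which lines up with the "UNCOR BUSLG Observer WR I/O" line in the report.
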

> It does sound like a hardware issue, but in
> the chipset, not in the RAM, so you might need to swap out the main board
> rather than the RAM.

Yep. The MCA report does not indicate RAM problems, and the MCE itself was not our only reason to replace the RAM. We found a Sun document about the X4200 series getting HyperTransport errors when RAM from a certain vendor is installed, so we swapped the RAM to rule that out.

We did not replace the mainboard, but testing on a second X4100 should achieve about the same.

> I'm curious if disabling USB legacy support in the BIOS causes it to still die
> even with ehci not loaded. If so, then the SMI# for the ehci controller must
> somehow prevent the issue, perhaps by triggering frequently enough to slow the
> rate of I/O requests down?

I disabled USB legacy support in the BIOS and booted a kernel with usb+ohci+ukbd+ums but without ehci. Unfortunately, I cannot reproduce the MCE.

Just to be sure I understand you correctly: your theory is that when we don't load the ehci driver, the ehci controller isn't taken over during boot and is therefore handled through SMM, so that SMIs might occur often enough to throttle the system just enough to keep the problem from appearing? I'm not very familiar with USB legacy support and SMM, but why would ehci be handled through SMM when the only USB devices (the virtual keyboard and mouse) actually sit on ohci? And why would disabling legacy support help in getting more SMIs to throttle the system? As I understand it, and I might be terribly wrong, legacy support is what would cause SMIs in the first place.

Markus