Date: Sat, 10 Jul 2010 01:53:39 +0200
From: Markus Gebert <markus.gebert@hostpoint.ch>
To: John Baldwin <jhb@freebsd.org>
Cc: freebsd-stable@freebsd.org
Subject: Re: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?
Message-ID: <08562D52-02AA-46CF-BFCD-00D0A3C4DC34@hostpoint.ch>
In-Reply-To: <201007091603.31843.jhb@freebsd.org>
References: <6B57591F-9FA2-45EB-825F-1DB025C0635D@hostpoint.ch> <201007091603.31843.jhb@freebsd.org>

Hi John

On 09.07.2010 at 22:03, John Baldwin wrote:

> On Friday, July 09, 2010 11:26:00 am Markus Gebert wrote:
>> --
>> MCA: Bank 4, Status 0xb400004000030c2b
>> MCA: Global Cap 0x0000000000000105, Status 0x0000000000000007
>> MCA: Vendor "AuthenticAMD", ID 0x40f13, APIC ID 2
>> MCA: CPU 2 UNCOR BUSLG Observer WR I/O
>> MCA: Address 0xfd00000000
>
> Using my local port of mcelog this is what I get for this check:
>
> CPU 2 4 northbridge
> ADDR fd00000000
> Northbridge Master abort
> link number = 4
> bit61 = error uncorrected
> bus error 'local node observed, request didn't time out
> generic write mem transaction
> i/o access, level generic'
> STATUS b400004000030c2b MCGSTATUS 7
> MCGCAP 105 APICID 2 SOCKETID 0
> CPUID Vendor AMD Family 15 Model 65
>
> I don't know what to tell you off hand. Did you buy this hardware from Sun
> directly? If so, I would try bugging them about this, especially given the
> error that the BIOS is logging.

Yes, this hardware comes from Sun directly, but getting Sun (/Oracle)
support for this issue is gonna be tough. FreeBSD is unsupported, and in
a short test we couldn't reproduce the problem with a Linux kernel.
While I agree that a hardware issue has always been and still is a
possibility to be considered, the fact remains that we tested this on
two machines, and that 6.x and 7.x do not show the behavior. Another
possibility is, of course, that the X4100 is prone to such issues and
6.x and 7.x somehow have workarounds we're not aware of, or just do
something differently so that this issue does not get triggered.

> It does sound like a hardware issue, but in
> the chipset, not in the RAM, so you might need to swap out the main board
> rather than the RAM.

Yep. The MCA report does not indicate RAM problems, and the MCE itself
was not our only reason to replace RAM.
We found a Sun document about the X4200 series getting hypertransport
errors when RAM of a certain vendor is installed, so we swapped RAM to
rule this one out.

We did not replace the mainboard though, but testing on a second X4100
should do about the same.

> I'm curious if disabling USB legacy support in the BIOS causes it to still die
> even with ehci not loaded. If so, then the SMI# for the ehci controller must
> somehow prevent the issue, perhaps by triggering frequently enough to slow the
> rate of I/O requests down?

I disabled usb legacy support in the BIOS and booted a kernel with
usb+ohci+ukbd+ums but without ehci. Unfortunately, I cannot reproduce
the MCE.

Just to get you right: your theory is that when we don't load the ehci
driver, the ehci controller isn't taken over during boot and is
therefore handled through SMM, so that SMIs might occur often enough to
throttle the system just enough to not let the problem appear? I'm not
very familiar with usb legacy support and SMM, but why would ehci be
handled through SMM when the only usb devices (the virtual keyboard and
mouse) actually sit on ohci? And why would disabling legacy support help
getting more SMIs to throttle the system? As I understand this, and I
might be terribly wrong, legacy support is what would cause SMIs in the
first place.


Markus
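
For readers following the thread, here is a minimal sketch of how the quoted status value 0xb400004000030c2b maps onto the "UNCOR BUSLG Observer WR I/O" line. It assumes the common x86 MCi_STATUS layout (flag bits 63..57, a 16-bit error code in the low bits, bus errors encoded as 0000 1PPT RRRR IILL) and assumes that bits 19:16 hold the northbridge extended error code on this CPU family. The function name decode_status and the table names are made up for illustration; this is not the kernel's or mcelog's code.

```python
# Sketch only: decode an x86 MCi_STATUS value under the assumptions above.
STATUS = 0xB400004000030C2B  # value quoted in the MCA report

# Flag bits of MCi_STATUS (bit number -> short name).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Field tables for the bus/interconnect error code 0000 1PPT RRRR IILL.
PP = ["SRC", "RES", "OBS", "generic"]            # participation
RRRR = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",
        "PREFETCH", "EVICT", "SNOOP"]            # request type
II = ["MEM", "reserved", "I/O", "generic"]       # memory or I/O
LL = ["L0", "L1", "L2", "LG"]                    # cache level


def decode_status(status):
    """Return a dict with the flags and, for bus errors, the sub-fields."""
    code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "error_code": code,
        # Assumption: bits 19:16 are the northbridge extended error code.
        "ext_code": (status >> 16) & 0xF,
    }
    if code & 0xF800 == 0x0800:  # matches the bus error pattern
        req = (code >> 4) & 0xF
        result.update(
            participation=PP[(code >> 9) & 0x3],
            timeout=bool((code >> 8) & 0x1),
            request=RRRR[req] if req < len(RRRR) else "reserved",
            io=II[(code >> 2) & 0x3],
            level=LL[code & 0x3],
        )
    return result


if __name__ == "__main__":
    for key, value in decode_status(STATUS).items():
        print(key, value)
```

Under these assumptions the UC flag is set, the participation field reads OBS (observer), the request is WR, the access is I/O and the level is LG, which is consistent with both decodes quoted above. The extended code field comes out as 3, which the mcelog output above reports as a master abort.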