From: Markus Gebert
To: John Baldwin
Cc: freebsd-stable
Date: Mon, 12 Jul 2010 14:41:51 +0200
Subject: Re: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?
In-Reply-To: <08562D52-02AA-46CF-BFCD-00D0A3C4DC34@hostpoint.ch>
References: <6B57591F-9FA2-45EB-825F-1DB025C0635D@hostpoint.ch> <201007091603.31843.jhb@freebsd.org> <08562D52-02AA-46CF-BFCD-00D0A3C4DC34@hostpoint.ch>

On 10.07.2010, at 01:53, Markus Gebert wrote:

>> I'm curious if disabling USB legacy support in the BIOS causes it to still die
>> even with ehci not loaded. If so, then the SMI# for the ehci controller must
>> somehow prevent the issue, perhaps by triggering frequently enough to slow the
>> rate of I/O requests down?
>
>
> I disabled usb legacy support in the BIOS and booted a kernel with
> usb+ohci+ukbd+ums but without ehci. Unfortunately, I cannot reproduce
> the MCE.

Well, the situation has changed. The machine died over the weekend
running our test load with the above kernel configuration. It seems that
not having ehci in the kernel at boot just makes the MCE much less
likely to occur, but it still occurs. With ehci, I can panic the machine
within a minute; without ehci it seems to take at least hours. Still, I
don't get why not having the ehci driver in the kernel should have any
effect, especially because nothing is attached to it.

Panic message:

----
MCA: Bank 4, Status 0xb400004000030c2b
MCA: Global Cap 0x0000000000000105, Status 0x0000000000000007
MCA: Vendor "AuthenticAMD", ID 0x40f13, APIC ID 2
MCA: CPU 2 UNCOR BUSLG Observer WR I/O
MCA: Address 0xfd00000000
panic: blockable sleep lock (sleep mutex) 128 @ /usr/src/sys/vm/uma_core.c:1992
cpuid = 2
KDB: enter: panic
[thread pid 12 tid 100039 ]
Stopped at kdb_enter+0x3d: movq $0,0x69ccb0(%rip)
----

I don't know why it's not a fatal trap 28 this time even though an MCE
was detected. I have seen this before though, also with kernels that
have ehci and with usb legacy support enabled, so seeing a different
panic this time does not seem to be related to the way the kernel was
configured. Maybe it's a symptom? Or might it even be useful? If so,
what should I pull out of DDB?

In the meantime, I'll try harder to reproduce the MCE on current...


Markus
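
P.S. For anyone who wants to check the Bank 4 status word from the panic message by hand, here is a minimal sketch in Python. It assumes the architectural x86 MCi_STATUS layout (flag bits 63..57, and the compound bus/interconnect error code format 0000 1PPT RRRR IILL in the low 16 bits). The function and table names are made up for illustration and are not kernel interfaces, and the model-specific bits of the AMD northbridge bank are not decoded at all.

```python
# Sketch: decode the architectural fields of an x86 machine-check
# status word (MCi_STATUS). All names here are illustrative only.

FLAG_BITS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents valid
    58: "ADDRV",  # ADDR register contents valid
    57: "PCC",    # processor context corrupt
}

PARTICIPATION = ["SRC", "RES", "OBS", "generic"]           # PP field
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]                   # RRRR field
MEM_IO = ["M", "reserved", "IO", "generic"]                # II field
LEVEL = ["L0", "L1", "L2", "LG"]                           # LL field


def decode_mci_status(status):
    """Return the set flags and, for bus errors, the decoded sub-fields."""
    flags = [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "mca_code": code}

    # Bus/interconnect errors have the form 0000 1PPT RRRR IILL.
    if (code & 0xF800) == 0x0800:
        rrrr = (code >> 4) & 0xF
        result["bus_error"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "request": REQUEST[rrrr] if rrrr < len(REQUEST) else "reserved",
            "mem_io": MEM_IO[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    # Status value taken from the panic message above.
    print(decode_mci_status(0xB400004000030C2B))
```

Under those assumptions the decoded fields agree with the kernel's own "UNCOR BUSLG Observer WR I/O" line: UC and ADDRV are set, PCC is clear, and the bus error is an observed I/O write at the generic level with no timeout.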