From: John Baldwin
To: Markus Gebert
Cc: freebsd-stable
Date: Mon, 12 Jul 2010 08:51:35 -0400
Subject: Re: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?

On Monday, July 12, 2010 8:41:51 am Markus Gebert wrote:
> 
> On 10.07.2010, at 01:53, Markus Gebert wrote:
> 
> >> I'm curious if disabling USB legacy support in the BIOS causes it to still die
> >> even with ehci not loaded.  If so, then the SMI# for the ehci controller must
> >> somehow prevent the issue, perhaps by triggering frequently enough to slow the
> >> rate of I/O requests down?
> > 
> > 
> > I disabled usb legacy support in the BIOS and booted a kernel with usb+ohci+ukbd+ums but without ehci. Unfortunately, I cannot reproduce the MCE.
> 
> 
> Well, the situation has changed. Machine died over the weekend running our test load with above kernel configuration. It seems that not having ehci in the kernel at boot just makes the MCE much more unlikely to occur, but it occurs. With ehci, I can panic the machine within a minute, without ehci it seems to take at least hours. Still, I don't get why not having the ehci driver in the kernel should have any effect, especially because nothing is attached to it.

Ok, so maybe the SMI# interrupts do play a role somehow, at least as far as
altering the timing.

> Panic message:
> 
> ----
> MCA: Bank 4, Status 0xb400004000030c2b
> MCA: Global Cap 0x0000000000000105, Status 0x0000000000000007
> MCA: Vendor "AuthenticAMD", ID 0x40f13, APIC ID 2
> MCA: CPU 2 UNCOR BUSLG Observer WR I/O
> MCA: Address 0xfd00000000
> panic: blockable sleep lock (sleep mutex) 128 @ /usr/src/sys/vm/uma_core.c:1992
> cpuid = 2
> KDB: enter: panic
> [thread pid 12 tid 100039 ]
> Stopped at      kdb_enter+0x3d: movq    $0,0x69ccb0(%rip)
> ----
> 
> Don't know, why it's not a fatal trap 28 this time despite an MCE was detected. Seen this before though, also with kernels that have ehci and with usb legacy support, so seeing a different panic this time seems not related to the way the kernel was configured. Maybe a symptom? Or may it even be useful? If yes, what should I pull out of DDB?
> 
> In the meantime, I'll try harder to reproduce the MCE on current...

Well, it panic'd trying to malloc something in a non-safe place, because the
machine check can happen at any time like an NMI.  The panic was caused by the
MCE however.

-- 
John Baldwin
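
For anyone reading the "MCA: Bank 4, Status ..." line in the panic above, the
following is a rough sketch of how the architectural bits of such a status
value can be picked apart.  It is not the kernel's decoder.  The function name
decode_mci_status and the dictionary keys are made up for illustration.  The
bit positions (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63..57) and the
bus error pattern 0000 1PPT RRRR IILL follow the generic x86 machine check
layout; anything model-specific is only extracted, not interpreted.

```python
# Illustrative decoder for an x86 MCi_STATUS value (hypothetical helper,
# not FreeBSD code). Only the architectural fields are interpreted.

STATUS_FLAGS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register valid
    (58, "ADDRV"),  # ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
SPACE = ["MEM", "RESERVED", "IO", "GEN"]
LEVEL = ["L0", "L1", "L2", "LG"]


def decode_mci_status(status):
    """Return the flags and, for bus errors, the decoded error code fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "error_code": code,
        "model_specific": (status >> 16) & 0xFFFF,  # left uninterpreted
        "bus": None,
    }
    # Bus/interconnect errors match the pattern 0000 1PPT RRRR IILL.
    if code & 0xF800 == 0x0800:
        rrrr = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "request": REQUEST[rrrr] if rrrr < len(REQUEST) else "UNKNOWN",
            "space": SPACE[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    decoded = decode_mci_status(0xB400004000030C2B)
    print(decoded["flags"])
    print(decoded["bus"])
```

Fed the status value from the panic, this yields the UC flag with a bus error
of level LG, participation OBS, request WR and space IO, which lines up with
the kernel's own "UNCOR BUSLG Observer WR I/O" summary.  ADDRV is set as well,
consistent with the "MCA: Address" line being printed.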