From: John Baldwin
Date: Wed, 8 Jan 2014 21:04:12 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r260457 - head/sys/x86/x86
Message-Id: <201401082104.s08L4CfJ066363@svn.freebsd.org>

Author: jhb
Date: Wed Jan 8 21:04:12 2014
New Revision: 260457
URL: http://svnweb.freebsd.org/changeset/base/260457

Log:
  The changes in r233781 attempted to make logging during a machine check
  exception more readable.  In practice they prevented all logging during
  a machine check exception on at least some systems.
  Specifically, when an uncorrected ECC error is detected in a DIMM on a
  Nehalem/Westmere class machine, all CPUs receive a machine check
  exception, but only CPUs on the same package as the memory controller
  for the erroring DIMM log an error.  The CPUs on the other package would
  complete the scan of their machine check banks and panic before the
  first set of CPUs could log an error.  The end result was a clearer
  display during the panic (no interleaved messages), but a crashdump
  without any useful info about the error that occurred.

  To handle this case, make all CPUs spin in the machine check handler
  once they have completed their scan of their machine check banks until
  at least one machine check error is logged.  I tried using a DELAY()
  instead so that the CPUs would not potentially hang forever, but that
  was not reliable in testing.

  While here, don't clear MCIP from MSR_MCG_STATUS before invoking panic.
  Only clear it if the machine check handler does not panic and returns
  to the interrupted thread.

Modified:
  head/sys/x86/x86/mca.c

Modified: head/sys/x86/x86/mca.c
==============================================================================
--- head/sys/x86/x86/mca.c	Wed Jan 8 19:34:23 2014	(r260456)
+++ head/sys/x86/x86/mca.c	Wed Jan 8 21:04:12 2014	(r260457)
@@ -53,6 +53,7 @@ __FBSDID("$FreeBSD$");
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -84,7 +85,7 @@ struct mca_internal {
 
 static MALLOC_DEFINE(M_MCA, "MCA", "Machine Check Architecture");
 
-static int mca_count;		/* Number of records stored. */
+static volatile int mca_count;	/* Number of records stored. */
 static int mca_banks;		/* Number of per-CPU register banks. */
 
 static SYSCTL_NODE(_hw, OID_AUTO, mca, CTLFLAG_RD, NULL,
@@ -733,7 +734,8 @@ mca_setup(uint64_t mcg_cap)
 	TASK_INIT(&mca_refill_task, 0, mca_refill, NULL);
 	mca_fill_freelist();
 	SYSCTL_ADD_INT(NULL, SYSCTL_STATIC_CHILDREN(_hw_mca), OID_AUTO,
-	    "count", CTLFLAG_RD, &mca_count, 0, "Record count");
+	    "count", CTLFLAG_RD, (int *)(uintptr_t)&mca_count, 0,
+	    "Record count");
 	SYSCTL_ADD_PROC(NULL, SYSCTL_STATIC_CHILDREN(_hw_mca), OID_AUTO,
 	    "interval", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, &mca_ticks,
 	    0, sysctl_positive_int, "I",
@@ -939,7 +941,7 @@ void
 mca_intr(void)
 {
 	uint64_t mcg_status;
-	int recoverable;
+	int old_count, recoverable;
 
 	if (!(cpu_feature & CPUID_MCA)) {
 		/*
@@ -953,15 +955,27 @@ mca_intr(void)
 	}
 
 	/* Scan the banks and check for any non-recoverable errors. */
+	old_count = mca_count;
 	recoverable = mca_scan(MCE);
 	mcg_status = rdmsr(MSR_MCG_STATUS);
 	if (!(mcg_status & MCG_STATUS_RIPV))
 		recoverable = 0;
 
+	if (!recoverable) {
+		/*
+		 * Wait for at least one error to be logged before
+		 * panic'ing.  Some errors will assert a machine check
+		 * on all CPUs, but only certain CPUs will find a valid
+		 * bank to log.
+		 */
+		while (mca_count == old_count)
+			cpu_spinwait();
+
+		panic("Unrecoverable machine check exception");
+	}
+
 	/* Clear MCIP. */
 	wrmsr(MSR_MCG_STATUS, mcg_status & ~MCG_STATUS_MCIP);
-	if (!recoverable)
-		panic("Unrecoverable machine check exception");
 }
 
 #ifdef DEV_APIC
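
The wait loop added to mca_intr() can be illustrated outside the kernel.
What follows is a minimal user-space sketch, not the committed code. It
assumes POSIX threads stand in for CPUs and a C11 atomic_int stands in for
the volatile mca_count. The names record_count, cpu_handler,
simulate_machine_check and MAX_CPUS are hypothetical. Unlike the kernel,
which samples mca_count on each CPU at handler entry, the sketch takes one
snapshot before any thread starts so that the run is deterministic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CPUS 16	/* hypothetical limit for the sketch */

/* Stand-in for mca_count: number of error records logged so far. */
static atomic_int record_count;

struct cpu_arg {
	int old_count;	/* record_count sampled before the "bank scan" */
	int logs_error;	/* nonzero if this CPU finds a valid bank */
	int seen_count;	/* record_count observed once the spin ends */
};

/* Simulated per-CPU machine check handler. */
static void *
cpu_handler(void *p)
{
	struct cpu_arg *arg = p;

	/* Only the "local package" CPU logs a record during its scan. */
	if (arg->logs_error)
		atomic_fetch_add(&record_count, 1);

	/* Spin until at least one record was logged since the snapshot. */
	while (atomic_load(&record_count) == arg->old_count)
		;	/* the kernel calls cpu_spinwait() here */

	/* This is the point where the kernel would call panic(). */
	arg->seen_count = atomic_load(&record_count);
	return (NULL);
}

/* Returns how many simulated CPUs saw a new record before "panicking". */
int
simulate_machine_check(int ncpus, int logging_cpu)
{
	pthread_t threads[MAX_CPUS];
	struct cpu_arg args[MAX_CPUS];
	int i, informed, old_count;

	assert(ncpus > 0 && ncpus <= MAX_CPUS);
	assert(logging_cpu >= 0 && logging_cpu < ncpus);

	/* One shared snapshot (assumption; the kernel samples per CPU). */
	old_count = atomic_load(&record_count);
	for (i = 0; i < ncpus; i++) {
		args[i].old_count = old_count;
		args[i].logs_error = (i == logging_cpu);
		args[i].seen_count = old_count;
		if (pthread_create(&threads[i], NULL, cpu_handler,
		    &args[i]) != 0)
			abort();
	}

	informed = 0;
	for (i = 0; i < ncpus; i++) {
		pthread_join(threads[i], NULL);
		if (args[i].seen_count > old_count)
			informed++;
	}
	return (informed);
}
```

Calling simulate_machine_check(4, 2) returns 4: every simulated CPU leaves
the loop only after the one logging CPU has stored its record. As in the
commit, the loop has no timeout, so it would spin forever if no CPU ever
logged an error.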