From owner-freebsd-amd64@FreeBSD.ORG Mon May 2 21:29:07 2011
Message-Id: <201105022113.p42LDLrl051285@mail.karels.net>
To: freebsd-amd64@freebsd.org
From: Mike Karels
Date: Mon, 02 May 2011 16:13:20 -0500
Cc: mike_karels@mcafee.com
Subject: variable hang when starting APs on Westmere processors
Reply-To: mike@karels.net
List-Id: Porting FreeBSD to the AMD64 platform

Looks like freebsd-smp is gone... not sure of the right target for this.

I just picked up a problem from another developer at work who had the
good fortune to have scheduled a vacation this week. The short
description is that the start_ap() routine sometimes hangs, from 10
minutes to 3 hours, while starting up CPUs. This is with a much-modified
system based on FreeBSD 7.2. A stock 8.2 CD hangs at the same spot
almost all the time, although the code in the two versions appears
identical.

More details: This is amd64, using an Intel S5520HCR 2-socket
motherboard with two XEON X5660 2.8GHz Westmere hex-core CPUs. The
problem happens somewhat less with two XEON E5620 Quad core 2.4GHz CPUs.
The hang seems to happen with higher numbered CPUs, so the hex-core with
SMT has more chances to hit the problem.

We added KTRs to the code, and found that the hang happens in the
lapic_ipi_wait() call after de-asserting RESET.

Of course, Linux doesn't exhibit the problem.

Has anyone else seen a problem like this? Any ideas how to fix it, or
debug further?

Please copy me on responses; I'm not subscribed to this list currently.

Mike