Date: Sat, 19 Feb 2005 12:43:38 +0000 (GMT)
From: Robert Watson
To: Peter Losher
Cc: stable@freebsd.org
In-Reply-To: <4217170A.2030106@isc.org>
Subject: Re: Hard lockups using 5.3-RELEASE..
List-Id: Production branch of FreeBSD source code

On Sat, 19 Feb 2005, Peter Losher wrote:

> We have a Celestica dual-Opteron system w/ 4GB RAM running
> 5.3-RELEASE/i386 (32-bit), and a SMP-aware kernel, which is experiencing
> hard lockups.  Debugging results below.

Hmm.  So just to summarize:

- The system appears to wedge
- Serial break can get into the debugger

Have you tried updating to the latest RELENG_5_3 patch level?  That
includes at least one significant SMP stability fix.  You can rebuild
along the RELENG_5_3 branch, or just use freebsd-update to pull it in.

> It looks like it's trying to lock Giant while it already has Giant.  In
> any case, we have rebuilt a uniprocessor kernel for now.  If this is
> already fixed in 5-STABLE, then let me know.
> ;)

Generally speaking, recursing Giant is fine, as Giant is a recursable
mutex; however, an ithread shouldn't already hold Giant at that point.
This may be fixed in 5-STABLE, but it's hard to say.  I think the order of
operations here is:

- First, slide to RELENG_5_3 head (p5?) to make sure you have the IPI
  stability fix.  See if the problem goes away.

- Generate the following information: when the box is wedged,

  (1) Does it respond to pings?
  (2) Does the num lock light go on and off when the num lock key is hit?
  (3) If it responds to pings, what happens when you build a new TCP
      connection to an open TCP port (a) once, (b) twice, (c) the 100th
      (or so) time?

- Generate the following DDB output using your serial console:

    show pcpu
    show pcpu 0
    show pcpu 1
    ps
    show lockedvnods

  I may then ask you to generate stack traces of the processes that appear
  "interesting".  The definition of interesting is a little bit
  context-specific, so it's hard to say what it is just now.  If there are
  a lot of processes wedged in VM and VFS, then I'll ask you to trace each
  process that appears in the lockedvnods output.

- Next, recompile with INVARIANTS and see if the problem triggers an
  assertion failure when it occurs.

- Next, recompile with WITNESS and see if WITNESS creates a warning or
  assertion failure when it occurs.  Break to the debugger and generate
  the above DDB output, but also "show alllocks" (5-STABLE only), or
  "show locks" for interesting processes if 5-RELEASE-*.

Also, I don't think you mentioned what sort of workload is present on the
box.

Thanks!

Robert N M Watson
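
The TCP check in item (3) could be scripted from another machine roughly as
follows.  This is only a sketch and not part of the original message: the
function names `probe_tcp` and `first_failure`, the host name, and the port
are all hypothetical placeholders.  It opens fresh connections one after
another and records whether each one completes, so you can report what
happened on the first, second, and 100th (or so) attempt.

```python
import socket
import time


def probe_tcp(host, port, attempts=100, timeout=3.0):
    """Open `attempts` fresh TCP connections in sequence.

    Returns a list of (attempt_number, outcome, seconds) tuples, where
    outcome is "connected", "timeout", or "error: <reason>".
    """
    results = []
    for n in range(1, attempts + 1):
        start = time.monotonic()
        try:
            # Each attempt is a brand-new connection, closed immediately.
            with socket.create_connection((host, port), timeout=timeout):
                outcome = "connected"
        except socket.timeout:
            outcome = "timeout"
        except OSError as exc:
            outcome = "error: %s" % exc
        results.append((n, outcome, time.monotonic() - start))
    return results


def first_failure(results):
    """Return the number of the first attempt that did not connect, or None."""
    for n, outcome, _ in results:
        if outcome != "connected":
            return n
    return None


if __name__ == "__main__":
    # Hypothetical target: substitute the wedged box and an open TCP port.
    results = probe_tcp("wedged-box.example.org", 22, attempts=100)
    for n, outcome, seconds in results:
        if n in (1, 2, len(results)):
            print("attempt %d: %s (%.2fs)" % (n, outcome, seconds))
    print("first failing attempt:", first_failure(results))
```

Run it while the box is wedged and include the printed lines alongside the
ping and num lock results.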