From owner-freebsd-current@FreeBSD.ORG  Wed Nov 24 15:37:02 2004
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id EDB5816A4CF
	for <freebsd-current@freebsd.org>;
	Wed, 24 Nov 2004 15:37:01 +0000 (GMT)
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 7A30743D49
	for <freebsd-current@freebsd.org>;
	Wed, 24 Nov 2004 15:37:01 +0000 (GMT)
	(envelope-from robert@fledge.watson.org)
Received: from fledge.watson.org (localhost [127.0.0.1])
	by fledge.watson.org (8.13.1/8.13.1) with ESMTP id iAOFZC0F016875;
	Wed, 24 Nov 2004 10:35:12 -0500 (EST)
	(envelope-from robert@fledge.watson.org)
Received: from localhost (robert@localhost)iAOFZCc9016872;
	Wed, 24 Nov 2004 15:35:12 GMT
	(envelope-from robert@fledge.watson.org)
Date: Wed, 24 Nov 2004 15:35:11 +0000 (GMT)
From: Robert Watson <rwatson@freebsd.org>
X-Sender: robert@fledge.watson.org
To: Hogan Whittall <hogan@ninthgate.net>
In-Reply-To: <20041123182254.GB10721@ninthgate.net>
Message-ID: <Pine.NEB.3.96L.1041124152633.98085Y-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: freebsd-current@freebsd.org
Subject: Re: Random panics with 5.3-REL, SMP
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 24 Nov 2004 15:37:02 -0000


On Tue, 23 Nov 2004, Hogan Whittall wrote:

> I'm still getting random panics, however.  Doesn't appear to be related
> to anything in particular and seems to usually happen after being up for
> 1-2 days.  I've attempted to get a coredump but the last panic wedged
> while dumping to disk.  I'm going to be out of town for a week and won't
> have access to the box, but if anyone has experienced something like
> this before and knows of a fix, please let me know.  Here are the specs
> of the machine: 

"Random panics" is a little vague as a starting point, but here are some
thoughts to look at when back from your vacation:

- With a serial console attached to the box, you can reliably gather
  information even when the core dump mechanism is not working.

- "Random panics" could mean "A lot of seemingly different panics
  happening with relative frequency", or it might mean "A few similar
  panics, happening at random intervals".  It would be useful to clarify
  which it is.  Recognizing that you may not be familiar with the intimate
  details of kernel failure modes, the usual way to classify failures as
  "similar" is by the nature of the panic and the stack trace leading up
  to it.  Panics usually fall into two forms: an
  explicit call to panic() by code that has detected a failure of a kernel
  invariant ("this should never happen"), or a page fault ("the kernel
  touched some memory it shouldn't have").  Panics typically print a fault
  description, such as a pointer dereferenced, or the nature of the
  invariant test that triggered.  The same message might indicate the
  same problem occurring.  A stack trace can be generated using the "trace"
  command in DDB, and is a subset of the information you might get by
  pointing gdb at a core.  If the stack traces look similar (especially
  with regard to the functions close to the frame where the panic took
  place), the failure mode can be regarded as similar too.
  Regardless, when reporting panics, the panic line or header of the fault
  report are excellent starting points. 

- In terms of debugging information, it would be very useful if you could
  hook up a serial console, and when a panic occurs, send the output of
  "show pcpu" and "trace".  On an SMP box with an SMP kernel, run
  "show pcpu" for each cpu, and trace the active threads on each.  The
  output of "ps" is usually pretty valuable, as it will show what the
  system was doing, and if many threads are waiting for something, it
  will show what they are waiting for.  With file system related panics
  or hangs, the output of "show lockedvnods" is often also very useful, as
  it will show what file system objects were being actively used, and by
  what threads.  If running with WITNESS (see below), "show locks" can be
  very helpful, as it will assist in understanding and debugging the
  synchronization state of the kernel.

- If a bug leads to an eventual panic, the problem caused by the bug will
  sometimes be better described if you have some of the kernel debugging
  features enabled, for example INVARIANTS and/or WITNESS.  Depending on
  the performance impact you can tolerate on the box, you might want to
  try some features, then others.  Features like INVARIANTS may also help
  catch the problem earlier, making the problem easier to diagnose.
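
To make the last point concrete, a kernel configuration fragment along
these lines would turn on the debugging features mentioned above.  Treat
it as a sketch, not a recommended configuration: only DDB, INVARIANTS and
WITNESS are named in this message, and the other options (KDB,
BREAK_TO_DEBUGGER, INVARIANT_SUPPORT, WITNESS_SKIPSPIN) are added on the
assumption that they apply to your release.  Check the NOTES files in
your source tree before relying on any of them.

```
# Sketch of debugging options for a custom kernel config (assumed set).
options 	KDB			# kernel debugger framework
options 	DDB			# interactive debugger ("trace", "ps", "show ...")
options 	BREAK_TO_DEBUGGER	# a serial break enters the debugger
options 	INVARIANTS		# extra run-time consistency checks
options 	INVARIANT_SUPPORT	# support code needed by INVARIANTS
options 	WITNESS			# lock order checking, enables "show locks"
options 	WITNESS_SKIPSPIN	# skip spin mutexes to reduce WITNESS cost
```

If the box cannot take the full performance hit, one reasonable order is
to start with INVARIANTS alone and add WITNESS only if the panics point
at locking.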

I've found the single most useful tool in debugging failure modes is a
serial console, as it provides ready scroll-back to earlier console
output, a fairly reliable ability to enter the debugger using a break, as
well as functionality like remote DDB, logging of DDB output, etc.  I've
heard people report very similar benefits and experiences with firewire
debugging, but since I don't really live in the world of firewire, I'll
point at serial ports :-).
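
Once panics are being logged over the serial console, the "are these
panics similar?" question from earlier in this message can be sketched in
code.  The Python below is only an illustration under stated assumptions:
it expects trace lines that end in "at function+0xoffset", it treats
kdb_enter and panic as uninteresting frames, and it compares the three
frames nearest the failure.  The names frame_functions, panic_signature,
group_reports and SKIP_FRAMES are invented for this example, and the
depth of three is arbitrary.

```python
import re
from collections import defaultdict

# Matches the "at function+0xoffset" tail of a DDB-style trace line.
FRAME_RE = re.compile(r"\bat\s+([A-Za-z_]\w*)\+0x[0-9a-fA-F]+")
# Addresses differ between otherwise identical panics, so mask them.
HEX_RE = re.compile(r"0x[0-9a-fA-F]+")
# Assumption: these frames only report the failure, they do not cause it.
SKIP_FRAMES = ("kdb_enter", "panic")


def frame_functions(trace_text):
    """Return the function names in a trace, innermost frame first."""
    names = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


def panic_signature(panic_line, trace_text, depth=3):
    """Build a comparable key from the panic line and the nearest frames."""
    frames = [f for f in frame_functions(trace_text) if f not in SKIP_FRAMES]
    message = HEX_RE.sub("0x?", panic_line.strip())
    return (message, tuple(frames[:depth]))


def group_reports(reports, depth=3):
    """Group (label, panic_line, trace_text) tuples by panic signature."""
    groups = defaultdict(list)
    for label, panic_line, trace_text in reports:
        groups[panic_signature(panic_line, trace_text, depth)].append(label)
    return dict(groups)
```

A single large group suggests "a few similar panics at random
intervals"; many groups of one suggests "a lot of seemingly different
panics".  Either way, the panic line and the full trace still belong in
the report, since a grouping like this discards detail.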

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Principal Research Scientist, McAfee Research