From: Robert Watson
Date: Wed, 24 Nov 2004 15:35:11 +0000 (GMT)
To: Hogan Whittall
Cc: freebsd-current@freebsd.org
Subject: Re: Random panics with 5.3-REL, SMP
In-Reply-To: <20041123182254.GB10721@ninthgate.net>

On Tue, 23 Nov 2004, Hogan Whittall wrote:

> I'm still getting random panics, however. Doesn't appear to be related
> to anything in particular and seems to usually happen after being up for
> 1-2 days. I've attempted to get a coredump but the last panic wedged
> while dumping to disk. I'm going to be out of town for a week and won't
> have access to the box, but if anyone has experienced something like
> this before and knows of a fix, please let me know.
> Here are the specs of the machine:

"Random panics" is a little vague as a starting point, but here are some
thoughts to look at when you are back from your vacation:

- Using a serial console to the box, you can reliably gather information
  even when the core dump mechanism is not working.

- "Random panics" could mean "a lot of seemingly different panics
  happening with relative frequency", or it might mean "a few similar
  panics, happening at random intervals". It would be useful to clarify
  which it is. Recognizing that you may not be familiar with the
  intimate details of kernel failure modes, the way one might classify
  failures as "similar" is by the nature of the panic and the stack
  trace leading to it. Panics usually take one of two forms: an explicit
  call to panic() by code that has detected a failure of a kernel
  invariant ("this should never happen"), or a page fault ("the kernel
  touched some memory it shouldn't have"). Panics typically print a
  fault description, such as the pointer that was dereferenced or the
  invariant test that triggered. The same message might indicate the
  same problem occurring. A stack trace can be generated using the
  "trace" command in DDB, and is a subset of the information you might
  get by pointing gdb at a core. If the stack traces look similar
  (especially with regard to the functions close to the frame where the
  panic took place), the failure mode might be regarded as similar
  also. Regardless, when reporting panics, the panic line or header of
  the fault report is an excellent starting point.

- In terms of debugging information, it would be very useful if you
  could hook up a serial console, and when a panic occurs, send the
  output of "show pcpu" and "show trace". If this is an SMP box with an
  SMP kernel, run "show pcpu" for each CPU, and trace the active threads
  on each.
  The output of "ps" is usually pretty valuable, as it will show what
  the system was doing, and if many threads are waiting for something,
  it will show what they are waiting for. With file system related
  panics or hangs, the output of "show lockedvnods" is often also very
  useful, as it will show what file system objects were being actively
  used, and by what threads. If running with WITNESS (see below), "show
  locks" can be very helpful, as it will assist in understanding and
  debugging the synchronization state of the kernel.

- If a bug leads to an eventual panic, the problem caused by the bug
  will sometimes be better described if you have some of the kernel
  debugging features enabled, for example INVARIANTS and/or WITNESS.
  Depending on how much of a performance impact you can accept on the
  box, you might want to try some features first, then others. Features
  like INVARIANTS may also help catch the problem earlier, making it
  easier to diagnose.

I've found the single most useful tool in debugging failure modes is a
serial console, as it provides ready scroll-back to earlier console
output, a fairly reliable ability to enter the debugger using a break,
as well as functionality like remote DDB, logging of DDB output, etc.
I've heard people report very similar benefits and experiences with
firewire debugging, but since I don't really live in the world of
firewire, I'll point at serial ports :-).

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Principal Research Scientist, McAfee Research
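
The idea above of classifying panics as "similar" by their panic line and
by the functions nearest the failing frame can be sketched in a few lines
of Python run over saved serial console logs. This is only an
illustration under stated assumptions: the frame layout
("name(args) at name+0xoffset"), the header patterns, the NOISE list of
frames to skip, and the names signature() and group_reports() are all
hypothetical and would need adjusting to match the output actually
captured.

```python
import re
from collections import defaultdict

# Assumed shape of a DDB trace frame: "name(args) at name+0xoffset".
FRAME_RE = re.compile(r"^\s*(\w+)\(.*\)\s+at\s+")
# Assumed shape of the panic line or fault report header.
HEADER_RE = re.compile(r"^\s*(panic:.*|Fatal trap \d+:.*)$")

# Frames that belong to the panic/debugger machinery rather than to the
# failing code path (a hypothetical list; extend as needed).
NOISE = {"kdb_enter", "panic", "trap", "trap_fatal", "trap_pfault", "calltrap"}


def signature(report, depth=3):
    """Return (header, first `depth` non-noise function names) for one report."""
    header = None
    frames = []
    for line in report.splitlines():
        m = HEADER_RE.match(line)
        if m and header is None:
            header = m.group(1).strip()
            continue
        m = FRAME_RE.match(line)
        if m and m.group(1) not in NOISE:
            frames.append(m.group(1))
    return header, tuple(frames[:depth])


def group_reports(reports, depth=3):
    """Group report names by signature; `reports` maps name -> log text."""
    groups = defaultdict(list)
    for name, text in reports.items():
        groups[signature(text, depth)].append(name)
    return dict(groups)
```

Reports that land in the same group share a panic line and the same
nearby functions, which suggests "a few similar panics"; many groups of
one report each suggests "a lot of seemingly different panics". Argument
values and offsets are ignored on purpose, since they vary between
otherwise identical failures.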