From owner-freebsd-stable  Wed May  3  2:26:44 2000
Delivered-To: freebsd-stable@freebsd.org
Received: from account.abs.net (account.abs.net [207.114.5.70])
	by hub.freebsd.org (Postfix) with ESMTP
	id AC5BD37B8D3; Wed,  3 May 2000 02:26:24 -0700 (PDT)
	(envelope-from howardl@account.abs.net)
Received: (from howardl@localhost)
	by account.abs.net (8.9.3/8.9.3+RBL+DUL+RSS+ORBS) id FAA91225;
	Wed, 3 May 2000 05:25:30 -0400 (EDT)
	(envelope-from howardl)
From: Howard Leadmon <howardl@account.abs.net>
Message-Id: <200005030925.FAA91225@account.abs.net>
Subject: Re: Debugging Kernel/System Crashes, can anyone help??
In-Reply-To: <20000503175346.S8284@freebie.lemis.com> from Greg Lehey at "May
 3, 2000 05:53:46 pm"
To: Greg Lehey <grog@lemis.com>
Date: Wed, 3 May 2000 05:25:30 -0400 (EDT)
Cc: freebsd-stable@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
X-Mailer: ELM [version 2.4ME+ PL72 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


 Well, I actually have two prior crash dumps that I saved before I turned off
the dump saves to avoid running out of drive space. As I am by no means
a gdb user, if you could tell me what you're looking for I'll be happy to fire
up gdb and send you the info.

Here is what I grabbed out of the /var/crash directory (hopefully this is
useful), and I'll set the system to grab a clean dump next time around..

-rw-r--r--  1 howardl  wheel         31 Apr  8 15:44 bounds.0.gz
-rw-r--r--  1 howardl  wheel         31 Apr  9 13:39 bounds.1.gz
-rw-r--r--  1 howardl  wheel     825413 Apr  8 15:46 kernel.0.gz
-rw-r--r--  1 howardl  wheel     825413 Apr  9 13:40 kernel.1.gz
-rw-rw-r--  1 howardl  wheel          5 Feb 14 19:08 minfree
-rw-r--r--  1 howardl  wheel  111631734 Apr  8 15:46 vmcore.0.gz
-rw-r--r--  1 howardl  wheel  110971933 Apr  9 13:40 vmcore.1.gz

NOTE - I gzipped the dumps to save space, as with 384M RAM it was leaving
       some rather large files...


Also, thanks for the quick response..


> >  I know I posted a few messages here in the past, but maybe someone who is
> > good at tracking kernel problems can step up and lend a hand.
> >
> >  I have a machine running FBSD 4.0-STABLE, and have been experiencing almost
> > daily kernel panics or reboots on the machine.  I have replaced ALL of the
> > hardware, and reloaded the OS, but am still having trouble.  From one
> > panic, I thought maybe this was an SMP issue, but I removed one of the
> > CPUs and the box still crashes.  As I have basically replaced everything,
> > I am at a loss as to where to go from here, so I am looking for some
> > pointers or help with this.
> 
> Indeed.  We need to address this issue in some detail.  We need both
> documentation and tools.
> 
> >  The other day I was there, and got the following from one of the
> > crashes; many times I am gone, and luckily in some ways the box
> > will just panic, reboot, and go on its way.  Here is what I was able to
> > copy down:
> >
> >
> > Fatal trap 12: page fault while in kernel mode
> > mp_lock=01000002; cpuid=1; lapic.id=01000000
> > fault virtual address= 0x30
> > fault code= supervisor read, page not present
> > instruction pointer= 0x8:0xC01CAF71
> > stack pointer= 0x10:0xFF80DE48
> > frame pointer= 0x10:0xFF80DE4C
> > code segment= base 0x0, limit 0xFFFFF, type 0x1B
> >             = DPL 0, pres 1, def 32, gran 1
> > processor eflags= interrupt enabled, resume, IOPL=0
> > current process = idle
> > interrupt mask= bio <- SMP: XXX
> > trap number= 12
> > panic: page fault
> >
> > The formatting of it may not be perfect, but the information should be
> > accurate, as I tried to be precise on what I wrote down.  Also here are
> > a few previous messages I had posted a while back when I thought this
> > might be network related, but after trying several different NICs I still
> > have the same issues.  I will include the info below, as maybe it will
> > have some value in trying to debug this problem.
> 
> The sad thing is that most of this information is almost useless.
> I'm thinking of printing out a stack trace instead (comments,
> anybody?).  Without tedious comparison with
> your kernel namelist, all we can say here is that you died somewhere
> in the kernel, that you have an SMP machine, and that the block I/O
> subsystem is probably involved.  If this is happening daily, you
> should build a kernel with debugging symbols enabled and take a dump
> of the next crash.  We can then use gdb to analyse the dump.
> 
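The "tedious comparison with your kernel namelist" mentioned above amounts
to finding the last text symbol at or below the faulting instruction
pointer. A minimal sketch of that lookup follows, assuming the namelist was
produced with something like `nm -n kernel`. The symbol names and addresses
in SAMPLE_NM are invented for illustration and do NOT come from the crashing
kernel; only the instruction pointer 0xC01CAF71 is taken from the trap
message.

```python
import bisect

# Hypothetical excerpt of numerically sorted `nm -n` output.
# These names and addresses are made up; substitute the real namelist.
SAMPLE_NM = """\
c01ca000 T example_func_a
c01cae00 T example_func_b
c01cb200 T example_func_c
"""


def parse_nm(text):
    """Return a sorted list of (address, name) for text (code) symbols."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip undefined symbols and malformed lines
        addr, kind, name = parts
        if kind in ("T", "t"):
            symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def lookup(symbols, pc):
    """Return (name, offset) of the symbol containing pc, or None."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, pc) - 1
    if index < 0:
        return None  # pc lies below the first known symbol
    base, name = symbols[index]
    return name, pc - base


symbols = parse_nm(SAMPLE_NM)
result = lookup(symbols, 0xC01CAF71)  # instruction pointer from the trap
if result is not None:
    print("%s+0x%x" % result)
```

With the real namelist in place of SAMPLE_NM, the printed name would be the
function that was executing at the time of the fault. It says nothing about
the call chain, which is why a dump and a stack trace are still needed.
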
> >   Hello, I am running a 4.0-STABLE machine which is being used to host an
> > Undernet IRC server, and the machine keeps dying at times, or should I say
> > the networking side of it is at least dying.  At first I thought it might
> > have been related to the dc (DEC Chip) based drivers, so I replaced it with
> > an EEpro board using the fxp driver, but got the same results.
> >
> > <snip>
> 
> If all your dumps have the interrupt mask set to bio, I don't think
> it's a networking problem.  With one possible exception...
> 
> > Mar 27 12:39:00 u2 /kernel: fxp0: device timeout
> 
> Søren and I are trying to find out what is causing some weird Vinum
> problems.  He stated that the problem happened more frequently when
> an fxp board was in the system.  I don't believe him, and I've found
> at least one bug in Vinum that has nothing to do with networking (but
> does have to do with the bio mask); possibly, however, there's some
> other problem with the fxp driver.
> 
> It's possible that the other information will be of use, but I think
> we first need to look at a dump.
> 
> Greg
> --
> Finger grog@lemis.com for PGP public key
> See complete headers for address and phone numbers


---
Howard Leadmon - howardl@abs.net - http://www.abs.net
ABSnet Internet Services - Phone: 410-361-8160 - FAX: 410-361-8162

