From owner-freebsd-hackers  Wed May  3  1:23:43 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from freebie.lemis.com (freebie.lemis.com [192.109.197.137])
	by hub.freebsd.org (Postfix) with ESMTP
	id 76C4F37B94F; Wed,  3 May 2000 01:23:20 -0700 (PDT)
	(envelope-from grog@freebie.lemis.com)
Received: (from grog@localhost)
	by freebie.lemis.com (8.9.3/8.9.0) id RAA12069;
	Wed, 3 May 2000 17:53:47 +0930 (CST)
Date: Wed, 3 May 2000 17:53:46 +0930
From: Greg Lehey <grog@lemis.com>
To: Howard Leadmon <howardl@account.abs.net>
Cc: freebsd-stable@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: Debugging Kernel/System Crashes, can anyone help??
Message-ID: <20000503175346.S8284@freebie.lemis.com>
References: <200005030748.DAA84934@account.abs.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Mailer: Mutt 1.0pre2i
In-Reply-To: <200005030748.DAA84934@account.abs.net>
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.lemis.com/~grog
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wednesday,  3 May 2000 at  3:48:42 -0400, Howard Leadmon wrote:
>
>    Hello,
>
>  I know I posted a few messages here in the past, but maybe someone who is
> good at tracking kernel problems can step up and lend a hand.
>
>  I have a machine running FBSD 4.0-STABLE, and have been experiencing almost
> daily kernel panics or reboots on the machine.  I have replaced ALL of the
> hardware, and reloaded the OS, but still having troubles.  I am at a bit of
> a loss as to what is going on.  From one panic, I thought well maybe this
> is an SMP issue, but removed one of the CPUs and still the box crashes. As
> I have basically replaced everything, I am at a loss as to where to go from
> here, so looking for some type of pointers or help with this..

Indeed.  We need to address this issue in some detail.  We need both
documentation and tools.

>  The other day I was there, and got the following from one of the
> crashes, as many times I am gone and luckily in some ways the box
> will just panicboot and go on its way.  Here is what I was able to
> copy down:
>
>
> Fatal trap 12: page fault while in kernel mode
> mp_lock=01000002; cpuid=1; lapic.id=01000000
> fault virtual address= 0x30
> fault code= supervisor read, page not present
> instruction pointer= 0x8:0xC01CAF71
> stack pointer= 0x10:0xFF80DE48
> frame pointer= 0x10:0xFF80DE4C
> code segment= base 0x0, limit 0xFFFFF, type 0x1B
>             = DPL 0, pres 1, def 32, gran 1
> processor eflags= interrupt enabled, resume, IOPL=0
> current process = idle
> interrupt mask= bio <- SMP: XXX
> trap number= 12
> panic: page fault
>
> The formatting of it may not be perfect, but the information should be
> accurate, as I tried to be precise on what I wrote down.  Also here are
> a few previous messages I had posted a while back when I thought this
> might be network related, but after trying several different NICs I still
> have the same issues.  I will include the info below, as maybe it will
> have some value in trying to debug this problem.

The sad thing is that most of this information is almost useless.
I'm thinking of printing out a stack trace instead (comments,
anybody?).  Without tedious comparison with
your kernel namelist, all we can say here is that you died somewhere
in the kernel, that you have an SMP machine, and that the block I/O
subsystem is probably involved.  If this is happening daily, you
should build a kernel with debugging symbols enabled and take a dump
of the next crash.  We can then use gdb to analyse the dump.
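
To give an idea of what that namelist comparison involves, here is a
minimal sketch in Python.  It assumes a symbol listing sorted by
address, in the style of `nm -n` output, and finds the nearest symbol
at or below the faulting instruction pointer.  The symbol names and
addresses in the sample are invented for illustration; they do not
come from your kernel, and the function names (parse_namelist, lookup,
parse_instruction_pointer) are my own.

```python
import bisect

# Hypothetical excerpt of a kernel namelist in "nm -n" style
# (address, type, name).  These entries are invented for illustration;
# a real list would have to come from the kernel that actually crashed.
SAMPLE_NAMELIST = """\
c01ca000 T example_func_a
c01cae00 T example_func_b
c01cb400 T example_func_c
"""


def parse_namelist(text):
    """Return a list of (address, name) pairs sorted by address."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and entries without an address
        address, _symbol_type, name = fields
        symbols.append((int(address, 16), name))
    symbols.sort()
    return symbols


def lookup(symbols, pc):
    """Return (name, offset) for the nearest symbol at or below pc, else None."""
    addresses = [addr for addr, _name in symbols]
    index = bisect.bisect_right(addresses, pc) - 1
    if index < 0:
        return None  # pc lies below the first known symbol
    addr, name = symbols[index]
    return name, pc - addr


def parse_instruction_pointer(field):
    """Take a 'selector:offset' value such as '0x8:0xC01CAF71'; return the offset."""
    _selector, offset = field.split(":")
    return int(offset, 16)


if __name__ == "__main__":
    # Instruction pointer as copied down from the panic message.
    pc = parse_instruction_pointer("0x8:0xC01CAF71")
    result = lookup(parse_namelist(SAMPLE_NAMELIST), pc)
    if result is not None:
        name, offset = result
        print("%s+0x%x" % (name, offset))
```

Even with a real namelist, this only tells us which function the fault
occurred in, not how we got there, which is why a dump is so much more
useful.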

>   Hello, I am running a 4.0-STABLE machine which is being used to host an
> Undernet IRC server, and the machine keeps dying at times, or should I say
> the networking side of it is at least dying.  At first I thought it might
> have been related to the dc (DEC Chip) based drivers, so I replaced it with
> a EEpro board using the fxp driver, but the same results.
>
> <snip>

If all your dumps have the interrupt mask set to bio, I don't think
it's a networking problem.  With one possible exception...

> Mar 27 12:39:00 u2 /kernel: fxp0: device timeout

Søren and I are trying to find out what is causing some weird Vinum
problems.  He stated that the problem happened more frequently when
an fxp board was in the system.  I don't believe him, and I've found
at least one bug in Vinum that has nothing to do with networking (but
does have to do with the bio mask); possibly, however, there's some
other problem with the fxp driver.

It's possible that the other information will be of use, but I think
we first need to look at a dump.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers

