From owner-freebsd-hackers Wed May 3 1:23:43 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from freebie.lemis.com (freebie.lemis.com [192.109.197.137])
	by hub.freebsd.org (Postfix) with ESMTP id 76C4F37B94F;
	Wed, 3 May 2000 01:23:20 -0700 (PDT)
	(envelope-from grog@freebie.lemis.com)
Received: (from grog@localhost)
	by freebie.lemis.com (8.9.3/8.9.0) id RAA12069;
	Wed, 3 May 2000 17:53:47 +0930 (CST)
Date: Wed, 3 May 2000 17:53:46 +0930
From: Greg Lehey
To: Howard Leadmon
Cc: freebsd-stable@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: Debugging Kernel/System Crashes, can anyone help??
Message-ID: <20000503175346.S8284@freebie.lemis.com>
References: <200005030748.DAA84934@account.abs.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Mailer: Mutt 1.0pre2i
In-Reply-To: <200005030748.DAA84934@account.abs.net>
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.lemis.com/~grog
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF 13 24 52 F8 6D A4 95 EF
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Wednesday, 3 May 2000 at 3:48:42 -0400, Howard Leadmon wrote:
>
> Hello,
>
> I know I posted a few messages here in the past, but maybe someone who is
> good at tracking kernel problems can step up and lend a hand.
>
> I have a machine running FBSD 4.0-STABLE, and have been experiencing almost
> daily kernel panics or reboots on the machine. I have replaced ALL of the
> hardware, and reloaded the OS, but still having troubles. I am at a bit of
> a loss as to what is going on. From one panic, I thought well maybe this
> is an SMP issue, but removed one of the CPUs and still the box crashes. As
> I have basically replaced everything, I am at a loss as to where to go from
> here, so looking for some type of pointers or help with this..

Indeed.
We need to address this issue in some detail. We need both
documentation and tools.

> The other day I was there, and got the following from one of the
> crashes, as many times I am gone and luckily in some ways the box
> will just panicboot and go on its way. Here is what I was able to
> copy down:
>
>
> Fatal trap 12: page fault while in kernel mode
> mp_lock=01000002; cpuid=1; lapic.id=01000000
> fault virtual address   = 0x30
> fault code              = supervisor read, page not present
> instruction pointer     = 0x8:0xC01CAF71
> stack pointer           = 0x10:0xFF80DE48
> frame pointer           = 0x10:0xFF80DE4C
> code segment            = base 0x0, limit 0xFFFFF, type 0x1B
>                         = DPL 0, pres 1, def 32, gran 1
> processor eflags        = interrupt enabled, resume, IOPL=0
> current process         = idle
> interrupt mask          = bio <- SMP: XXX
> trap number             = 12
> panic: page fault
>
> The formatting of it may not be perfect, but the information should be
> accurate, as I tried to be precise on what I wrote down. Also here are
> a few previous messages I had posted a while back when I thought this
> might be network related, but after trying several different NICs I still
> have the same issues. I will include the info below, as maybe it will
> have some value in trying to debunk this problem.

The sad thing is that most of this information is almost useless.
I'm thinking of printing out a stack trace instead (comments,
anybody?). Without tedious comparison with your kernel namelist, all
we can say here is that you died somewhere in the kernel, that you
have an SMP machine, and that the block I/O subsystem is probably
involved.

If this is happening daily, you should build a kernel with debugging
symbols enabled and take a dump of the next crash. We can then use
gdb to analyse the dump.

> Hello, I am running a 4.0-STABLE machine which is being used to host an
> Undernet IRC server, and the machine keeps dying at times, or should I say
> the networking side of it is at least dying.
> At first I thought it might
> have been related to the dc (DEC Chip) based drivers, so I replaced it with
> an EEpro board using the fxp driver, but the same results.
>
>

If all your dumps have the interrupt mask set to bio, I don't think
it's a networking problem. With one possible exception...

> Mar 27 12:39:00 u2 /kernel: fxp0: device timeout

Søren and I are trying to find out what is causing some weird Vinum
problems. He stated that the problem happened more frequently when an
fxp board was in the system. I don't believe him, and I've found at
least one bug in Vinum that has nothing to do with networking (but
does have to do with the bio mask); possibly, however, there's some
other problem with the fxp driver.

It's possible that the other information will be of use, but I think
we first need to look at a dump.

Greg
--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers
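
The "tedious comparison with your kernel namelist" mentioned above
amounts to finding the last symbol at or below the faulting
instruction pointer. What follows is only a minimal sketch in Python,
assuming a symbol listing in the three-column "address type name"
format that `nm -n` prints. The symbol names and addresses in the
sample are invented for illustration; only the instruction pointer
0xC01CAF71 comes from the trap message.

```python
import bisect


def parse_nm(text):
    """Parse "address type name" lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # undefined symbols and blank lines have no address
        addr, _kind, name = parts
        try:
            symbols.append((int(addr, 16), name))
        except ValueError:
            continue  # not a hexadecimal address, skip the line
    symbols.sort()
    return symbols


def lookup(symbols, ip):
    """Return (name, offset) for the last symbol at or below ip, else None."""
    addrs = [addr for addr, _name in symbols]
    i = bisect.bisect_right(addrs, ip) - 1
    if i < 0:
        return None  # ip lies below every known symbol
    addr, name = symbols[i]
    return name, ip - addr


# Hypothetical namelist excerpt: these names and addresses are made up.
SAMPLE_NM = """\
c01cae00 T example_func_a
c01caf40 T example_func_b
c01cb100 T example_func_c
"""

symbols = parse_nm(SAMPLE_NM)
result = lookup(symbols, 0xC01CAF71)  # instruction pointer from the trap
if result is not None:
    print("%s+0x%x" % result)
```

With a real listing the sample string would be replaced by the output
of nm run against the kernel that actually crashed; a listing from any
other kernel build gives meaningless results. Even then this names
only the function containing the fault, which is why a dump and gdb
are still the better route.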