Date: Wed, 23 Oct 2002 12:35:43 -0600 (MDT)
From: Fred Clift
To: Andrew Gallatin
Subject: Re: debugging around machine-checks...
In-Reply-To: <15798.56033.844389.549256@grasshopper.cs.duke.edu>
Message-ID: <20021023113324.U98807-100000@vespa.dmz.orem.verio.net>

On Wed, 23 Oct 2002, Andrew Gallatin wrote:

>
>  > that FreeBSD is instantaneously interrupted when a machine check happens
>  > and that I don't get crash-dumps.
>
> Hmm.. I haven't used a machine check generating alpha in a while, but
> from the code in interrupt.c, it looks like it *should* give you a
> crashdump.

Perhaps I'm just clueless. I build my kernel with the option

    makeoptions DEBUG=-g

(install, reboot). Then, by hand, I run

    dumpon -v /dev/da0b

(da0b is my swap partition, twice the size of my RAM), and then I do my
fiddling with XFree86 that gives me the machine check, and I end up at
the SRM prompt.

At this point, I know that just booting will fail. I have to power-cycle
the box, and when it comes back up, savecore either doesn't find anything
or isn't being run by the rc scripts. Once I get a chance to log in,
/var/crash has only minfree in it... Should I be doing something else?
I just looked in /var/log/messages and saw no evidence of crashdumps
being written (i.e. dumping to.... or dump 254 253 252 251... etc).

>
> Can't you use the program counter from the panic output as a start?
> If its in the X server, there should be a PC from userspace.
> (see disclaimer below)
>

So can you interpret this for me then? Honestly, I just don't know what
all the fields represent -- I should probably just go read the source
code and see :)

Oct 8 06:42:24 liron /kernel: unexpected machine check:
Oct 8 06:42:24 liron /kernel:
Oct 8 06:42:24 liron /kernel: mces = 0x1
Oct 8 06:42:24 liron /kernel: vector = 0x660
Oct 8 06:42:24 liron /kernel: param = 0xfffffc0000006068
Oct 8 06:42:24 liron /kernel: pc = 0x1604006ac
Oct 8 06:42:24 liron /kernel: ra = 0x12006cb10
Oct 8 06:42:24 liron /kernel: curproc = 0xfffffe0009910200
Oct 8 06:42:24 liron /kernel: pid = 90765, comm = XFree86
Oct 8 06:42:24 liron /kernel:
Oct 8 06:42:24 liron /kernel: panic: machine check

The program counter is pc? So, with gdb and a debug version of XFree86,
I should be able to figure out what code this is?

>
> Look at alpha/alpha/interrupt.c:badaddr_read().
>
> If you're feeling really lucky, you could add code to send the
> appropriate signal (sigbus?) if the PC is in a userland app.
>
> The problem with this is that machine checks are somewhat
> asynchronous, and I'm not sure the PC at the time of the fault
> corresponds to the PC that actually caused the fault.
> (that's why there are so many memory barriers all over the pci probing
> and badaddr code).

Your explanation is helpful, and perhaps I'll try your suggestion of
turning userland machine checks into sigbus or something. I'm sure I'm
just begging for trouble here, but at least this isn't a production
machine that other people depend on :).

To send a signal to a process from within the kernel, it seems I just
call psignal(pid, signo) - is this right?

Thanks very much for your information. It looks like a little check in
machine_check() in interrupt.c will do pretty much what I want. Perhaps
I'll make sure that my hack only works on processes whose name starts
with 'X' or something, just to be safe....

Fred

--
Fred Clift - fclift@verio.net -- Remember: If brute force
doesn't work, you're just not using enough.
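
The check discussed above (signal the process instead of panicking when the PC is in userland and the process name starts with 'X') can be sketched in userland C. This is only an illustration of the decision, not FreeBSD kernel code: struct mc_report, pc_looks_userland() and should_signal() are hypothetical names, and the "top bit clear means userland" test is a simplifying assumption rather than the kernel's real address-space limit.

```c
/*
 * Userland sketch only; this is NOT FreeBSD kernel code.
 * mc_report, pc_looks_userland() and should_signal() are hypothetical names.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The two fields of the logged machine-check report that the check uses. */
struct mc_report {
    uint64_t    pc;     /* "pc" field from the panic output */
    const char *comm;   /* "comm" field (process name) */
};

/*
 * Assumption: kernel virtual addresses have the top bit set (as in the
 * logged param value 0xfffffc0000006068), so a clear top bit is treated
 * as userland. A real kernel would compare against its own limit.
 */
bool
pc_looks_userland(uint64_t pc)
{
    return (pc >> 63) == 0;
}

/* True when the hack would signal the process instead of panicking. */
bool
should_signal(const struct mc_report *r)
{
    if (r == NULL || r->comm == NULL)
        return false;
    return pc_looks_userland(r->pc) && r->comm[0] == 'X';
}
```

With the values from the log (pc = 0x1604006ac, comm = XFree86), should_signal() returns true. Since machine checks are somewhat asynchronous, the logged PC may not be the instruction that caused the fault, so a real patch could still signal the wrong place.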