Date: Wed, 23 Oct 2002 12:35:43 -0600 (MDT)
From: Fred Clift <fclift@verio.net>
To: Andrew Gallatin <gallatin@cs.duke.edu>
Cc: <freebsd-alpha@freebsd.org>
Subject: Re: debugging around machine-checks...
Message-ID: <20021023113324.U98807-100000@vespa.dmz.orem.verio.net>
In-Reply-To: <15798.56033.844389.549256@grasshopper.cs.duke.edu>

On Wed, 23 Oct 2002, Andrew Gallatin wrote:

> > that FreeBSD is instantaneously interrupted when a machine check happens
> > and that I don't get crash-dumps.
>
> Hmm.. I haven't used a machine check generating alpha in a while, but
> from the code in interrupt.c, it looks like it *should* give you a
> crashdump.

Perhaps I'm just clueless. I build my kernel with the option

    makeoptions DEBUG=-g

(install, reboot). By hand I run

    dumpon -v /dev/da0b

(da0b is my swap partition, twice the size of my RAM), and then I do my
fiddling with XFree86 that gives me the machine check, and I end up at the
SRM prompt. At this point, I know that just booting will fail. I have to
power-cycle the box, and when it comes back up, savecore either doesn't
find anything or isn't being run by the rc scripts. Once I get a chance to
log in, /var/crash has only minfree in it...

Should I be doing something else? I just looked in /var/log/messages and
saw no evidence of crash dumps being written (i.e. "dumping to...." or
"dump 254 253 252 251..." etc.).

> Can't you use the program counter from the panic output as a start?
> If it's in the X server, there should be a PC from userspace.
> (see disclaimer below)

So can you interpret this for me then? Honestly, I just don't know what all
the fields represent. I should probably just go read the source code and
see :)

Oct 8 06:42:24 liron /kernel: unexpected machine check:
Oct 8 06:42:24 liron /kernel:
Oct 8 06:42:24 liron /kernel: mces = 0x1
Oct 8 06:42:24 liron /kernel: vector = 0x660
Oct 8 06:42:24 liron /kernel: param = 0xfffffc0000006068
Oct 8 06:42:24 liron /kernel: pc = 0x1604006ac
Oct 8 06:42:24 liron /kernel: ra = 0x12006cb10
Oct 8 06:42:24 liron /kernel: curproc = 0xfffffe0009910200
Oct 8 06:42:24 liron /kernel: pid = 90765, comm = XFree86
Oct 8 06:42:24 liron /kernel:
Oct 8 06:42:24 liron /kernel: panic: machine check

The program counter is pc? So, with gdb and a debug version of XFree86, I
should be able to figure out what code this is?

> > Look at alpha/alpha/interrupt.c:badaddr_read().
>
> If you're feeling really lucky, you could add code to send the
> appropriate signal (sigbus?) if the PC is in a userland app.
>
> The problem with this is that machine checks are somewhat
> asynchronous, and I'm not sure the PC at the time of the fault
> corresponds to the PC that actually caused the fault.
> (that's why there are so many memory barriers all over the pci probing
> and badaddr code).

Your explanation is helpful, and perhaps I'll try your suggestion of
turning userland machine checks into SIGBUS or something. I'm sure I'm just
begging for trouble here, but at least this isn't a production machine that
other people depend on :).

To send a signal to a process from within the kernel, it seems I just call
psignal(pid, signo). Is this right?

Thanks very much for your information. It looks like a little check in
machine_check() in interrupt.c will do pretty much what I want. Perhaps
I'll make sure that my hack only works on processes whose name starts with
'X' or something, just to be safe....

Fred

--
Fred Clift - fclift@verio.net
--
Remember: If brute force doesn't work, you're just not using enough.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-alpha" in the body of the message
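
The "little check in machine_check()" described above could be prototyped in user space before touching the kernel. The following is only a sketch under stated assumptions: struct mc_report, ASSUMED_KERNEL_BASE, pc_is_userland() and should_signal() are hypothetical names invented for illustration, and the address boundary is an assumption taken from the shape of the addresses in the log (pc and ra are small, while param and curproc begin with 0xfffffc/0xfffffe). It is not the code in interrupt.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: addresses below this value are treated as userland.
 * The constant is a guess based on the kernel-looking addresses in the
 * log above; check the real limit in the alpha headers before relying on it.
 */
#define ASSUMED_KERNEL_BASE 0xfffffc0000000000ULL

/* Hypothetical stand-in for the two machine-check fields used here. */
struct mc_report {
    uint64_t    pc;    /* "pc" field from the machine-check output */
    const char *comm;  /* "comm" field, i.e. the process name */
};

/* True when the reported PC lies below the assumed kernel base. */
bool pc_is_userland(uint64_t pc)
{
    return pc < ASSUMED_KERNEL_BASE;
}

/*
 * Decide whether the hack would signal the process instead of panicking:
 * only for a userland PC, and only for process names starting with 'X'.
 */
bool should_signal(const struct mc_report *r)
{
    return pc_is_userland(r->pc) && r->comm != NULL && r->comm[0] == 'X';
}
```

With the values from the log (pc = 0x1604006ac, comm = XFree86), should_signal() returns true; a PC in the kernel range or a process such as sh falls through to the existing panic path. As far as I recall from psignal(9), the kernel function takes a struct proc pointer and a signal number rather than a pid, so the in-kernel call would presumably be psignal(curproc, SIGBUS); that is worth verifying against the source tree. The caveat quoted above still applies: because machine checks are somewhat asynchronous, the reported PC may not belong to the access that actually caused the fault.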