Date: Thu, 24 Oct 2002 15:00:59 -0400 (EDT)
From: Andrew Gallatin <gallatin@cs.duke.edu>
To: Fred Clift <fclift@verio.net>
Cc: <freebsd-alpha@FreeBSD.ORG>
Subject: Re: debugging around machine-checks...
Message-ID: <15800.17259.397652.862956@grasshopper.cs.duke.edu>
In-Reply-To: <20021023113324.U98807-100000@vespa.dmz.orem.verio.net>
References: <15798.56033.844389.549256@grasshopper.cs.duke.edu> <20021023113324.U98807-100000@vespa.dmz.orem.verio.net>
Fred Clift writes:
 > by hand I run dumpon -v /dev/da0b (which is my swap partition, twice what
 > I have of ram in size)
 >
 > and then I do my fiddling with XFree86 that gives me the machine-check and
 > I end up at the SRM prompt.  At this point, I know that just booting will
 > fail.  I have to power-cycle the box and when it comes back up, savecore
 > either doesn't find anything, or isn't being run by the rc scripts.  Once
 > I get a chance to log in /var/crash has only minfree in it...
 >

That *should* work..

 > Should I be doing something else?
 >
 > I just looked in /var/log/messages and saw no evidence of crashdumps being
 > written (ie dumping to.... or dump 254 253 252 251... etc).

If you power-cycle, the message buffer is lost.

When I would crash X on an old miata, 1/2 the time I'd get a 'machine
check in pal mode' -- this doesn't even get caught by the OS.
However, if you're seeing the message below, I do not understand why
you're not getting a crashdump.

In any case, since the problem is probably with the X server (based on
the message below), a crashdump would not help you.

 >
 > >
 > > Can't you use the program counter from the panic output as a start?
 > > If it's in the X server, there should be a PC from userspace.
 > > (see disclaimer below)
 > >
 >
 > So can you interpret this for me then - honestly I just don't know what all
 > the fields represent -- I should probably just go read the source code and
 > see :)
 >
 > Oct 8 06:42:24 liron /kernel: unexpected machine check:
 > Oct 8 06:42:24 liron /kernel:
 > Oct 8 06:42:24 liron /kernel: mces = 0x1
 > Oct 8 06:42:24 liron /kernel: vector = 0x660
 > Oct 8 06:42:24 liron /kernel: param = 0xfffffc0000006068
 > Oct 8 06:42:24 liron /kernel: pc = 0x1604006ac
 > Oct 8 06:42:24 liron /kernel: ra = 0x12006cb10
 > Oct 8 06:42:24 liron /kernel: curproc = 0xfffffe0009910200
 > Oct 8 06:42:24 liron /kernel: pid = 90765, comm = XFree86
 > Oct 8 06:42:24 liron /kernel:
 > Oct 8 06:42:24 liron /kernel: panic: machine check
 >
 >
 > The program counter is pc?  so I should be able to, with gdb and a
 > debug-version of XFree86, figure out what code this is?

Yes, except it's in a shared lib, or other dynamically loaded text.
I don't know how you could debug that without a core dump.  The ra
(return address) is at least somewhere in the main text of the
program (not a shared lib).

<...>

 > Your explanation is helpful, and perhaps I'll try your suggestion of
 > turning userland machine checks into sigbus or something - I'm sure I'm
 > just begging for trouble here, but at least this isn't a production
 > machine that other people depend on :).
 >
 > To send a signal to a process from within the kernel, it seems I just call
 >
 > psignal(pid, signo)
 >
 > - is this right?
 >

More or less.  I think trapsignal may be more correct.

 > Thanks very much for your information - looks like a little check in
 > machine_check() in interrupt.c will do pretty much what I want - perhaps
 > I'll make sure that my hack only works on processes whose name starts
 > with 'X' or something just to be safe....

Good luck to you!!

Drew

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-alpha" in the body of the message
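
[Editorial sketch, not part of the original message.] The guard Fred describes for machine_check() comes down to a small decision: deliver a signal instead of panicking only when the machine check interrupted user-mode code and the process name starts with 'X'. The fragment below models just that decision in plain user-space C so it can be compiled and tried outside the kernel. The function name mcheck_should_signal and both of its parameters are hypothetical, invented for illustration. How the kernel would actually determine user mode, obtain the process name, or deliver the signal (psignal versus trapsignal) is not shown and would need to be checked against the real source.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of the guard discussed in the thread.
 * This is NOT kernel code and does not use any real kernel interface.
 *
 *   from_usermode: nonzero if the machine check interrupted user-mode
 *                  code (assumed to be known by the caller)
 *   comm:          process name as printed in the panic output,
 *                  for example "XFree86"
 *
 * Returns true when the hack would send a signal (such as SIGBUS) to the
 * process, false when the normal panic path should be kept.
 */
static bool
mcheck_should_signal(int from_usermode, const char *comm)
{
	/* Machine checks taken in kernel mode keep the existing behaviour. */
	if (!from_usermode)
		return false;

	/* Be conservative when the name is missing. */
	if (comm == NULL)
		return false;

	/* Restrict the hack to processes whose name starts with 'X'. */
	return comm[0] == 'X';
}
```

With the values from the log above (a user-space pc and comm = XFree86), this predicate would return true. Any kernel-mode machine check, or one in a process with a different name, would still take the panic path.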