From owner-freebsd-alpha  Wed Oct 23 11:29:26 2002
Date: Wed, 23 Oct 2002 12:35:43 -0600 (MDT)
From: Fred Clift <fclift@verio.net>
X-X-Sender:  <fred@vespa.dmz.orem.verio.net>
To: Andrew Gallatin <gallatin@cs.duke.edu>
Cc: <freebsd-alpha@freebsd.org>
Subject: Re: debugging around machine-checks...
In-Reply-To: <15798.56033.844389.549256@grasshopper.cs.duke.edu>
Message-ID: <20021023113324.U98807-100000@vespa.dmz.orem.verio.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Wed, 23 Oct 2002, Andrew Gallatin wrote:


>
>  > that FreeBSD is instantaneously interrupted when a machine check happens
>  > and that I don't get crash-dumps.
>
> Hmm.. I haven't used a machine check generating alpha in a while, but
> from the code in interrupt.c, it looks like it *should* give you a
> crashdump.


Perhaps I'm just clueless - I build my kernel with the option

makeoptions     DEBUG=-g


(install, reboot)

By hand I run dumpon -v /dev/da0b (my swap partition, which is twice the
size of my RAM)

and then I do my fiddling with XFree86 that gives me the machine-check and
I end up at the SRM prompt.  At this point, I know that just booting will
fail.  I have to power-cycle the box and when it comes back up, savecore
either doesn't find anything, or isn't being run by the rc scripts.  Once
I get a chance to log in, /var/crash has only minfree in it...


Should I be doing something else?

I just looked in /var/log/messages and saw no evidence of crashdumps being
written (i.e. "dumping to...." or "dump 254 253 252 251...", etc.).


>
> Can't you use the program counter from the panic output as a start?
> If it's in the X server, there should be a PC from userspace.
> (see disclaimer below)
>

So can you interpret this for me, then?  Honestly, I just don't know what
all the fields represent.  I should probably just go read the source code
and see :)

Oct  8 06:42:24 liron /kernel: unexpected machine check:
Oct  8 06:42:24 liron /kernel:
Oct  8 06:42:24 liron /kernel: mces    = 0x1
Oct  8 06:42:24 liron /kernel: vector  = 0x660
Oct  8 06:42:24 liron /kernel: param   = 0xfffffc0000006068
Oct  8 06:42:24 liron /kernel: pc      = 0x1604006ac
Oct  8 06:42:24 liron /kernel: ra      = 0x12006cb10
Oct  8 06:42:24 liron /kernel: curproc = 0xfffffe0009910200
Oct  8 06:42:24 liron /kernel: pid = 90765, comm = XFree86
Oct  8 06:42:24 liron /kernel:
Oct  8 06:42:24 liron /kernel: panic: machine check


The program counter is pc?  So, with gdb and a debug version of XFree86,
I should be able to figure out what code this is?


>  >
>
> Look at alpha/alpha/interrupt.c:badaddr_read().
>
> If you're feeling really lucky, you could add code to send the
> appropriate signal (sigbus?) if the PC is in a userland app.
>
> The problem with this is that machine checks are somewhat
> asynchronous, and I'm not sure the PC at the time of the fault
> corresponds to the PC that actually caused the fault.
> (that's why there are so many memory barriers all over the pci probing
> and badaddr code).


Your explanation is helpful, and perhaps I'll try your suggestion of
turning userland machine checks into sigbus or something  - I'm sure I'm
just begging for trouble here, but at least this isn't a production
machine that other people depend on :).

To send a signal to a process from within the kernel, it seems I just call

psignal(pid, signo)

 - is this right?


Thanks very much for your information - looks like a little check in
machine_check() in interrupt.c will do pretty much what I want - perhaps
I'll make sure that my hack only works on processes whose names start
with 'X' or something just to be safe....


Fred


--
Fred Clift - fclift@verio.net -- Remember: If brute
force doesn't work, you're just not using enough.

