From owner-freebsd-alpha  Thu Oct 24 12: 1:36 2002
Delivered-To: freebsd-alpha@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A433737B401
	for <freebsd-alpha@FreeBSD.ORG>; Thu, 24 Oct 2002 12:01:30 -0700 (PDT)
Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 0B26443E6A
	for <freebsd-alpha@FreeBSD.ORG>; Thu, 24 Oct 2002 12:01:30 -0700 (PDT)
	(envelope-from gallatin@cs.duke.edu)
Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30])
	by duke.cs.duke.edu (8.9.3/8.9.3) with ESMTP id PAA21309;
	Thu, 24 Oct 2002 15:01:29 -0400 (EDT)
Received: (from gallatin@localhost)
	by grasshopper.cs.duke.edu (8.11.6/8.9.1) id g9OJ0x311005;
	Thu, 24 Oct 2002 15:00:59 -0400 (EDT)
	(envelope-from gallatin@cs.duke.edu)
From: Andrew Gallatin <gallatin@cs.duke.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <15800.17259.397652.862956@grasshopper.cs.duke.edu>
Date: Thu, 24 Oct 2002 15:00:59 -0400 (EDT)
To: Fred Clift <fclift@verio.net>
Cc: <freebsd-alpha@FreeBSD.ORG>
Subject: Re: debugging around machine-checks...
In-Reply-To: <20021023113324.U98807-100000@vespa.dmz.orem.verio.net>
References: <15798.56033.844389.549256@grasshopper.cs.duke.edu>
	<20021023113324.U98807-100000@vespa.dmz.orem.verio.net>
X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid
Sender: owner-freebsd-alpha@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-alpha.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-alpha>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-alpha>
X-Loop: FreeBSD.org


Fred Clift writes:
 > by hand I run dumpon -v /dev/da0b (which is my swap partition, twice what
 > I have of ram in size)
 > 
 > and then I do my fiddling with XFree86 that gives me the machine-check and
 > I end up at the SRM prompt.  At this point, I know that just booting will
 > fail.  I have to power-cycle the box and when it comes back up, savecore
 > either doesn't find anything, or isn't being run by the rc scripts.  Once
 > I get a chance to log in /var/crash has only minfree in it...
 > 

That *should* work.

 > Should I be doing something else?
 > 
 > I just looked in /var/log/messages and saw no evidence of crashdumps being
 > written (i.e. dumping to.... or dump 254 253 252 251...  etc).

If you power-cycle, the message buffer is lost.

When I crashed X on an old Miata, half the time I'd get a
'machine check in pal mode', which never even gets caught by the
OS.

However, if you're seeing the message below, I do not understand
why you're not getting a crashdump.

In any case, since the problem is probably with the X server (based on
the message below), a crashdump would not help you.


 > 
 > >
 > > Can't you use the program counter from the panic output as a start?
 > > If it's in the X server, there should be a PC from userspace.
 > > (see disclaimer below)
 > >
 > 
 > So can you interpret this for me then - honestly I just don't know what all
 > the fields represent -- I should probably just go read the source code and
 > see :)
 > 
 > Oct  8 06:42:24 liron /kernel: unexpected machine check:
 > Oct  8 06:42:24 liron /kernel:
 > Oct  8 06:42:24 liron /kernel: mces    = 0x1
 > Oct  8 06:42:24 liron /kernel: vector  = 0x660
 > Oct  8 06:42:24 liron /kernel: param   = 0xfffffc0000006068
 > Oct  8 06:42:24 liron /kernel: pc      = 0x1604006ac
 > Oct  8 06:42:24 liron /kernel: ra      = 0x12006cb10
 > Oct  8 06:42:24 liron /kernel: curproc = 0xfffffe0009910200
 > Oct  8 06:42:24 liron /kernel: pid = 90765, comm = XFree86
 > Oct  8 06:42:24 liron /kernel:
 > Oct  8 06:42:24 liron /kernel: panic: machine check
 > 
 > 
 > The program counter is pc? so I should be able to, with gdb and a
 > debug-version of XFree86, figure out what code this is?

Yes, except it's in a shared lib or other dynamically loaded text.
I don't know how you could debug that without a coredump.
The ra (return address) is at least somewhere in the main text
of the program (not a shared lib).
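
To make the pc/ra distinction concrete, here is a minimal sketch in C
of how one might sort the addresses from the panic output.  The two
base addresses are assumptions read off this one report (ra at
0x12006cb10 in the main text, pc at 0x1604006ac in dynamically loaded
text), not documented constants, and classify_user_addr() is a
hypothetical helper, not anything in the tree.

```c
#include <stdint.h>

/*
 * Assumed layout, for illustration only: the values below are guesses
 * taken from the addresses in this one panic report, not from any
 * documented memory map.
 */
#define ASSUMED_MAIN_TEXT_BASE 0x120000000ULL /* assumption: main program text */
#define ASSUMED_DYNAMIC_BASE   0x160000000ULL /* assumption: shared libs and other loaded text */

enum addr_region {
    REGION_UNKNOWN,       /* kernel address, or below the assumed text base */
    REGION_MAIN_TEXT,     /* somewhere in the main program text */
    REGION_DYNAMIC_TEXT   /* shared lib or other dynamically loaded text */
};

/* Hypothetical helper: guess which region a pc or ra value falls in. */
enum addr_region
classify_user_addr(uint64_t addr)
{
    /* Addresses with the top bit set (0xfffffc..., 0xfffffe...) are kernel. */
    if (addr >> 63)
        return REGION_UNKNOWN;
    if (addr >= ASSUMED_DYNAMIC_BASE)
        return REGION_DYNAMIC_TEXT;
    if (addr >= ASSUMED_MAIN_TEXT_BASE)
        return REGION_MAIN_TEXT;
    return REGION_UNKNOWN;
}
```

With those assumed bases, the pc from the report lands in the
dynamically loaded region and the ra lands in the main text, so the ra
is the one you can look up with gdb against the XFree86 binary itself.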

<...>

 > Your explanation is helpful, and perhaps I'll try your suggestion of
 > turning userland machine checks into sigbus or something  - I'm sure I'm
 > just begging for trouble here, but at least this isn't a production
 > machine that other people depend on :).
 > 
 > To send a signal to a process from within the kernel, it seems I just call
 > 
 > psignal(pid, signo)
 > 
 >  - is this right?
 > 

More or less.  I think trapsignal may be more correct.

 > Thanks very much for your information - looks like a little check in
 > machine_check() in interrupt.c will do pretty much what I want - perhaps
 > I'll make sure that my hack only works on processes whose name starts
 > with 'X' or something just to be safe....
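
Something along these lines is what I would picture for the check.
This is only a userland sketch of the decision, assuming the caller
already knows whether the machine check came from user mode;
mc_decide() and the mc_action names are made up for illustration and
are not existing kernel interfaces.  The actual signal delivery
(trapsignal or psignal with SIGBUS) is deliberately left out, since
its arguments should be checked against your own source tree.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: what the machine check handler should do. */
enum mc_action {
    MC_PANIC,           /* keep the existing behaviour */
    MC_SIGNAL_PROCESS   /* deliver a signal to the offending process instead */
};

/*
 * Hypothetical helper: decide whether a machine check may be turned
 * into a signal.  Only user-mode faults from a process whose name
 * starts with 'X' qualify; everything else still panics.
 */
enum mc_action
mc_decide(bool from_usermode, const char *comm)
{
    /* A machine check taken in kernel mode is never safe to ignore. */
    if (!from_usermode)
        return MC_PANIC;

    /* Be conservative: no name, or a name not starting with 'X', panics. */
    if (comm == NULL || comm[0] != 'X')
        return MC_PANIC;

    /* This is the branch where the kernel hack would send the signal. */
    return MC_SIGNAL_PROCESS;
}
```

For the report above, mc_decide(true, "XFree86") picks the signal
path, while any kernel-mode machine check or any other process name
still ends in the panic.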

Good luck to you!!

Drew
