From owner-freebsd-hackers  Sun Jun  1 22:05:08 1997
Return-Path: <owner-hackers>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.5/8.8.5) id WAA26965
          for hackers-outgoing; Sun, 1 Jun 1997 22:05:08 -0700 (PDT)
Received: from genesis.atrad.adelaide.edu.au (genesis.atrad.adelaide.edu.au [129.127.96.120])
          by hub.freebsd.org (8.8.5/8.8.5) with ESMTP id WAA26959
          for <hackers@freebsd.org>; Sun, 1 Jun 1997 22:05:03 -0700 (PDT)
Received: (from msmith@localhost) by genesis.atrad.adelaide.edu.au (8.8.5/8.7.3) id OAA18291 for hackers@freebsd.org; Mon, 2 Jun 1997 14:34:54 +0930 (CST)
From: Michael Smith <msmith@atrad.adelaide.edu.au>
Message-Id: <199706020504.OAA18291@genesis.atrad.adelaide.edu.au>
Subject: weird scheduler crash (2.2)
To: hackers@freebsd.org
Date: Mon, 2 Jun 1997 14:34:54 +0930 (CST)
X-Mailer: ELM [version 2.4ME+ PL28 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk


Hmm.  We've been trying for several weeks now to find a cause for the
occasional crashes we're seeing on our radar controllers.

We've finally managed to reproduce one here in the lab but, as luck
would have it, I can't make sense of its complaint:

Fatal trap 12: page fault while in kernel mode
fault virtual address	= 0x0
fault code		= supervisor write, page not present
instruction pointer	= 0x8:0xf01c8310
stack pointer		= 0x10:0xefbffd7c
frame pointer		= 0x10:0xefbffd8c
code segment		= base 0x0, limit 0xfffff, type 0x1b
			  DPL 0, PRES 1, DEF32 1, gran 1
processor eflags	= resume, IOPL = 3
current process		= 690 (exptd)
interrupt mask		= net, tty, bio
kernel: type 12 trap, code = 0
Stopped at	set_nort+0x25	movl	%eax,0(%ecx)
db> trace
set_nort(f0ca8a00) at set_nort+0x25
_selwakeup(f0204330) at _selwakeup+0x69
_logwakeup(2,efbffe48,5,0,efbffdf4) at _logwakeup+0x16
_printf(f01c8e2c,c,f01c871f,f01c8e25) at _printf+0x50
_trap_fatal(efbffe48,0,f0d0cc00,c,f0d20700) at _trap_fatal+0x5f
_trap_pfault(efbffe48,0,ffffffff,278,3) at _trap_pfault+0x11c
_trap(10,10,3,278,efbffe88) at _trap+0x2ab
calltrap() at calltrap+0x15
--- trap 0xc, eip = 0xf0117408, esp = 0xefbffe84, ebp = 0xefbffe88 ---
_unsleep(f0d0cc00) at _unsleep+0x48
_selwakeup(f0214348) at _selwakeup+0x76
_mdsiointr(0,10,f020f9dc,118,ffffffff) at _mdsiointr+0x184
_Xfastintr10(f020f9dc,118,f011cb84,b,f01f5748) at _Xfastintr10+0x17
_select(f0d0cc00,efbfff94,efbfff84) at _select+0x2e2
_syscall(27,27,4,4,efbf77d4) at _syscall+0x127
_Xsyscall() at _Xsyscall+0x35
--- syscall 0x5d, eip = 0x7c945, esp = 0xefbf7568, ebp = efbf77d4 ---

The kernel couldn't be convinced to do a dump either, so this is all I
have.  It looks like the driver (mdsio) took an interrupt during a
select syscall which in turn resulted in the driver trying to wake the
selecting process up again.

Is the set_nort stuff relevant?  Is this, perhaps, a screwup in the
select code in (my) mdsio driver?  If so, how?

select+0x2e2 is 0x9ee in (this) sys_generic.o, which looks like:

 617:../../kern/sys_generic.c ****      error = tsleep((caddr_t)&selwait, PSOCK | PCATCH, "select", timo);
 1910                           .stabd 68,0,617
 1911 09d7 FF75D8               pushl -40(%ebp)
 1912 09da 68040700             pushl $LC0
 1912      00
 1913 09df 68180100             pushl $280
 1913      00
 1914 09e4 68000000             pushl $_selwait
 1914      00
 1915 09e9 E812F6FF             call _tsleep
 1915      FF
 1916 09ee 89C3                 movl %eax,%ebx

so I think it was actually asleep at the time.

-- 
]] Mike Smith, Software Engineer        msmith@gsoft.com.au             [[
]] Genesis Software                     genesis@gsoft.com.au            [[
]] High-speed data acquisition and      (GSM mobile)     0411-222-496   [[
]] realtime instrument control.         (ph)          +61-8-8267-3493   [[
]] Unix hardware collector.             "Where are your PEZ?" The Tick  [[