From owner-freebsd-hackers Wed Jul 5 09:29:10 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.10/8.6.6) id JAA28242 for hackers-outgoing; Wed, 5 Jul 1995 09:29:10 -0700
Received: from FileServ1.MI.Uni-Koeln.DE (FileServ1.MI.Uni-Koeln.DE [134.95.212.1]) by freefall.cdrom.com (8.6.10/8.6.6) with SMTP id JAA28236 for ; Wed, 5 Jul 1995 09:28:57 -0700
Received: by FileServ1.MI.Uni-Koeln.DE id AA06352 (5.67b/IDA-1.5); Wed, 5 Jul 1995 17:52:34 +0200
Message-Id: <199507051552.AA06352@FileServ1.MI.Uni-Koeln.DE>
From: esser@zpr.uni-koeln.de (Stefan Esser)
Date: Wed, 5 Jul 1995 17:52:34 +0200
X-Mailer: Mail User's Shell (7.2.5 10/14/92)
To: Voradesh Yenbut
Subject: Re: One cause of 2.05R instability found
Cc: hackers@freebsd.org
Sender: hackers-owner@freebsd.org
Precedence: bulk

Regarding problems with panics:

	Fatal trap 12: page fault while in kernel mode

Is this a single case? Who else (other than Voradesh Yenbut) sees this?

} A few days ago, I committed a 90MHz pentium system running 2.05R to be
} a news server. The system was not stable at all. It kept on crashing
} within 2 hours with "Fatal trap 12: page fault while in kernel mode"
} and fault code "supervisor read, page not present". The crash always
} happened at the same instruction pointer, i.e., ncr_complete+195
} (as reported by gdb; I don't have the hex number with me) in ncr.c.
}
} In ncr.c, ncr_complete+195 is at the following if statement:
}
}	if (DEBUG_FLAGS & DEBUG_TINY)
}		printf ("CCB=%x STAT=%x/%x\n", (unsigned)cp & 0xfff,
}			cp->host_status,cp->scsi_status);

No, sorry, this statement is definitely not at ncr_complete+195,
unless you changed the sources or configured NCR debugging in your
kernel config file, e.g. with:

	options "SCSI_DEBUG_FLAGS=0x80"

} where DEBUG_FLAGS is ncr_debug declared in ncr.c as
}
}	static int ncr_debug = SCSI_DEBUG_FLAGS;

No, not really ...
The complete code is:

	#ifdef SCSI_DEBUG_FLAGS
	#define DEBUG_FLAGS ncr_debug
	#else /* SCSI_DEBUG_FLAGS */
	#define SCSI_DEBUG_FLAGS 0
	#define DEBUG_FLAGS 0
	#endif /* SCSI_DEBUG_FLAGS */

and SCSI_DEBUG_FLAGS is undefined by default. This makes DEBUG_FLAGS
a constant zero, and GCC generates no code at all for the if statement
or the printf() ...

} I commented out the if statement, rebuilt and installed the new
} kernel. The system has been running fine with the new kernel for two
} days (though I still keep my fingers crossed).

Well, since no code should have been generated for it before, removing
it shouldn't make any difference ...

The NCR code did not change for many months until after FreeBSD-2.0.5R
was released, and I don't have any other report of "trap 12: page fault
while in kernel mode" problems. So I don't think this is a problem
caused by the driver.

But I have to admit that a panic within some subroutine generally
points at a problem in close proximity ...

For further diagnosis, I need to know:

Did you change the sources or use any NCR-specific kernel config file
options?

How did you identify the suspected error location in ncr.c?

	;	ncb_profile (np, cp);
		pushl %ecx
		pushl 8(%ebp)
		call _ncb_profile
		addl $8,%esp
	;	if (DEBUG_FLAGS & DEBUG_TINY)
	;		printf ("CCB=%x STAT=%x/%x\n", (unsigned)cp & 0xfff,
	;			cp->host_status,cp->scsi_status);
	;	xp = cp->xfer;
		movl 12(%ebp),%ecx
		movl 452(%ecx),%edi
	;	cp->xfer = NULL;
		movl $0,452(%ecx)

All data structures should remain unchanged during the execution of
ncr_complete(), since they are locked in a way that should also prevent
simultaneous updates by the NCR ...

		xp = cp->xfer;
		cp->xfer = NULL;
		tp = &np->target[xp->sc_link->target];
		lp = tp->lp[xp->sc_link->lun];

	ncr_complete + 195:
		if (cp->parity_status) {
			...
		}

At address ncr_complete + 195 is the test of cp->parity_status.
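As a side note, the effect of the DEBUG_FLAGS arrangement quoted above
can be tried in isolation. What follows is only a minimal sketch, not
driver code: report_status() is a hypothetical stand-in for the debug
statement in ncr_complete(), and the value 0x80 for DEBUG_TINY is an
assumption based on the config option mentioned earlier. Where ncr_debug
is declared relative to the #ifdef is also an assumption made to keep
the sketch warning-free.

```c
#include <stdio.h>

/* Assumed value, chosen to match the "SCSI_DEBUG_FLAGS=0x80" example. */
#define DEBUG_TINY 0x0080

/* Same preprocessor arrangement as quoted from the driver source. */
#ifdef SCSI_DEBUG_FLAGS
static int ncr_debug = SCSI_DEBUG_FLAGS;   /* placement is an assumption */
#define DEBUG_FLAGS ncr_debug
#else /* SCSI_DEBUG_FLAGS */
#define SCSI_DEBUG_FLAGS 0
#define DEBUG_FLAGS 0
#endif /* SCSI_DEBUG_FLAGS */

/*
 * Hypothetical stand-in for the debug statement in ncr_complete().
 * Returns the number of characters printed, or 0 if the debug
 * branch was not taken.
 */
int report_status(unsigned ccb, int host_status, int scsi_status)
{
    /* With SCSI_DEBUG_FLAGS undefined this condition is the constant
     * expression (0 & 0x0080), so the branch is dead code. */
    if (DEBUG_FLAGS & DEBUG_TINY)
        return printf("CCB=%x STAT=%x/%x\n", ccb & 0xfff,
                      (unsigned)host_status, (unsigned)scsi_status);
    return 0;
}
```

Compiling this file to assembly (for example with cc -O -S) once as is
and once with -DSCSI_DEBUG_FLAGS=0x80 should show the call to printf
only in the second case, which is the point made above about the
default configuration.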
I'd be rather surprised if the access to cp->xfer (four lines above)
always succeeded and the page then got lost (reproducibly) before the
access to cp->parity_status ...

The address of cp->parity_status is a few bytes before cp->xfer, and I
really can't see how the memory allocated for CCBs at driver startup
could get unmapped from kernel VM ...

I assume that the address printed by the panic message points at the
failed instruction, not behind that instruction. Is this true for this
trap? (I don't have an i486 manual here, but otherwise the failed
instruction couldn't be restarted, so this seems the only possibility.)

It might help to send a stack trace obtained using the kernel
debugger ...

Is there anybody else seeing that kind of failure?

STefan

-- 
 Stefan Esser				Internet:
 Zentrum fuer Paralleles Rechnen		Tel:	+49 221 4706021
 Universitaet zu Koeln			FAX:	+49 221 4705160
 Weyertal 80
 50931 Koeln