From owner-freebsd-hackers  Wed Jul  5 09:29:10 1995
Return-Path: hackers-owner
Message-Id: <199507051552.AA06352@FileServ1.MI.Uni-Koeln.DE>
From: esser@zpr.uni-koeln.de (Stefan Esser)
Date: Wed, 5 Jul 1995 17:52:34 +0200
X-Mailer: Mail User's Shell (7.2.5 10/14/92)
To: Voradesh Yenbut <yenbut@cs.washington.edu>
Subject: Re: One cause of 2.05R instability found
Cc: hackers@freebsd.org
Sender: hackers-owner@freebsd.org
Precedence: bulk

Regarding problems with panics:

	Fatal trap 12: page fault while in kernel mode

Is this an isolated case?

Who else (other than Voradesh Yenbut <yenbut@cs.washington.edu>)
sees this?


} A few days ago, I committed a 90MHz pentium system running 2.05R to be
} a news server. The system was not stable at all.  It kept on crashing
} within 2 hours with "Fatal trap 12: page fault while in kernel mode"
} and fault code "supervisor read, page not present".  The crash always
} happened at the same instruction pointer, i.e., ncr_complete+195
} (as reported by gdb; I don't have the hex number with me) in ncr.c.
} 
} In ncr.c, ncr_complete+195 is at the following if statement:
} 
}         if (DEBUG_FLAGS & DEBUG_TINY)
}                 printf ("CCB=%x STAT=%x/%x\n", (unsigned)cp & 0xfff,
}                         cp->host_status,cp->scsi_status);

No, sorry, that statement is certainly not at ncr_complete+195,
unless you changed the sources or configured NCR debugging
in your kernel config file, e.g. by:

options		"SCSI_DEBUG_FLAGS=0x80"


} where DEBUG_FLAGS is ncr_debug declared in ncr.c as
} 
}         static int ncr_debug = SCSI_DEBUG_FLAGS;

No, not really ... The complete code is:

#ifdef SCSI_DEBUG_FLAGS
	#define DEBUG_FLAGS ncr_debug
#else /* SCSI_DEBUG_FLAGS */
	#define SCSI_DEBUG_FLAGS	0
	#define DEBUG_FLAGS	0
#endif /* SCSI_DEBUG_FLAGS */

and SCSI_DEBUG_FLAGS is undefined by default. This makes
DEBUG_FLAGS a constant zero, and GCC generates no code
at all for the if statement or the printf() ...
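
To illustrate the point about the constant zero, here is a minimal,
self-contained sketch of the same preprocessor pattern outside the
kernel. The value of DEBUG_TINY (0x80, inferred from the options line
above), the function names complete_debug() and self_check(), and the
return value are assumptions made for this example; they are not
taken from ncr.c.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed flag value for illustration only. */
#define DEBUG_TINY 0x0080

#ifdef SCSI_DEBUG_FLAGS
#define DEBUG_FLAGS ncr_debug
#else /* SCSI_DEBUG_FLAGS */
#define SCSI_DEBUG_FLAGS 0
#define DEBUG_FLAGS 0
#endif /* SCSI_DEBUG_FLAGS */

static int ncr_debug = SCSI_DEBUG_FLAGS;

/*
 * Hypothetical stand-in for the debug statement in ncr_complete().
 * Returns 1 if the debug line was printed, 0 otherwise, so that the
 * effect of the configuration can be observed by a caller.
 */
int complete_debug(unsigned ccb, unsigned host_status, unsigned scsi_status)
{
    int printed = 0;

    (void)ncr_debug; /* only referenced via DEBUG_FLAGS in debug builds */

    /* In a default build this condition is the constant (0 & 0x80). */
    if (DEBUG_FLAGS & DEBUG_TINY) {
        printf("CCB=%x STAT=%x/%x\n", ccb & 0xfff, host_status, scsi_status);
        printed = 1;
    }
    return printed;
}

/* Checks that the behaviour matches the compile-time configuration. */
void self_check(void)
{
    int expected = (SCSI_DEBUG_FLAGS & DEBUG_TINY) ? 1 : 0;

    assert(complete_debug(0x12345u, 0u, 0u) == expected);
}
```

Built without -DSCSI_DEBUG_FLAGS=..., the condition is a compile-time
constant and an optimizing compiler is free to drop the whole branch.
Built with -DSCSI_DEBUG_FLAGS=0x80, the test reads ncr_debug at run
time and the printf() is reached.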

} I commented out the if statement, rebuilt and installed the new
} kernel.  The system has been running fine with the new kernel for two
} days (though I still keep my fingers crossed).

Well, since there shouldn't have been any code generated
before, there shouldn't be any difference ...

The NCR code had not changed for many months until after
FreeBSD-2.0.5R was released, and I have no other reports of
"trap 12: page fault while in kernel mode" problems. So I
don't think this is a problem caused by the driver.

But I have to admit that a panic within some subroutine
generally points at some problem in close proximity ...


For further diagnosis, I need to know:

Did you change the sources or use any NCR-specific kernel
config file options?

How did you identify the suspected error location in ncr.c?

;	ncb_profile (np, cp);
	pushl %ecx
	pushl 8(%ebp)
	call _ncb_profile
	addl $8,%esp

;	if (DEBUG_FLAGS & DEBUG_TINY)
;		printf ("CCB=%x STAT=%x/%x\n", (unsigned)cp & 0xfff,
;			cp->host_status,cp->scsi_status);

;	xp = cp->xfer;
	movl 12(%ebp),%ecx
	movl 452(%ecx),%edi

;	cp->xfer = NULL;
	movl $0,452(%ecx)

All data structures should remain unchanged during the
execution of ncr_complete(), since they are locked in a
way that should also prevent simultaneous updates by the
NCR ...

	xp = cp->xfer;
	cp->xfer = NULL;
	tp = &np->target[xp->sc_link->target];
	lp = tp->lp[xp->sc_link->lun];

ncr_complete + 195:
	if (cp->parity_status) {
		...
	}

At address ncr_complete + 195, there is the test of
cp->parity_status. I'd be rather surprised if the
access to cp->xfer (four lines above) always
succeeded and the page then got lost (reproducibly)
before the access to cp->parity_status ...

The address of cp->parity_status is a few bytes
before cp->xfer, and I really can't see how the
memory allocated for CCBs at driver startup could
get unmapped from kernel VM ...
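
The "few bytes apart" argument can be checked mechanically. Below is
a rough sketch using offsetof(); struct ccb_sketch is a hypothetical,
heavily reduced stand-in and NOT the real CCB layout from ncr.c. The
padding size is chosen only so that, on a 32-bit target, xfer would
land at offset 452 as in the listing above. The helper names
field_gap() and same_page() are likewise made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define I386_PAGE_SIZE 4096u   /* page size on i386 */

struct xfer_sketch;            /* opaque; only a pointer to it is needed */

/* Hypothetical reduced CCB: field order and sizes are assumptions. */
struct ccb_sketch {
    unsigned char before[448];     /* placeholder for all earlier fields */
    unsigned char parity_status;
    unsigned char host_status;
    unsigned char scsi_status;
    struct xfer_sketch *xfer;      /* aligned to pointer size */
};

/* Distance in bytes from parity_status to xfer within one CCB. */
size_t field_gap(void)
{
    assert(offsetof(struct ccb_sketch, parity_status) <
           offsetof(struct ccb_sketch, xfer));
    return offsetof(struct ccb_sketch, xfer) -
           offsetof(struct ccb_sketch, parity_status);
}

/* Non-zero if both addresses fall within the same 4 KB page. */
int same_page(uintptr_t a, uintptr_t b)
{
    return (a / I386_PAGE_SIZE) == (b / I386_PAGE_SIZE);
}
```

For a CCB at address base, same_page(base + offsetof(struct ccb_sketch,
parity_status), base + offsetof(struct ccb_sketch, xfer)) is true
unless a page boundary happens to fall between the two fields. With a
gap of only a few bytes, a fault on one field but not the other would
require exactly that alignment.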

I assume that the address printed by the panic message
points at the failed instruction, not at the one following
it. Is this true for this trap?
(I don't have an i486 manual here, but otherwise the failed
instruction couldn't be restarted, so this seems the
only possibility.)

It might help to send a stack trace obtained using
the kernel debugger ...


Is anybody else seeing this kind of failure?

STefan

-- 
 Stefan Esser				Internet:	<se@ZPR.Uni-Koeln.DE>
 Zentrum fuer Paralleles Rechnen	Tel:		+49 221 4706021
 Universitaet zu Koeln			FAX:		+49 221 4705160
 Weyertal 80
 50931 Koeln