From owner-aic7xxx  Thu Jul 30 08:16:28 1998
Date: Thu, 30 Jul 1998 11:13:45 -0400 (EDT)
From: "Robert G. Brown" <rgb@phy.duke.edu>
To: Mike Isely <isely@pobox.com>
cc: Chris Pirih <proverbs@wolfenet.com>,
        aic7xxx Mailing List <AIC7xxx@FreeBSD.ORG>
Subject: Re: Puzzle for Doug...
In-Reply-To: <Pine.BSI.3.95.980729173534.16529B-100000@nathan.enteract.com>
Message-ID: <Pine.LNX.3.96.980730105212.19553B-100000@ganesh.phy.duke.edu>

On Wed, 29 Jul 1998, Mike Isely wrote:

> On Wed, 29 Jul 1998, Chris Pirih wrote:
> 
> > At 11:27 AM 07/29/1998 -0400, Robert G. Brown wrote:
> > >What if the memory was remapped to a nonexistent location?  You write
> > >to the location and there is nothing there.  You read from the
> > >location and get nulls -- obviously a parity violation but now
> > >remapped to a location where the error comes up as NMI.  
> > 
> > I like that theory!  There doesn't even have to be a write, just a
> > read -- machines without parity/ECC memory would return 0xFF... or
> > whatever, and machines with parity/ECC would NMI.
> 
> I agree now.  Accessing memory-mapped devices in a parity-sensitive system
> has got to be a sticky business.  Obviously if you read a device register,
> there's no parity information.  Yet a memory access must be parity checked
> (assuming of course the system supports it).  Therefore the chipset has to
> somehow know on an access-by-access basis what is going to have parity and
> what isn't.  If the code accesses what it thinks is a device location but
> is in fact unmapped memory, what's the chipset going to do?  Treat as
> memory or device hardware?  If it (randomly) decides that's memory, well,
> there's a really good chance it's a parity error and then you get the NMI.
> 
> Note that in order for this theory to be right, such an access has to get
> past the system's page tables first.  The page tables usually stop such
> abuse with a page-not-present fault which is converted by the system into
> some kind of software error signal.  That's why random bad memory accesses
> usually result in SIGSEGV (for user mode) or a panic (for kernel mode). 
> So what's probably happened here is that something somewhere has decided
> there must be something at that bad address, set up the page table(s)
> appropriately, and punched through to the system bus, only to be rewarded
> by an NMI from a parity error because in fact there's nothing really there!

In a few minutes I'll go back downstairs to the computer room (where
all my "broken" systems currently live), regenerate my final results
from yesterday, and report on them.  As I said at the end of
yesterday, though, the crash occurs slightly differently depending on
the context.  During boot, when the page tables are not yet active (as
I recall; don't those start after built-in-device initialization but
before init?), I get the NMI.  Afterwards I just get a Data-Path Ram
Error or a parity violation, which sound like the kernel version of
SIGSEGV; remember this is all in protected mode.  This supports, a
bit, the idea that it may be a nothing-there kind of error.
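
To make the "nothing there" idea concrete, here is a minimal user-space
sketch of a per-byte parity check.  It ASSUMES odd parity with one
parity bit per byte; which scheme this chipset really uses, and what an
unmapped read actually returns on the data and parity lines, is not
established anywhere in this thread.  The function names are made up
for illustration.

```c
#include <stdint.h>

/* Count the set bits in one byte. */
static int ones(uint8_t b)
{
    int n = 0;
    while (b) {
        n += b & 1;
        b >>= 1;
    }
    return n;
}

/*
 * Odd parity (an assumption): the eight data bits plus the parity bit
 * must together contain an odd number of ones.
 */
static int odd_parity_ok(uint8_t data, int parity_bit)
{
    return ((ones(data) + (parity_bit & 1)) % 2) == 1;
}

/* Parity bit a well-behaved writer would store alongside `data`. */
static int odd_parity_bit(uint8_t data)
{
    return (ones(data) % 2) == 0;
}
```

Under that assumption, a read that comes back as all nulls (data 0x00,
parity bit 0) fails the check, while anything written through
odd_parity_bit() passes.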

Unfortunately, the actual error occurs in down(), which is part of the
scheduler itself and hence in the category of Things Mere Mortals Were
Not Meant To Know (or at least debug).  I also don't believe that the
scheduler itself is broken.  So, I am forced to conclude that the
error is occurring one step uphill, and only in code that is either:

  a) part of the uniform SCSI subsystem, i.e. drivers/scsi/scsi.c, but
is only triggered by something system specific and marginal, perhaps
U2W controllers in certain specific configurations.  In the particular
case at hand, the system is "unusual" in that the 7860 and 7890 share
an interrupt and are both very fast devices being actively hammered at
init time.  There is one point in the SCSI code (in internal_cmnd())
where I >>think<< semaphores are implicitly manipulated, and Leonard
states (note the NOTE comment):

    save_flags(flags);
    cli();
    /* Assign a unique nonzero serial_number. */
    if (++serial_number == 0) serial_number = 1;
    SCpnt->serial_number = serial_number;

    /*
     * We will wait MIN_RESET_DELAY clock ticks after the last reset so
     * we can avoid the drive not being ready.
     */
    timeout = host->last_reset + MIN_RESET_DELAY;
    if (jiffies < timeout) {
	int ticks_remaining = timeout - jiffies;
	/*
	 * NOTE: This may be executed from within an interrupt
	 * handler!  This is bad, but for now, it'll do.  The irq
	 * level of the interrupt handler has been masked out by the
	 * platform dependent interrupt handling code already, so the
	 * sti() here will not cause another call to the SCSI host's
	 * interrupt handler (assuming there is one irq-level per
	 * host).
	 */
	sti();
	while (--ticks_remaining >= 0) udelay(1000000/HZ);
	host->last_reset = jiffies - MIN_RESET_DELAY;
    }
    restore_flags(flags);
    
    update_timeout(SCpnt, SCpnt->timeout_per_command);
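
The serial-number and reset-delay arithmetic in the excerpt above can
be sketched in user space as follows.  This is only an illustration:
MIN_RESET_DELAY_TICKS and both function names are hypothetical, the
delay value is assumed, and the signed-difference comparison is a
wrap-safe variant swapped in here, NOT what the quoted code does (it
compares jiffies and timeout directly).

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's MIN_RESET_DELAY (value assumed). */
#define MIN_RESET_DELAY_TICKS 200u

/* Next serial number, skipping 0 on wraparound, as in the excerpt. */
static uint32_t next_serial(uint32_t serial)
{
    if (++serial == 0)
        serial = 1;
    return serial;
}

/*
 * Ticks still to wait after the last reset before issuing a command.
 * The unsigned subtraction followed by a signed interpretation keeps
 * the answer correct even if the tick counter has wrapped.
 */
static int32_t ticks_remaining(uint32_t now, uint32_t last_reset)
{
    uint32_t timeout = last_reset + MIN_RESET_DELAY_TICKS;
    int32_t diff = (int32_t)(timeout - now);

    return diff > 0 ? diff : 0;
}
```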

I'm wondering if what is going on is that a "late" interrupt is
generated by the 7860 on IRQ 10 at just the wrong time (i.e. when this
code is critically reentrant) and trips the "This is bad" part of this
code by NULL-ing the wait_queue part of the semaphore.  Then when the
down(&sem) command is finally issued, it falls on dead air
and...doom.  This is the kind of thing where tiny marginal differences
in response speed of e.g. the attached CD-ROM could be what makes the
difference between success and failure.  I should probably pull the
cable off of the CD-ROM in one of the affected boxes to see if having
an empty 7860 makes a difference.
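
The hypothesis can be modelled with a toy, single-threaded simulation.
Everything in it is a hypothetical stand-in (struct semaphore_sim,
late_irq(), issue_command(), down_sim()); none of it is the kernel's
real structure or code path.  It only shows the shape of the failure: a
handler that runs inside the window where interrupts are re-enabled
clobbers state that a later down() relies on.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct wait_queue { int dummy; };
struct semaphore_sim {
    int count;
    struct wait_queue *wait;
};

typedef void (*sim_irq_fn)(struct semaphore_sim *);

static int in_critical;       /* nonzero while issue_command() is running */
static int reentry_detected;  /* set if a handler ran inside that window */

/* A "late" interrupt that clobbers state it assumes nobody is using. */
static void late_irq(struct semaphore_sim *sem)
{
    if (in_critical)
        reentry_detected = 1;
    sem->wait = NULL;
}

/* Models the section that re-enables interrupts part-way through. */
static void issue_command(struct semaphore_sim *sem, sim_irq_fn pending)
{
    in_critical = 1;
    /* ... interrupts would be masked here (cli) ... */
    if (pending)
        pending(sem);         /* the sti() window: a pending IRQ fires */
    in_critical = 0;
}

/* Returns 0 on success, -1 if the wait queue vanished underneath us. */
static int down_sim(struct semaphore_sim *sem)
{
    if (sem->wait == NULL)
        return -1;
    sem->count--;
    return 0;
}
```

Calling issue_command() with no pending handler leaves down_sim()
working; passing late_irq makes reentry_detected nonzero and the
following down_sim() fail.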

Note that I am NOT sufficiently knowledgeable to be sure that this is
a reasonable hypothesis, and I haven't stuck printk's in here yet to
see if in fact it is reentrant.  On the other hand, I CAN'T stick
printk's in down() (I sort of think my screen wouldn't hold the result,
assuming that nothing in printk itself actually CALLS down(), which
would be even worse).  I suspect that I'm going to spend a couple of
hours refining what I know and then will refer this to a Higher Power;
e.g. Doug, Alan, Leonard, and Linus.
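
Since printk's inside down() are off the table, one alternative (a
suggestion for illustration only, not something tried on these boxes)
is to record events into a small in-memory ring buffer and print it
later from a safe context.  The sketch below is plain user-space C; the
names (trace_event, trace_get, TRACE_SLOTS) are hypothetical, and in
real interrupt context the update of trace_head would itself have to be
made safe against reentry.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 64u   /* power of two so the index can be masked */

struct trace_entry {
    uint32_t tick;    /* when it happened (e.g. a jiffies-like counter) */
    uint16_t event;   /* caller-defined event code */
    uint16_t arg;     /* caller-defined detail */
};

static struct trace_entry trace_buf[TRACE_SLOTS];
static uint32_t trace_head;   /* total entries ever recorded */

/* Record one event: a few stores, no locking, no I/O. */
static void trace_event(uint32_t tick, uint16_t event, uint16_t arg)
{
    struct trace_entry *e = &trace_buf[trace_head & (TRACE_SLOTS - 1u)];

    e->tick = tick;
    e->event = event;
    e->arg = arg;
    trace_head++;
}

/* Number of entries currently held (at most TRACE_SLOTS). */
static uint32_t trace_count(void)
{
    return trace_head < TRACE_SLOTS ? trace_head : TRACE_SLOTS;
}

/* Fetch the i-th oldest surviving entry (0 = oldest), or NULL. */
static const struct trace_entry *trace_get(uint32_t i)
{
    uint32_t n = trace_count();

    if (i >= n)
        return NULL;
    return &trace_buf[(trace_head - n + i) & (TRACE_SLOTS - 1u)];
}
```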

   rgb
    
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

