Date: Thu, 30 Jul 1998 11:13:45 -0400 (EDT)
From: "Robert G. Brown"
To: Mike Isely
cc: Chris Pirih, aic7xxx Mailing List
Subject: Re: Puzzle for Doug...

On Wed, 29 Jul 1998, Mike Isely wrote:

> On Wed, 29 Jul 1998, Chris Pirih wrote:
>
> > At 11:27 AM 07/29/1998 -0400, Robert G. Brown wrote:
> > >What if the memory was remapped to a nonexistent location?  You write
> > >to the location and there is nothing there.  You read from the
> > >location and get nulls -- obviously a parity violation but now
> > >remapped to a location where the error comes up as NMI.
> >
> > I like that theory!  There doesn't even have to be a write, just a
> > read -- machines without parity/ECC memory would return 0xFF... or
> > whatever, and machines with parity/ECC would NMI.
>
> I agree now.  Accessing memory-mapped devices in a parity-sensitive system
> has got to be a sticky business.  Obviously if you read a device register,
> there's no parity information.  Yet a memory access must be parity checked
> (assuming of course the system supports it).  Therefore the chipset has to
> somehow know on an access-by-access basis what is going to have parity and
> what isn't.  If the code accesses what it thinks is a device location but
> is in fact unmapped memory, what's the chipset going to do?  Treat as
> memory or device hardware?  If it (randomly) decides that's memory well
> there's a really good chance it's a parity error and then you get the NMI.
>
> Note that in order for this theory to be right, such an access has to get
> past the system's page tables first.  The page tables usually stop such
> abuse with a page-not-present fault which is converted by the system into
> some kind of software error signal.  That's why random bad memory accesses
> usually result in SIGSEGV (for user mode) or a panic (for kernel mode).
> So what's probably happened here is that something somewhere has decided
> there must be something at that bad address, set up the page table(s)
> appropriately, and punched through to the system bus, only to be rewarded
> by a NMI-caused parity error because in fact there's nothing really there!

In a few minutes I'll go back downstairs to the computer room (where all
my "broken" systems currently live) and regenerate my final results from
yesterday and report on them, but (as I said at the end of yesterday)
the crash occurs slightly differently depending on the context -- during
boot, when the page tables are not yet active (as I recall -- don't
those start after built-in-device initialization but before init?) I get
the NMI; afterwards I just get a Data-Path Ram Error or a parity
violation, which sound like the kernel version of SIGSEGV -- remember
this is all in protected mode.  This supports, a bit, the idea that it
may be a nothing-there kind of error.

Unfortunately, the actual error occurs in down(), which is part of the
scheduler itself and hence in the category of Things Mere Mortals Were
Not Meant To Know (or at least debug).  I also don't believe that the
scheduler itself is broken.  So, I am forced to conclude that the error
is occurring one step uphill, and only in code that is either:

  a) part of the uniform scsi subsystem, i.e. drivers/scsi/scsi.c, but
     is only tweaked by something system specific and marginal, perhaps
     U2W controllers in certain specific configurations.

In the particular case at hand, the system is "unusual" in that the 7860
and 7890 share an interrupt and are both very fast devices being
actively hammered at init time.  There is one point in the SCSI code (in
internal_cmnd) where I >>think<< semaphores are implicitly manipulated
and Leonard states (note the NOTE comment):

    save_flags(flags);
    cli();
    /* Assign a unique nonzero serial_number. */
    if (++serial_number == 0)
        serial_number = 1;
    SCpnt->serial_number = serial_number;

    /*
     * We will wait MIN_RESET_DELAY clock ticks after the last reset so
     * we can avoid the drive not being ready.
     */
    timeout = host->last_reset + MIN_RESET_DELAY;
    if (jiffies < timeout) {
        int ticks_remaining = timeout - jiffies;
        /*
         * NOTE: This may be executed from within an interrupt
         * handler!  This is bad, but for now, it'll do.  The irq
         * level of the interrupt handler has been masked out by the
         * platform dependent interrupt handling code already, so the
         * sti() here will not cause another call to the SCSI host's
         * interrupt handler (assuming there is one irq-level per
         * host).
         */
        sti();
        while (--ticks_remaining >= 0)
            udelay(1000000/HZ);
        host->last_reset = jiffies - MIN_RESET_DELAY;
    }
    restore_flags(flags);
    update_timeout(SCpnt, SCpnt->timeout_per_command);

I'm wondering if what is going on is that a "late" interrupt is
generated by the 7860 on IRQ 10 (at just the wrong time to be critically
reentrant) and tweaks the "This is bad" part of this code by NULL-ing
the wait_queue part of the semaphore.  Then when the down(&sem) command
is finally issued, it falls on dead air and...doom.

This is the kind of thing where tiny marginal differences in response
speed of e.g. the attached CD-ROM could be what makes the difference
between success and failure.  I should probably pull the cable off of
the CD-ROM in one of the affected boxes to see if having an empty 7860
makes a difference.

Note that I am NOT sufficiently knowledgeable to be sure that this is a
reasonable hypothesis, and I haven't stuck printk's in here yet to see
if in fact it is reentrant.  On the other hand, I CAN'T stick printk's
in down() (I sort of think my screen wouldn't hold the result, assuming
that nothing in printk itself actually CALLS down(), which would be even
worse).

I suspect that I'm going to spend a couple of hours refining what I know
and then will refer this to a Higher Power; e.g. Doug, Alan, Leonard,
and Linus.

   rgb

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu
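
As a side note to the SCSI code fragment quoted above: the two pieces of
arithmetic in it (handing out a serial number that is never zero, and
working out how many ticks are left of the post-reset delay) can be
tried out in user space.  What follows is only a rough sketch under
stated assumptions.  The HZ and MIN_RESET_DELAY values are stand-ins
picked for illustration, next_serial() and ticks_to_wait() are
hypothetical helper names that do not exist in the kernel, and none of
the interrupt masking (cli/sti) or the suspected reentrancy is modelled.

```c
/* Assumed stand-in values, not taken from any kernel header. */
#define HZ 100
#define MIN_RESET_DELAY (2 * HZ)

/* Counter playing the role of the driver-wide serial_number. */
static unsigned long serial_number = 0;

/*
 * Hypothetical helper: return the next serial number, skipping 0 when
 * the counter wraps, as the quoted fragment does.
 */
unsigned long next_serial(void)
{
    if (++serial_number == 0)
        serial_number = 1;
    return serial_number;
}

/*
 * Hypothetical helper: number of ticks still to wait so that at least
 * MIN_RESET_DELAY ticks separate the last reset from the next command.
 * Returns 0 when the delay has already elapsed.
 */
long ticks_to_wait(unsigned long jiffies, unsigned long last_reset)
{
    unsigned long timeout = last_reset + MIN_RESET_DELAY;

    if (jiffies < timeout)
        return (long)(timeout - jiffies);
    return 0;
}
```

With these assumed values, a command issued 50 ticks after a reset would
have to wait another 150 ticks, which in the real fragment is spent in
the udelay() loop with interrupts re-enabled.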