Date:      Wed, 29 Jul 1998 11:27:38 -0400 (EDT)
From:      "Robert G. Brown" <rgb@phy.duke.edu>
To:        Doug Ledford <dledford@dialnet.net>
Cc:        Jess Johnson <jester@feeding.frenzy.com>, aic7xxx Mailing List <AIC7xxx@FreeBSD.ORG>
Subject:   Re: Puzzle for Doug...
Message-ID:  <Pine.LNX.3.96.980729101924.6958C-100000@ganesh.phy.duke.edu>
In-Reply-To: <35BE5247.16EC040F@dialnet.net>

On Tue, 28 Jul 1998, Doug Ledford wrote:

> Robert G. Brown wrote:
> 
> > Well, I saw the NMI error pop up on ANOTHER of the five systems
> > overnight, although this one recovered.  I have to say that I
> > seriously doubt that 3/5 of Dell's delivered systems have bad memory,
> > especially given that I've run these systems diskless for around 3
> > weeks now "flawlessly" under heavy load of big-memory applications.  A
> > memory problem with any significant probability of occurring (which
> > clearly must be the case, given that it happens at boot time in low
> > memory) would almost certainly have created havoc -- repeated kernel
> > crashes, bad answers, segment violation errors as loop/jump addresses
> > were corrupted -- none of which have been observed.  The phenomenon
> > thus far seems confined to the aic7xxx driver only, and more so to the
> > 7890 device -- I ran the old aic7xxx driver in diskless kernels for a
> > week or so (the one that found the 7860 but not the 7890) and observed
> > none of this.
> 
> When you ran the older driver in these machines, was it doing anything or just
> sitting there idle?  Secondly, what *speed* was it doing something at?

It was sitting there idle.  I just meant that the system actually
completed a boot with the 7860 identified and initialized and with its
one attached device (a NEC CD-ROM) identified and installed as
/dev/sr0.  Since the attached device was a CD-ROM on a 50 pin cable,
I'm sure that it was running at 10 MHz although I don't recall
checking.

The point I was making is that even now the 7860 appears to come up
correctly and return its devices correctly.  The primary crashpoint
seems to be the step where the 7890 identifies its attached devices.

> OK...first off, there are a few things that are true.  
> 
> 1) You're getting NMI interrupts.  Unless the people at Dell hooked the
> aic7xxx chipset into the NMI interrupt line by accident, the aic7xxx
> hardware *can't* give this interrupt.  Only the actual PCI chipset is
> typically hooked into the NMI.  Even if the aic7xxx driver is writing
> garbage over your entire memory space, it would still be writing 32 bit (or
> whatever) values to RAM and the system chipset would be generating the
> parity/ECC code during the write.  When reading that value back from RAM, if
> the parity/ECC code for a particular RAM location doesn't match the data in
> that RAM location, you get an NMI.  The whole process is contained and
> localized in the 440BX chipset in your case.  We could do anything we
> wanted, scribble on kernel memory all day long, and do all sorts of other
> nasty stuff and not be capable of causing an NMI interrupt.  Only an error
> between that 440BX and the SDRAM should ever cause an NMI, or even be
> capable of it, unless something reprograms the local APIC or the IO-APIC
> under 2.1.x SMP.

What if the memory was remapped to a nonexistent location?  You write
to the location and there is nothing there.  You read from the
location and get nulls -- obviously a parity violation but now
remapped to a location where the error comes up as NMI.  Not unlike
the parity error that I was getting writing to the PCI bus with
FAILDIS not set last week...
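
To make the parity argument concrete, here is a minimal Python sketch of a
memory controller that generates a parity bit on every write and checks it
on every read.  Everything in it (the ParityRam class, ParityError, one
parity bit per 32-bit word) is a hypothetical simplification for
illustration, not a model of the actual 440BX logic.  The point it shows
is Doug's: garbage written through the controller still reads back clean,
and only a bit that changes inside the RAM itself trips the check.

```python
class ParityError(Exception):
    """Stands in for the NMI a real chipset would raise (assumption)."""


def parity_bit(word):
    """Return the even-parity bit for a 32-bit word (1 if odd number of 1s)."""
    return bin(word & 0xFFFFFFFF).count("1") & 1


class ParityRam:
    """Toy RAM: the 'chipset' stores a parity bit alongside each word."""

    def __init__(self):
        self.cells = {}

    def write(self, addr, word):
        # Parity is generated by the controller at write time, so any value
        # written this way, sensible or garbage, is stored consistently.
        word &= 0xFFFFFFFF
        self.cells[addr] = (word, parity_bit(word))

    def read(self, addr):
        word, stored = self.cells[addr]
        if parity_bit(word) != stored:
            raise ParityError("parity mismatch at 0x%x" % addr)
        return word

    def flip_bit(self, addr, bit):
        # Simulates a fault inside the RAM: the data changes, but the
        # stored parity bit does not.
        word, stored = self.cells[addr]
        self.cells[addr] = (word ^ (1 << bit), stored)
```

Writing any pattern with write() and reading it back with read() never
raises; calling flip_bit() between the two does.  That is why a driver
scribbling over memory is not, by itself, expected to produce a parity NMI.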

> 2) Dell are Microsoft Lackeys, but that doesn't mean they are bad system
> builders, just that their systems are tweaked for Microsoft products.  This
> includes things like RAM timings.  The systems you bought (if I remember
> correctly) are held by Dell as super duper NT servers.  Most likely, the
> machines are tweaked for NT usage.  It very well may be that the combination
> of DMA loads and CPU memory loads under linux are too much for the NT
> tweaked settings to sustain.  It may also be that you are getting hit by
> another problem I've *heard* about under linux, but have no personal
> experience with.  Namely, I've heard claims that the 440BX chipset systems
> with their PC100 SDRAM(8ns) actually are not reliable under linux unless you
> use PC100+ SDRAM(7ns).  It's not really called PC100+ SDRAM, but the point
> is that the original PC100 SDRAM was 8ns, while for linux to be reliable I
> have heard claims that 100MHz systems need 7ns SDRAM instead.  It's entirely
> possible (even plausible) to think that Dell would go the cheaper route
> of 8ns SDRAM when building for NT.
> 
> 3) Not knowing the exact machine configurations, there could be virtual
> address mapping problems due to VM size and kernel offset combinations.  You
> might try going back to 256MB RAM and seeing if a problem machine all of a
> sudden starts playing nice.  If so, then you'll need to tweak some kernel
> headers and build a custom kernel for the 512MB RAM case (I doubt this one,
> the stock defines should be good to just under 1GB RAM).
> 
> 4) Yes, it's perfectly reasonable to expect that if the SDRAM is even close
> to marginal with the aic7xxx driver not installed, then installing the
> driver and using an Ultra2 disk is actually likely to break the RAM stuff. 
> These errors can take all kinds of shapes and forms.  However, I would
> suspect that the SDRAM in these machines is ECC SDRAM and that ECC error
> checking is enabled in the BIOS.  If that's the case, do you even know if
> the machine gets single bit errors without the driver?  It's possible that
> the machine could have single bit, silently corrected errors in the diskless
> mode, and start getting multi-bit errors when the driver is active. 
> Personally, I doubt this, but I would also check to make sure the SDRAM is
> ECC SDRAM and that the ECC code is enabled in the BIOS.  If you don't have
> ECC SDRAM, then I would send it back and tell Dell to put the right SDRAM in
> there.  I can't think of a valid reason not to use ECC SDRAM in PII
> machines, especially after how the entire SDRAM stuff started out.  I simply
> don't trust it without ECC.
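
The difference between a silently corrected single-bit error and a fatal
multi-bit error can be sketched with a toy SECDED (single error correct,
double error detect) code.  This is only an illustration under loud
assumptions: it protects 4 data bits with a Hamming(7,4) code plus one
overall parity bit, whereas real ECC SDRAM uses 8 check bits per 64 data
bits, and the function names and status strings are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1).

    Indices 1..7 form a Hamming(7,4) code (check bits at 1, 2, 4);
    index 0 is an overall parity bit over the other seven.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1
    return code


def decode(code):
    """Return (status, nibble); nibble is None if the word is unrecoverable."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is zero
    # for a valid codeword and equals the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped, so fix it quietly.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # This is the case that would be reported rather than repaired.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of an encoded word still decodes to the original
data with status "corrected", which the software never sees; flipping two
bits yields "uncorrectable", the situation that would surface as an error.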

I'll see if I can find these things out.  Remember, these are gift
horses that I'm doing dentistry on; I honestly didn't know what I was
getting until I opened up the boxes (as in just yesterday I discovered
two of the 16 have DAT drives when I finally got around to unpacking
them, hooray:-).  I, too, have heard the 8ns/7ns SDRAM thing on lists,
and this is the kind of thing with the right marginality -- on some
systems, the 8ns SDRAM might actually tolerate 7 ns access, on others
there might be a problem.
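
The marginality in question is easy to put in numbers.  The sketch below
assumes the "8ns" and "7ns" ratings are minimum clock cycle times, which
is a simplification; a real timing budget also involves access time,
setup/hold requirements and board trace delays.  The function names are
made up for illustration.

```python
def clock_period_ns(bus_mhz):
    """Length of one bus clock cycle in nanoseconds."""
    return 1000.0 / bus_mhz


def max_clock_mhz(part_ns):
    """Fastest clock a part could follow if its rating is a cycle time."""
    return 1000.0 / part_ns


def margin_ns(bus_mhz, part_ns):
    """Slack per cycle between the bus period and the part's rating."""
    return clock_period_ns(bus_mhz) - part_ns
```

Under that assumption a 100 MHz bus has a 10 ns period, so margin_ns(100, 8)
leaves 2 ns of slack per cycle and margin_ns(100, 7) leaves 3 ns.  Whether
2 ns is enough would depend on the individual board and modules, which fits
a problem that shows up on some machines and not others.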

I have to admit that I'm still very confused as to how/why this could
create a state-dependent problem.  Specifically, I still don't see why
running the aic7xxx device driver generates more or faster memory
traffic than running a 180+MB background numerical job with lots of
memory accesses in all flavours or DMA-based fast ethernet transfers
-- I would have thought that memory access is memory access and always
occurs at a fixed, deterministic speed (neither faster nor slower) as
dictated by the associated clocks.  I thought that the only variable
associated with memory "speed" was access latency and contention for
the memory bus itself, which clearly has to be handled robustly in
order for any operating system/hardware combo to work given a wide
range of memory speeds.

That is, I would have expected memory access rates to be hardware
bound and almost totally insensitive to the nature or source of the
demands placed on it by the CPU or attached device.  I would have
expected the DMA controller to just (recoverably) block until any
requested memory transaction completes regardless of its nature or how
urgently the CPU or attached device wants the data.  After all, the
memory is nearly always provided more slowly than the attached devices
would "like" -- hence the introduction of fast caches and the like to
buffer the demand.

This also makes me wonder about how NT could put "less" stress on
memory -- I don't see that it makes one whit of difference that it is
linux or NT requesting a read or write of a stream of data to/from some
starting location in RAM -- either one should be initiated either
through the DMA controller directly to/from a device (in which case
the operating system is not directly involved) or the device involved
is the CPU (in which case the actual machine code running might well be
absolutely identical under NT and Linux, as in some core inner loop in
a big matrix multiply).  Either way, it's hard to see how an operating
system/device combination that runs stably on 60 ns (or worse!) DRAM
suddenly engenders errors on an 8 ns SDRAM system (with the same
devices and PCI bus) UNLESS the latencies involved are software, not
hardware.

I should point out that there is no reason to believe that the aic7xxx
code is doing anything particularly "special" or memory intensive at
the moment the crash occurs.  Indeed, it appears to be hanging out
waiting for the Inquiry command to complete (the same command that was
hanging the system completely until I disabled the SCSI BIOS).  Why
should only the aic7xxx driver tweak the NMI bug during a boot
initialization phase when the device isn't even doing a DMA stream?
Again, PCI writes to a device controller surely occur at the PCI bus
speed independent of which device one is writing to, and my ethernet
controller and superfast video device work just fine.  I suspect a
possible problem with mmap -- perhaps some memory is being remapped to
dead air so that an attempted write that came back as a parity error
before now tweaks the NMI call.

> Anyway, there's a few things to consider when trying to track this down. 
> Good luck on finding it, my personal bet would be first on the 7ns vs. 8ns
> SDRAM and then second on ECC issues.

I agree, and will dutifully investigate both.  I will also do the
following:

 a) Build a diskless kernel with the aic7xxx driver as a module
(again) to see if the driver initialization transactions can complete
if the system isn't in the middle of a boot.  This also permits me to
go back into the code and figure out with copious printk's just what
statement is being executed at the moment of death.

 b) See if I can determine why the 7890 controller isn't coming up
Ultra.  From what I recall of the initialization code, it "has" to be
coming up Ultra as it is a U2W controller with (only!) a U2W disk.  I
suspect a bug, but it could be this same bug seen from a different
point of view.  Maybe the crashing systems are ones where the
controller succeeds in initializing as Ultra and the non-crashing
systems somehow fail the Ultra negotiation and come up as just
Wide/Sync.  I don't know.

 c) I'll certainly try setting mem=256M and the like; I did a bit of
this yesterday and it didn't help, but I'll do a bit more systematic
stuff today.

I'll still bet a nickel that the problem is software.  Or a beer if
you (like many linux humans, including myself:-) occasionally
indulge...  The memory speed thing has me musing, though...

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu






