Date:      Mon, 06 Dec 1999 13:28:40 -0800
From:      Mike Smith <msmith@freebsd.org>
To:        Gerard Roudier <groudier@club-internet.fr>
Cc:        Ed Hall <edhall@screech.weirdnoise.com>, freebsd-hackers@FreeBSD.ORG
Subject:   Re: PCI DMA lockups in 3.2 (3.3 maybe?) 
Message-ID:  <199912062128.NAA01671@mass.cdrom.com>
In-Reply-To: Your message of "Mon, 06 Dec 1999 23:21:15 +0100." <Pine.LNX.3.95.991206224054.405B-100000@localhost> 


> I have some remarks about the issue. I do not claim it is not a software 
> problem, but ...
> 
> 1) Given the actual differences between the ncr and sym drivers nowadays, 
> I would be surprised if the problem was due to a driver software bug.
> A difference could be that recent drivers may use PCI optimized
> transactions (Memory Write and Invalidate, Memory Read Multiple).

The problem has been seen manifesting under both the 'ncr' and 'sym' 
drivers.  The nature of the situation suggests that it may be a symptom 
of the techniques used to talk to the LSI parts in conjunction with some 
other bus circumstances.

> 2) In order to investigate some hardware problem, we need to know about
> the actual revision of PCI chips used on the system and to have access to
> the corresponding errata listings. I can look into the ones I have (basically
> SYMBIOS chips), and into the specification updates for the 440BX that are
> available from the Intel site, but I do not have anything about the network
> board (nor do I know this board).

The symptoms seem to manifest across a range of part revisions; the Intel 
ethernet part involved is the 82558, for which not all data is available.

> 3) I do not see the reasons that led you to think the kernel stack was 
> clobbered by some part involving the ncr/symbios chips, but maybe 
> a clear diagnosis exists.

This assumption stems from a diagnosis I performed some time back, and 
may have been independently corroborated.  Analysis of a trap taken in 
the EtherExpress driver showed register contents which were inconsistent 
with the preceding instruction stream, but consistent with the trap 
itself.  Given the highly repeatable nature of the trap, the only 
conclusions that I could come to were:

 - Something about the instruction sequence was triggering a failure in 
   the CPU causing corruption of the register file.

 - An interrupt was being taken at a particular point as a direct 
   consequence of some interaction between the ethernet and SCSI 
   hardware, and the interrupt handler was damaging the stack such that
   on return the register contents were restored as garbage.

In my original case, there were essentially only two common points of 
failure, both inside the 'fxp' driver, and both showing the same signs of 
register corruption.

> 4) Do all the paths (PCI, memory, ...) have parity enabled, and do the
> corresponding parts perform parity checking?

We were using ECC memory and CPU cache in the case I was working with.  I 
don't _think_ that PCI parity would have helped here (the problem seemed 
too consistent to be a noise-related failure of that kind).
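
For anyone who does want to check the PCI parity question on their own 
hardware, here is a rough sketch of what to look at in a device's config 
header.  The bit positions are those of the standard PCI command and 
status registers; the helper names are made up for illustration and are 
not taken from any FreeBSD driver.  How the two 16-bit values are read 
out of config space is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard PCI config header bits: command register at offset 0x04,
 * status register at offset 0x06. */
#define PCI_CMD_MWI_ENABLE       0x0010u /* Memory Write and Invalidate enable */
#define PCI_CMD_PARITY_RESPONSE  0x0040u /* Parity Error Response */
#define PCI_CMD_SERR_ENABLE      0x0100u /* SERR# driver enable */
#define PCI_STS_MASTER_DPE       0x0100u /* Master Data Parity Error */
#define PCI_STS_DETECTED_PARITY  0x8000u /* Detected Parity Error */

/* Hypothetical helper: true if the device is set to act on parity errors. */
static bool pci_parity_checking_enabled(uint16_t command)
{
    return (command & PCI_CMD_PARITY_RESPONSE) != 0;
}

/* Hypothetical helper: true if the device has latched a parity error. */
static bool pci_parity_error_seen(uint16_t status)
{
    return (status & (PCI_STS_MASTER_DPE | PCI_STS_DETECTED_PARITY)) != 0;
}
```

The Detected Parity Error status bit is latched by the device even when 
Parity Error Response is off in the command register, so the status word 
is worth inspecting after a failure either way.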

> 5) Did you try using normal IO instead of MMIO for the SYMBIOS chip 
> and the Network chip, if the code allows?
> MMIO may confuse drivers that are not aware of posted buffers. For example
> a PCI device driver that writes using MMIO to some IO register to ack
> something and then assumes the chip knows about it is just wrong, since the
> transaction can be posted (a read, dummy if needed, must be performed
> before making such an assumption). This also acts as a barrier for drivers
> that are not clean about actual instruction and memory ordering.

I'm not sure how this sort of error would lead to stack corruption unless 
it resulted in a deferred PCI DMA to a stack variable.
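
To make the posted-write point in (5) concrete, here is a toy model, not 
real driver code.  The 'fake_bridge' structure and the function names are 
invented for illustration; the model holds a single posted write and 
relies only on the PCI ordering rule that a read from the device pushes 
earlier posted writes ahead of it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a bridge with a one-entry posted-write buffer
 * (an assumption for illustration; real bridges buffer more). */
struct fake_bridge {
    uint32_t device_reg;   /* value the device has actually seen */
    uint32_t posted_value; /* write still held in the bridge */
    bool     posted;       /* true while a write is in flight */
};

/* An MMIO write completes for the CPU before it reaches the device. */
static void mmio_write(struct fake_bridge *b, uint32_t v)
{
    b->posted_value = v;
    b->posted = true;
}

/* A read forces any posted write out to the device first. */
static uint32_t mmio_read(struct fake_bridge *b)
{
    if (b->posted) {
        b->device_reg = b->posted_value;
        b->posted = false;
    }
    return b->device_reg;
}

/* Ack, then do a dummy read so the device is known to have seen it. */
static void ack_and_flush(struct fake_bridge *b, uint32_t ack)
{
    mmio_write(b, ack);
    (void)mmio_read(b);
}
```

In this model a bare mmio_write() leaves device_reg unchanged until 
something reads from the device, which is the window in which a driver 
that assumes the ack has landed would be wrong.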

At any rate, if you're interested in looking at a kernel core from such a 
failure, I'm sure we can make one available to you.

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  msmith@freebsd.org
\\ and he'll hate you for a lifetime.             \\  msmith@cdrom.com
