Date: Mon, 06 Dec 1999 13:28:40 -0800
From: Mike Smith <msmith@freebsd.org>
To: Gerard Roudier <groudier@club-internet.fr>
Cc: Ed Hall <edhall@screech.weirdnoise.com>, freebsd-hackers@FreeBSD.ORG
Subject: Re: PCI DMA lockups in 3.2 (3.3 maybe?)
Message-ID: <199912062128.NAA01671@mass.cdrom.com>
In-Reply-To: Your message of "Mon, 06 Dec 1999 23:21:15 +0100." <Pine.LNX.3.95.991206224054.405B-100000@localhost>
> I have some remarks about the issue. I do not claim it is not a software
> problem, but ...
>
> 1) Given the actual differences between the ncr and sym drivers nowadays,
> I would be surprised if the problem was due to a driver software bug.
> A difference could be that recent drivers may use PCI optimized
> transactions (Memory Write and Invalidate, Memory Read Multiple).

The problem has been seen manifesting under both the 'ncr' and 'sym'
drivers. The nature of the situation suggests that it may be a symptom
of the techniques used to talk to the LSI parts in conjunction with some
other bus circumstances.

> 2) In order to investigate some hardware problem, we need to know about
> the actual revision of PCI chips used on the system and to have access to
> the corresponding errata listings. I can look into the ones I have
> (basically SYMBIOS chips), and into the specification updates of the
> 440BX that are available from the Intel site, but I do not have anything
> about the network board (nor do I know of this board).

The symptoms seem to manifest across a range of part revisions; the Intel
ethernet part involved is the 82558, for which not all data is available.

> 3) I do not see the reasons that led to thinking the kernel stack had
> been clobbered by some part involving the ncr/symbios chips, but maybe
> a clear diagnosis exists.

This assumption stems from a diagnosis I performed some time back, and
may have been independently corroborated. Analysis of a trap taken in
the EtherExpress driver showed register contents which were inconsistent
with the preceding instruction stream, but consistent with the trap
itself. Given the highly repeatable nature of the trap, the only
conclusions that I could come to were:

 - Something about the instruction sequence was triggering a failure in
   the CPU causing corruption of the register file.
 - An interrupt was being taken at a particular point as a direct
   consequence of some interaction between the ethernet and SCSI
   hardware, and the interrupt handler was damaging the stack such that
   on return the register contents were restored as garbage.

In my original case, there were essentially only two common points of
failure, both inside the 'fxp' driver, and both showing the same signs of
register corruption.

> 4) Do all the paths (PCI, memory, ...) have parity enabled, and do the
> corresponding parts perform parity checking?

We were using ECC memory and CPU cache in the case I was working with. I
don't _think_ that PCI parity would have helped here (the problem seemed
too consistent to be a noise-related failure of that kind).

> 5) Did you give a try using normal IO instead of MMIO for the SYMBIOS chip
> and the Network chip, if code allows?
> MMIO may confuse drivers that are not aware of posted buffers. For example
> a PCI device driver that writes using MMIO to some IO register to ack
> something and then assumes the chip knows about it is just wrong since the
> transaction can be posted (a read, dummy if needed, must be performed
> prior to such an assumption). This also acts as a barrier for drivers that
> are not clean about actual instruction and memory ordering.

I'm not sure how this sort of error would lead to stack corruption unless
it resulted in a deferred PCI DMA to a stack variable.

At any rate, if you're interested in looking at a kernel core from such a
failure, I'm sure we can make one available to you.

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  msmith@freebsd.org
\\ and he'll hate you for a lifetime.             \\  msmith@cdrom.com

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
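
The posted-write concern raised in point 5 above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct fake_bridge`, `mmio_write32`, `mmio_read32` and `ack_interrupt` are hypothetical names invented for illustration, and none of this is taken from the ncr, sym or fxp drivers. The model treats the bridge as holding a single posted write that reaches the device only when a later access on the same path forces it out, which is why a dummy read after the acknowledging write matters.

```c
#include <stdint.h>

/* Hypothetical model: a bridge with a one-entry posted-write buffer
 * sitting in front of a single device register. Not real driver code. */
struct fake_bridge {
    uint32_t device_reg;  /* value the device has actually received */
    uint32_t posted_val;  /* write still waiting inside the bridge */
    int      posted;      /* non-zero while a write is still posted */
};

/* Deliver any pending posted write to the device. */
static void fake_flush(struct fake_bridge *b)
{
    if (b->posted) {
        b->device_reg = b->posted_val;
        b->posted = 0;
    }
}

/* An MMIO write returns as soon as the bridge has accepted it;
 * the device itself may not have seen the value yet. */
static void mmio_write32(struct fake_bridge *b, uint32_t v)
{
    fake_flush(b);        /* an older posted write goes out first */
    b->posted_val = v;
    b->posted = 1;
}

/* A read cannot complete until earlier posted writes are delivered. */
static uint32_t mmio_read32(struct fake_bridge *b)
{
    fake_flush(b);
    return b->device_reg;
}

/* Acknowledge something, then perform a dummy read so the write is
 * known to have reached the device before the caller continues. */
static void ack_interrupt(struct fake_bridge *b, uint32_t ack_bits)
{
    mmio_write32(b, ack_bits);
    (void)mmio_read32(b); /* dummy read flushes the posted write */
}
```

In this model, a caller that uses `mmio_write32` alone and then assumes the device has seen the value is making exactly the mistake described in the quoted text; `ack_interrupt` avoids it at the cost of one extra bus read.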