FreeBSD Mail Archives

Date:      Mon, 06 Dec 1999 12:22:17 -0800
From:      Ed Hall <edhall@screech.weirdnoise.com>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        "Jonathan M. Bresler" <jmb@hub.freebsd.org>, kris@hub.freebsd.org, freebsd-hackers@FreeBSD.ORG
Subject:   Re: PCI DMA lockups in 3.2 (3.3 maybe?) 
Message-ID:  <199912062022.MAA30052@screech.weirdnoise.com>
In-Reply-To: Your message of "Mon, 06 Dec 1999 10:34:35 PST." <199912061834.KAA71206@apollo.backplane.com>

I've confirmed that neither problem exists in 4.0.  There are ample
workarounds, both hardware and software, including just not using 3.3.
No fixes, though, just workarounds...  Workarounds for the NCR/FXP
issue included:

1) Using 2.2.8 (4.0 isn't a production option).
2) Using a different NIC (a Tulip worked fine).
3) Using a different SCSI adapter (Adaptec, as Matt suggested, works fine).
4) Using a different SCSI driver (Peter managed to get a driver from 4.0
   hooked up under 3.3, and it survived two days of torture that would
   have toasted things within an hour using the stock driver; you'll have
   to ask him for details).

Workarounds for the pagedaemon issue included:

1) Using 2.2.8 (4.0, too, but not as a production option)
   (do I see a pattern?)
2) Using read()/write() instead of mmap() for certain file updates in
   our application.  In this case read()/write() performed better anyhow.
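
To make workaround 2 concrete, here is a minimal sketch of the two
update styles.  It is not the application's actual code (that was never
posted); the function names, the whole-file mapping and the error
handling are assumptions made purely for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite len bytes at offset off through a
 * shared mapping.  This is the style of update that was avoided. */
int update_with_mmap(const char *path, off_t off, const void *buf, size_t len)
{
    struct stat st;
    void *map;
    int fd, rc;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    /* Refuse updates that would run past the end of the file. */
    if (fstat(fd, &st) < 0 || off < 0 || off > st.st_size ||
        (off_t)len > st.st_size - off) {
        close(fd);
        return -1;
    }
    /* Map the whole file for simplicity; a real program would map
     * only the page-aligned range it needs. */
    map = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (map == MAP_FAILED)
        return -1;
    memcpy((char *)map + off, buf, len);
    rc = msync(map, (size_t)st.st_size, MS_SYNC);  /* push dirty pages out */
    munmap(map, (size_t)st.st_size);
    return rc;
}

/* Hypothetical helper: the same update using lseek() and write(),
 * which is the style the workaround switched to. */
int update_with_write(const char *path, off_t off, const void *buf, size_t len)
{
    const char *p = buf;
    int fd;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (lseek(fd, off, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    while (len > 0) {               /* write() may be short; loop until done */
        ssize_t n = write(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return close(fd);
}
```

Both helpers return 0 on success and -1 on failure, and both leave the
file length unchanged.  Whether the second style also performs better,
as it did for us, will depend on the access pattern.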

So the two issues I described are no longer "active" for the purposes
of my project.  I posted because I feared that what I saw as the main
issue--that 3.3 is regarded in some circles as not being up to FreeBSD
standards--was getting lost in various unseemly side-issues.  It could
be that I was just plain unlucky, but my experiences suggest that there
may be some merit to that view.  You be the judge.

I've been with BSD a long time--from back when my email address was
decvax!randvax!edhall.  I want it to succeed, for reasons that are more
emotional than rational; my nightmare was having to say that my project
(1) worked on Solaris, (2) worked on Linux, but (3) broke FreeBSD.
I'd be a pretty poor engineer to play favorites when the facts point
in another direction.  Fortunately, we were able to discover a more
favorable set of facts.  This time.

		-Ed

:  Matthew Dillon <dillon@apollo.backplane.com> wrote:
: :You write:
: :: 	we can not identify the specific problem from this message.
: :: without sufficient information to identify and hopefully reproduce
: :: the problem, we can not address it.  please provide this information
: :: if it is available to you. if it is not, please provide us contact
: :: information for the commercial entities experiencing the problem.
: :
: :I work at Yahoo.  My address there is "edhall@yahoo-inc.com".
: :
: :On a recent project I encountered two show-stopping bugs with 3.3-release
: :that did not exist in 2.2.8-release:
: :
: :1) Random crashes in FXP interrupt or low-level IP code.  Something is
: :   clobbering the kernel stack--possibly the NCR driver, since using an
: :   Adaptec made the problem stop, as did a backport of the CAM driver
: :   Peter Wemm tried.  This was on an N440BX, which is becoming quite
: :   common in server applications.  Other installations are apparently
: :   seeing the same problem on this hardware.
: :
: :2) A hard loop in the pagedaemon.  This was especially egregious, since
: :   it meant the system had to be rebooted from the console--and since
: :   the application could elicit the problem within a few minutes.
: :   Disabling the use of mmap() for file update in the application
: :   prevented the problem.  After spending a day trying to cook up a
: :   test program that elicited the same behavior that the application
: :   did, I gave up for lack of time.  But there have been other reports
: :   of late that sound like this problem, mostly in high VM/RAM situations.
: :
: :That's two serious bugs that exist in 3.3-release but not in 2.2.8-release.
: :Looking back through the archives, I can see that I'm not the only one who
: :has experienced them.  I came away from the experience with the feeling that
: :the FreeBSD project has some serious Q/A problems... and I can assure you,
: :I'm not alone in this feeling.
: :
: :		-Ed
: 
:     Well, #2 at least should be fixed in -current.  Unfortunately the
:     changes to the VM system were too extensive to backport to 3.x.  Or,
:     I should say, at the time I started working on the VM system, core
:     was not interested in allowing me to backport the changes, and then later
:     it was simply too late - too many changes had been made.
: 
:     #1 has come up a couple of times.  There was a conversation in October
:     that closely relates to your problem:
: 
: :From: Joe McGuckin <joe@monk.via.net>
: :Subject:  fxp related kernel panic
: :
: :I have a 3.3-stable machine that I use as a news router (running diablo). The
: :fxp0 interface averages 10-15 Mbps bandwidth continuously.
: :
: :About once a week the machine crashes & reboots. We enabled the debugger this
: :time and captured the following debug output:
: :
: :Fatal trap 12: page fault while in kernel mode
: :fault virtual address   = 0x382e4641
: :fault code              = supervisor write, page not present
: :instruction pointer     = 0x8:0xc01a372e
: :stack pointer           = 0x10:0xc02523b0
: :frame pointer           = 0x10:0xc02523c0
: :code segment            = base 0x0, limit 0xfffff, type 0x1b
: :                        = DPL 0, pres 1, def32 1, gran 1
: :processor eflags        = interrupt enabled, resume, IOPL = 0
: :current process         = Idle
: :interrupt mask          = net
: :kernel: type 12 trap, code=0
: :Stopped at      fxp_add_rfabuf+0x1de:   movw    %ax,0x4(%esi)
: :db> 
: :
: :%uname -a
: :FreeBSD feeder.via.net 3.3-STABLE FreeBSD 3.3-STABLE #7: Mon Oct 18 17:14:40 PDT 1999     lewis@feeder.via.net:/usr/src/sys/compile/DIABLO  i386
: :
: :%dmesg
: :Copyright (c) 1992-1999 FreeBSD Inc.
: :Copyright (c) 1982, 1986, 1989, 1991, 1993
: :        The Regents of the University of California. All rights reserved.
: :FreeBSD 3.3-STABLE #7: Mon Oct 18 17:14:40 PDT 1999
: 
:     To which DG responded:
: 
: :From:     David Greenman <dg@root.com>
: :Subject:  Re: fxp related kernel panic 
: :To:       Joe McGuckin <joe@monk.via.net>
: :Cc:       hackers@FreeBSD.ORG, lewis@lppi.com
: :Date:     Tue, 26 Oct 1999 11:43:02 -0700
: :
: :
: :   Let me guess...your system has an Intel N440BX motherboard, right? If so,
: :then it's a known problem with no solution yet.
: :
: :-DG
: :
: :David Greenman
: :Co-founder/Principal Architect, The FreeBSD Project - http://www.freebsd.org
: :Creator of high-performance Internet servers - http://www.terasolutions.com
: :Pave the road of life with opportunities.
: 
:     And he also said:
: 
: :From:     David Greenman <dg@root.com>
: :Subject:  Re: fxp related kernel panic 
: :To:       Lew Payne <lew@lppi.com>
: :Cc:       hackers@FreeBSD.ORG, Joe McGuckin <joe@monk.via.net>
: :Date:     Tue, 26 Oct 1999 13:19:45 -0700
: :
: :
: :>Hi David -- What if I install a *real* EtherExpress Pro-100B (or
: :>whatever it's known as today) in the PCI slot, and use it instead
: :>of the on-board (N440BX motherboard) fxp0 interface?
: :>
: :>Judging that you probably know the nature of the problem, do you
: :>think this might circumvent it?
: :
: :   I think it is caused by the NCR/Symbios controller. It might be a side
: :effect of the NCR just using up a lot of PCI bandwidth, with the real bug
: :being in the fxp driver (although I've looked and haven't found one). So
: :I don't think putting in a real Pro/100 will have any effect on the problem.
: :Of course I don't really know what is causing it, so just about anything
: :is possible.
: :
: :-DG
: :
: :David Greenman
: 
:     And that, I'm afraid, is where it has been left.  Nobody is sure where
:     the problem is.  I suspect that it may be a DMA synchronization problem
:     with either the NCR or the FXP driver, or perhaps heavy PCI bandwidth
:     usage is generating a FIFO overrun error during the FXP DMA that the
:     driver is not handling properly.  I just don't know.
: 
:     The only current solution is to use an Adaptec controller.  I have
:     personally had *extremely* good luck with Adaptecs: 2940UW, 7896 (or 97)
:     U2W (on-motherboard), and 7890 (or 91) U2W (PCI card).
: 
:     I think part of the reason the problem has not been fixed is that many
:     of the hardcore developers are using Adaptec controllers rather than NCR
:     controllers and simply cannot reproduce it.
: 
: 					-Matt
: 					Matthew Dillon 
: 					<dillon@backplane.com>
: 

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199912062022.MAA30052>
