From owner-freebsd-hackers Mon Dec 6 12:21:43 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from screech.weirdnoise.com (209-128-78-198.bayarea.net [209.128.78.198])
	by hub.freebsd.org (Postfix) with ESMTP id 8402615737;
	Mon, 6 Dec 1999 12:21:13 -0800 (PST)
	(envelope-from edhall@screech.weirdnoise.com)
Received: from screech.weirdnoise.com (localhost [127.0.0.1])
	by screech.weirdnoise.com (8.8.7/8.8.7) with ESMTP id MAA30052;
	Mon, 6 Dec 1999 12:22:17 -0800
Message-Id: <199912062022.MAA30052@screech.weirdnoise.com>
X-Mailer: exmh version 2.0.2
To: Matthew Dillon
Cc: "Jonathan M. Bresler", kris@hub.freebsd.org, freebsd-hackers@FreeBSD.ORG
Subject: Re: PCI DMA lockups in 3.2 (3.3 maybe?)
In-Reply-To: Your message of "Mon, 06 Dec 1999 10:34:35 PST." <199912061834.KAA71206@apollo.backplane.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Mon, 06 Dec 1999 12:22:17 -0800
From: Ed Hall
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

I've confirmed that neither problem exists in 4.0.  There are ample
work-arounds, both hardware and software, including just not using 3.3.
No fixes, though, just work-arounds...

Workarounds for the NCR/FXP issue included:

1) Using 2.2.8 (4.0 isn't a production option).
2) Using a different NIC (a Tulip worked fine).
3) Using a different SCSI adapter (Adaptec, as Matt suggested, works fine).
4) Using a different SCSI driver (Peter managed to get a driver from 4.0
   hooked up under 3.3, and it survived two days of torture that would
   have toasted things within an hour using the stock driver; you'll have
   to ask him for details).

Workarounds for the pagedaemon issue included:

1) Using 2.2.8 (4.0, too, but not as a production option) (do I see a
   pattern?)
2) Using read()/write() instead of mmap() for certain file updates in
   our application.  In this case read()/write() performed better anyhow.

So the two issues I described are no longer "active" for the purposes of
my project.
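
For readers who want to see what the second pagedaemon workaround looks like
in practice, here is a minimal sketch in Python.  It is an illustration only:
the application's real code does not appear in this thread, and the function
names (update_with_mmap, update_with_write, read_range) are hypothetical.
The first function updates a file through a shared mapping, which is the
pattern that was avoided; the other two do the same job with explicit
lseek()/read()/write() calls.

```python
import mmap
import os


def update_with_mmap(path, offset, data):
    """Overwrite bytes in place through a shared, writable file mapping."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0) as m:
            m[offset:offset + len(data)] = data
            m.flush()  # ask the kernel to write dirty pages back to the file


def update_with_write(path, offset, data):
    """Overwrite the same bytes with explicit lseek() and write() calls."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        sent = 0
        while sent < len(data):
            # write() may write fewer bytes than requested, so loop.
            sent += os.write(fd, data[sent:])
    finally:
        os.close(fd)


def read_range(path, offset, length):
    """Read up to `length` bytes starting at `offset` with lseek() and read()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        chunks = []
        remaining = length
        while remaining > 0:
            chunk = os.read(fd, remaining)
            if not chunk:  # end of file
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

Both update functions leave the file with the same contents.  The difference
is in how the kernel sees the change: the mapped version dirties pages that
the VM system must later write back, while the write() version hands the data
to the kernel explicitly at the time of the call.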

I posted because I feared that what I saw as the main issue--that 3.3 is
regarded in some circles as not being up to FreeBSD standards--was getting
lost in various unseemly side-issues.  It could be that I was just plain
unlucky, but my experiences suggest that there may be some merit to that
view.  You be the judge.

I've been with BSD a long time--from back when my email address was
decvax!randvax!edhall.  I want it to succeed, for reasons that are more
emotional than rational; my nightmare was having to say that my project
(1) worked on Solaris, (2) worked on Linux, but (3) broke FreeBSD.  I'd be
a pretty poor engineer to play favorites when the facts point in another
direction.  Fortunately, we were able to discover a more favorable set of
facts.  This time.

		-Ed

: Matthew Dillon wrote:
: :You write:
: :: we can not identify the specific problem from this message.
: :: without sufficient information to identify and hopefully reproduce
: :: the problem, we can not address it.  please provide this information
: :: if it is available to you.  if it is not, please provide us contact
: :: information for the commercial entities experiencing the problem.
: :
: :I work at Yahoo.  My address there is "edhall@yahoo-inc.com".
: :
: :On a recent project I encountered two show-stopping bugs with 3.3-release
: :that did not exist in 2.2.8-release:
: :
: :1) Random crashes in FXP interrupt or low-level IP code.  Something is
: :   clobbering the kernel stack--possibly the NCR driver, since using an
: :   Adaptec made the problem stop, as did a backport of the CAM driver
: :   Peter Wemm tried.  This was on an N440BX, which is becoming quite
: :   common in server applications.  Other installations are apparently
: :   seeing the same problem on this hardware.
: :
: :2) A hard loop in the pagedaemon.  This was especially egregious, since
: :   it meant the system had to be rebooted from the console--and since
: :   the application could elicit the problem within a few minutes.
: :   Disabling the use of mmap() for file update in the application
: :   prevented the problem.  After spending a day trying to cook up a
: :   test program that elicited the same behavior that the application
: :   did, I gave up for lack of time.  But there have been other reports
: :   of late that sound like this problem, mostly in high VM/RAM situations.
: :
: :That's two serious bugs that exist in 3.3-release but not in 2.2.8-release.
: :Looking back through the archives, I can see that I'm not the only one who
: :has experienced them.  I came away from the experience with the feeling that
: :the FreeBSD project has some serious Q/A problems... and I can assure you,
: :I'm not alone in this feeling.
: :
: :		-Ed
:
:    Well, #2 at least should be fixed in -current.  Unfortunately the
:    changes to the VM system were too extensive to backport to 3.x.  Or,
:    I should say, that at the time I started working on the VM system core
:    was not interested in allowing me to backport the changes, and then later
:    it was simply too late - too many changes had been made.
:
:    #1 has come up a couple of times.  There was a conversation in October
:    that closely relates to your problem:
:
: :From: Joe McGuckin
: :Subject: fxp related kernel panic
: :
: :I have a 3.3-stable machine that I use as a news router (running diablo).  The
: :fxp0 interface averages 10-15 Mbps bandwidth continuously.
: :
: :About once a week the machine crashes & reboots.
: :We enabled the debugger this time and captured the following debug output:
: :
: :Fatal trap 12: page fault while in kernel mode
: :fault virtual address   = 0x382e4641
: :fault code              = supervisor write, page not present
: :instruction pointer     = 0x8:0xc01a372e
: :stack pointer           = 0x10:0xc02523b0
: :frame pointer           = 0x10:0xc02523c0
: :code segment            = base 0x0, limit 0xfffff, type 0x1b
: :                        = DPL 0, pres 1, def32 1, gran 1
: :processor eflags        = interrupt enabled, resume, IOPL = 0
: :current process         = Idle
: :interrupt mask          = net
: :kernel: type 12 trap, code=0
: :Stopped at      fxp_add_rfabuf+0x1de:   movw    %ax,0x4(%esi)
: :db>
: :
: :%uname -a
: :FreeBSD feeder.via.net 3.3-STABLE FreeBSD 3.3-STABLE #7: Mon Oct 18 17:14:40 PDT 1999 lewis@feeder.via.net:/usr/src/sys/compile/DIABLO i386
: :
: :%dmesg
: :Copyright (c) 1992-1999 FreeBSD Inc.
: :Copyright (c) 1982, 1986, 1989, 1991, 1993
: :        The Regents of the University of California. All rights reserved.
: :FreeBSD 3.3-STABLE #7: Mon Oct 18 17:14:40 PDT 1999
:
:    To which DG responded:
:
: :From: David Greenman
: :Subject: Re: fxp related kernel panic
: :To: Joe McGuckin
: :Cc: hackers@FreeBSD.ORG, lewis@lppi.com
: :Date: Tue, 26 Oct 1999 11:43:02 -0700
: :
: :
: :   Let me guess...your system has an Intel N440BX motherboard, right? If so,
: :then it's a known problem with no solution yet.
: :
: :-DG
: :
: :David Greenman
: :Co-founder/Principal Architect, The FreeBSD Project - http://www.freebsd.org
: :Creator of high-performance Internet servers - http://www.terasolutions.com
: :Pave the road of life with opportunities.
:
:    And he also said:
:
: :From: David Greenman
: :Subject: Re: fxp related kernel panic
: :To: Lew Payne
: :Cc: hackers@FreeBSD.ORG, Joe McGuckin
: :Date: Tue, 26 Oct 1999 13:19:45 -0700
: :
: :
: :>Hi David -- What if I install a *real* EtherExpress Pro-100B (or
: :>whatever it's known as today) in the PCI slot, and use it instead
: :>of the on-board (N440BX motherboard) fxp0 interface?
: :>
: :>Judging that you probably know the nature of the problem, do you
: :>think this might circumvent it?
: :
: :   I think it is caused by the NCR/Symbios controller.  It might be a side
: :effect of the NCR just using up a lot of PCI bandwidth, with the real bug
: :being in the fxp driver (although I've looked and haven't found one).  So
: :I don't think putting in a real Pro/100 will have any effect on the problem.
: :Of course I don't really know what is causing it, so just about anything
: :is possible.
: :
: :-DG
: :
: :David Greenman
:
:    And that, I'm afraid, is where it has been left.  Nobody is sure where
:    the problem is.  I suspect that it may be a DMA synchronization problem
:    with either the NCR or the FXP driver, or perhaps heavy PCI bandwidth
:    usage is generating a FIFO overrun error during the FXP DMA that the
:    driver is not handling properly.  I just don't know.
:
:    The only current solution is to use an Adaptec controller.  I have
:    personally had *extremely* good luck with Adaptecs: 2940UW, 7896 (or 97)
:    U2W (on-motherboard), and 7890 (or 91) U2W (PCI card).
:
:    I think part of the reason the problem has not been fixed is that many
:    of the hardcore developers are using Adaptec controllers rather than NCR
:    controllers and simply cannot reproduce it.
:
:					-Matt
:					Matthew Dillon
:


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message