From: Colin Faber
Organization: fpsn.net, Inc. (http://www.fpsn.net)
Date: Mon, 25 Aug 2003 16:20:30 -0600
Message-ID: <3F4A8BAE.1070703@fpsn.net>
References: <20030813103509.Q49991@mail.physik.uni-mainz.de> <3F43BB52.5060503@fpsn.net> <20030821113436.G17320@klima.physik.uni-mainz.de> <3F449E40.3000206@he.iki.fi>
In-Reply-To: <3F449E40.3000206@he.iki.fi>
cc: freebsd-current@freebsd.org
cc: freebsd-smp@freebsd.org
Subject: Re: (2) 5.1-R-p2 crashes on SMP with AMI RAID and Intel 1000/Pro

After many hours of fighting with the machine I finally managed to get a
debugging kernel built. Given I can successfully panic this machine on
command what would any of you very smart developer people like out of the
panic message?
Commands I should run to get a kernel.debug etc?

Petri Helenius wrote:
>
> Related to the em driver, 82540M has not worked since sometime in
> 5.1-BETA time. I filed a PR on that a few months ago, but it seems the
> fault might be with PCI IRQ routing, not the em driver itself.
>
> Pete
>
>
> Hartmann, O. wrote:
>
>> On Wed, 20 Aug 2003, Colin Faber wrote:
>>
>> Hi.
>>
>> I first swapped the Intel 1000/PRO server NIC into the next slot and
>> since then the machine seems to be 'stable'. Then, two days later, I
>> changed the PSU to 400W units.
>>
>> I think it's an IRQ routing problem since we have had this problem
>> (spontaneous reboots) from FreeBSD 4.0 on. Changing the slot for the
>> NIC helped, but this is a very bad state.
>>
>> I can not remember the error message I got when the system crashed,
>> but it looked like yours and I always saw the amr0-text in that
>> message. ACPI is not working on the old TYAN Thunder 2500 (S1867)
>> main PCB.
>>
>> I also set machdep.cpu_idle_hlt = 0, but with no effect.
>>
>> At the moment, I do not dare swap the NIC again due to the fact the
>> machine is in a preliminary production state.
>>
>> I also noticed some weird things when creating and deleting files
>> when the system crashed. Crashes always could be forced by accessing
>> samba services from a PC. Crashes always occurred when heavy IO was
>> done, but this also could be evidence for an IRQ problem, I think.
>> I do not know. The machine was 'stable' (it means: when the NIC was at
>> the crash-causing slot) a whole night, but whenever our department
>> 'got started' in the morning time and heavy IO was done, the machine
>> froze. This changed when I swapped the NIC to another slot!!!!
>> And now I also have two 400W PSUs.
>>
>> FreeBSD 5.1-p2 on the TYAN S1867 seems to have many more problems, but
>> I do not know whether I should mention this here. truss for instance
>> crashes.
>> We use afbackup for backing up, but afbackup core dumps on this
>> machine and it does not on a UP machine also running FreeBSD 5.1-p2.
>> It also crashes on a UP kernel on this machine.
>>
>> I tried to 'truss' an afrestore call, but I had to start the tracing
>> three or four times because I got this error the first time:
>>
>> truss: PIOCWAIT: Input/output error
>>
>> or something like this
>>
>> root: /usr/local/samba/lib: truss -fae -o /tmp/afrestore afrestore -v -p "/usr/homes/kurs*" -C /
>> truss: PIOCWAIT top of loop: Input/output error
>> truss: PIOCWAIT top of loop: Input/output error
>> truss: PIOCWAIT top of loop: Input/output error
>> truss: PIOCWAIT top of loop: Input/output error
>>
>> or sometimes truss stops lacking in a /proc/PID-XXX/mem file.
>>
>> But calling it more times will 'solve' the problem.
>>
>>
>> While writing this, I crashed the system with the command shown
>> above. This is the error message from the kernel when the system
>> froze (I wrote it down from the screen):
>>
>>
>> Fatal trap 12 : page fault while in kernel mode
>> cpuid = 1; lapic.id = 00000000
>> fault virtual address = 0x24
>> fault code = supervisor read, page not present
>> instruction pointer = 0x8:0xc01b29db
>> stack pointer = 0x10:0xe8ff3b70
>> frame pointer = 0x10:0xe8ff3b84
>> code segment = base 0x0, limit 0xfffff, type 0x1b
>> = DPL 0, pres 1, def 32, gran 1
>> processor eflags = interrupt enabled, resume, IOPL = 0
>> current process = 27510 (bunzip2)
>> trap number = 12
>> panic: page fault
>> cpuid = 1, lapic.id = 00000000
>> boot() called on cpu#1
>> syncing disks, buffers remaining ...
>> panic: absolutely cannot call smp_ipi_shutdown with interrupts already disabled
>>
>> cpuid = 1; lapic.id = 00000000
>> boot() called on cpu#1
>> Uptime 1d20h18m55s
>> pfs_vncache_unload(): 6 entries remaining
>>
>> Fatal double fault:
>> eip = 0xc03134ic
>> esp = 0xe8ff1ff8
>> ebp = 0xe8ff2014
>>
>> cpuid = 1, lapic.id = 00000000
>> panic: double fault
>> cpuid = 1, lapic.id = 00000000
>> boot() called on cpu#1
>> Uptime: 1d20h18m55s
>> pfs_vncache_unload(): 6 entries remaining
>>
>> After this, the machine was dead.
>>
>> :>Hi,
>> :>
>> :>I've got nearly the same setup in a Dell 1600SC with a gig of ram
>> :>and a PERC4/Sc (LSI MegaRAID) card.
>> :>
>> :>Dual 2.4GHz Xeon P4 HT CPU's and I've discovered I can lock up
>> :>FreeBSD 5.1-RELEASE-p2 on command simply by running something to
>> :>quickly create and remove a directory. i.e.:
>> :>
>> :> perl -e 'for(my $i = 0 ; $i < 9999; $i++){ mkdir("abc"); rmdir("abc"); }'
>> :>
>> :>
>> :>Having machdep.cpu_idle_hlt = 0 makes no difference.
>> :>
>> :>
>> :>Kernel:
>> :> FreeBSD 5.1-RELEASE-p2 FreeBSD 5.1-RELEASE-p2 #0: Mon Aug 11 21:40:47 MDT 2003 i386
>> :>
>> :>Raid:
>> :> amr0: mem 0xfcd00000-0xfcd0ffff irq 3 at device 2.0 on pci1
>> :> amrd0: on amr0
>> :> amrd0: 34556MB (70770688 sectors) RAID 5 (optimal)
>> :>
>> :>
>> :>I suspect that your and my problems are related more to the amr
>> :>driver and may be exposing some other problem within the kernel's
>> :>fs locking. I don't think (as others have suggested) that your issue
>> :>is power related, or related to the combination of hardware you're
>> :>using. (Other than the fact that you've got a MegaRAID card).
>> :>
>> :>The exact crash message I'm seeing is:
>> :>
>> :>panic: lockmgr: locking against myself
>> :>cpuid = 0; lapic.id 00000000
>> :>boot() called on cpu#0
>> :>
>> :>syncing disks, buffers remaining...
>> :>panic: ffs_copyonwrite: recursive call
>> :>cpuid = 0; lapic.id 00000000
>> :>boot() called on cpu#0
>> :>Uptime: 58s
>> :>pfs_vncache_unload(): 7 entries remaining
>> :>amr0: flushing cache...done
>> :>Terminate ACPI
>> :>
>> :>
>> :>
>> :>Hartmann, O. wrote:
>> :>
>> :>> Dear Sirs.
>> :>>
>> :>> It seems to me a never ending story. We run a box with a TYAN Thunder
>> :>> 2500 Dual SMP mainboard, 2GB ECC Tyan certified memory, AMI Enterprise
>> :>> 1600 RAID adapter and additional Intel 1000/Pro server type (64 bit)
>> :>> GBit LAN NIC. With FreeBSD 4.8 this was stable, but to achieve this
>> :>> state was really hard! It is a story similar to what happened when
>> :>> we changed towards FreeBSD 5.1-RELEASE-p2 on this machine.
>> :>>
>> :>> It seems to be highly dependent on which PCI slot several cards are
>> :>> attached to, so I will report this here also.
>> :>>
>> :>> Phenomenon:
>> :>>
>> :>> After the machine has been running for a while, the SMP kernel reboots
>> :>> spontaneously. This is when heavy IO is done, compiling or, when in the
>> :>> morning time our department gets up and our staff connects to the samba
>> :>> server.
>> :>>
>> :>> Depending on which devices are switched on or off by BIOS, the kernel
>> :>> freezes at the stage when the amr0 RAID got recognized. I can avoid this
>> :>> by enabling the built in NIC (fxp0). I can force this by putting the em0
>> :>> NIC into another slot, for instance in the one remaining 64BIT/66MHz
>> :>> slot (which should be a separate bus).
>> :>>
>> :>> This 'game' was identical to the one I had with FreeBSD 4.X - 4.8 and I
>> :>> found out that putting an additional NIC into PCI slot No. 2 (counted
>> :>> from AGP slot on) made things clear, but using both NICs together
>> :>> (either additional fxp0 or the new em0) leaves the system completely
>> :>> unstable.
>> :>>
>> :>> In FreeBSD 5.1-RELEASE-p2 and especially in FreeBSD 5.1-CURRENT this
>> :>> 'gambling' seems to reach its climax.
>> :>> My kernel is built with SCHED_4BSD because SCHED_ULE and
>> :>> ADAPTIVE_MUTEXES crash immediately the same way as described
>> :>> (running a while, then coredumping or freezing at the stage after the
>> :>> amr0-RAID showed up in the kernel boot messages, see the dmesg output
>> :>> below).
>> :>>
>> :>> I'm not a hardware expert, but all this weird stuff looks to me like
>> :>> an IRQ routing problem. I fiddled around with many hand-assigned IRQ
>> :>> configurations, but nothing helped. Either the Intel 1000/Pro or the
>> :>> AMI RAID is causing problems in the TYAN Thunder 2500 SMP environment.
>> :>>
>> :>> We also have an SMP machine with similar hardware, based on an ASUS
>> :>> CV4X-D, AMI Elite 1600 RAID controller and the same Intel em0 1GBit
>> :>> NIC. OS is FreeBSD 4.8 and this system never had any problem!
>> :>>
>> :>> I feel a little bit helpless at this moment, because I think I tried
>> :>> every trick and something seems to be wrong with the combination TYAN
>> :>> Thunder 2500 and FreeBSD 5.X SMP. It is also very curious that a
>> :>> kernel without SMP/IO_APIC freezes after booting at the same place
>> :>> (amr0 RAID recognition).
>> :>>
>> :>> Is there any help out there?
>> :>>
>> :>> I attach the kernel config file and the dmesg output. Please note: I
>> :>> disabled both serial ports, the parallel port, sound and usb to get
>> :>> additional IRQs. But I have to enable the built in NIC to get a
>> :>> bootable, but unstable FreeBSD 5.1-R box.
>> :>>
>> :>> ====================================
>> :>> DMESG output
>> :>> ====================================
>> :>>
>> :>> Copyright (c) 1992-2003 The FreeBSD Project.
>> :>> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>> :>> The Regents of the University of California. All rights reserved.
>> :>> FreeBSD 5.1-RELEASE-p2 #14: Wed Aug 13 09:47:00 CEST 2003
>> :>> root@atmos.physik.uni-mainz.de:/usr/obj/usr/src/sys/ATMOS
>> :>> Preloaded elf kernel "/boot/kernel/kernel" at 0xc0458000.
>> :>> Timecounter "i8254" frequency 1193182 Hz
>> :>> Timecounter "TSC" frequency 868644793 Hz
>> :>> CPU: Intel Pentium III (868.64-MHz 686-class CPU)
>> :>> Origin = "GenuineIntel" Id = 0x683 Stepping = 3
>> :>> Features=0x387fbff
>> :>> real memory = 2147483648 (2048 MB)
>> :>> avail memory = 2085625856 (1989 MB)
>> :>> Programming 16 pins in IOAPIC #0
>> :>> IOAPIC #0 intpin 2 -> irq 0
>> :>> Programming 16 pins in IOAPIC #1
>> :>> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
>> :>> cpu0 (BSP): apic id: 1, version: 0x00040011, at 0xfee00000
>> :>> cpu1 (AP): apic id: 0, version: 0x00040011, at 0xfee00000
>> :>> io0 (APIC): apic id: 2, version: 0x000f0011, at 0xfec00000
>> :>> io1 (APIC): apic id: 3, version: 0x000f0011, at 0xfec01000
>> :>> netsmb_dev: loaded
>> :>> Pentium Pro MTRR support enabled
>> :>> npx0: on motherboard
>> :>> npx0: INT 16 interface
>> :>> pcibios: BIOS version 2.10
>> :>> Using $PIR table, 12 entries at 0xc00fdf00
>> :>> pcib0: at pcibus 0 on motherboard
>> :>> pci0: on pcib0
>> :>> IOAPIC #1 intpin 13 -> irq 2
>> :>> IOAPIC #1 intpin 12 -> irq 16
>> :>> IOAPIC #1 intpin 2 -> irq 17
>> :>> IOAPIC #1 intpin 7 -> irq 18
>> :>> pcib1: at device 0.1 on pci0
>> :>> pci1: on pcib1
>> :>> IOAPIC #1 intpin 1 -> irq 19
>> :>> pci1: at device 0.0 (no driver attached)
>> :>> sym0: <896> port 0xf800-0xf8ff mem 0xfeafe000-0xfeafffff,0xfeafac00-0xfeafafff irq 2 at device 1.0 on pci0
>> :>> sym0: Symbios NVRAM, ID 7, Fast-40, SE, parity checking
>> :>> sym0: open drain IRQ line driver, using on-chip SRAM
>> :>> sym0: using LOAD/STORE-based firmware.
>> :>> sym0: handling phase mismatch from SCRIPTS.
>> :>> sym1: <896> port 0xf400-0xf4ff mem 0xfeafc000-0xfeafdfff,0xfeafa800-0xfeafabff irq 16 at device 1.1 on pci0
>> :>> sym1: Symbios NVRAM, ID 7, Fast-40, LVD, parity checking
>> :>> sym1: open drain IRQ line driver, using on-chip SRAM
>> :>> sym1: using LOAD/STORE-based firmware.
>> :>> sym1: handling phase mismatch from SCRIPTS.
>> :>> em0: port 0xfcc0-0xfcff mem 0xfeac0000-0xfeadffff irq 17 at device 4.0 on pci0
>> :>> em0: Speed:1000 Mbps Duplex:Full
>> :>> fxp0: port 0xfc40-0xfc7f mem 0xfe900000-0xfe9fffff,0xfeaf9000-0xfeaf9fff irq 18 at device 7.0 on pci0
>> :>> fxp0: Ethernet address 00:e0:81:00:f0:d7
>> :>> miibus0: on fxp0
>> :>> inphy0: on miibus0
>> :>> inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
>> :>> isab0: port 0x500-0x50f at device 15.0 on pci0
>> :>> isa0: on isab0
>> :>> pci0: at device 15.1 (no driver attached)
>> :>> pcib2: at pcibus 2 on motherboard
>> :>> pci2: on pcib2
>> :>> pcib3: at device 2.0 on pci2
>> :>> pci3: on pcib3
>> :>> IOAPIC #1 intpin 11 -> irq 20
>> :>> IOAPIC #1 intpin 8 -> irq 21
>> :>> pcib4: at device 0.0 on pci3
>> :>> pci4: on pcib4
>> :>> IOAPIC #1 intpin 10 -> irq 22
>> :>> amr0: mem 0xf0000000-0xf3ffffff irq 22 at device 0.0 on pci4
>> :>> amr0: Firmware G170, BIOS F316, 64MB RAM
>> :>> pci3: at device 1.0 (no driver attached)
>> :>> pci3: at device 2.0 (no driver attached)
>> :>> orm0: