Message-ID: <3F449E40.3000206@he.iki.fi>
Date: Thu, 21 Aug 2003 13:26:08 +0300
From: Petri Helenius
To: "Hartmann, O."
References: <20030813103509.Q49991@mail.physik.uni-mainz.de> <3F43BB52.5060503@fpsn.net> <20030821113436.G17320@klima.physik.uni-mainz.de>
In-Reply-To: <20030821113436.G17320@klima.physik.uni-mainz.de>
cc: Colin Faber
cc: freebsd-current@freebsd.org
cc: freebsd-smp@freebsd.org
Subject: Re: (2) 5.1-R-p2 crashes on SMP with AMI RAID and Intel 1000/Pro

Related to the em driver: the 82540M has not worked since sometime around
5.1-BETA. I filed a PR on that a few months ago, but it seems the fault might
be with PCI IRQ routing, not the em driver itself.

Pete

Hartmann, O. wrote:

>On Wed, 20 Aug 2003, Colin Faber wrote:
>
>Hi.
>
>I first swapped the Intel 1000/PRO server NIC into the next slot, and since then the
>machine seems to be 'stable'.
>Then, two days later, I changed the PSUs to 400W units.
>
>I think it's an IRQ routing problem, since we have had this problem (spontaneous reboots)
>from FreeBSD 4.0 on. Changing the slot for the NIC helped, but this is a very bad state.
>
>I cannot remember the error message I got when the system crashed, but it looks like
>yours and I always saw the amr0 text in that message. ACPI is not working on the old
>TYAN Thunder 2500 (S1867) main PCB.
>
>I also changed machdep.cpu_idle_hlt = 0, but with no effect.
>
>At the moment, I do not dare swap the NIC again because the machine is in
>a preliminary production state.
>
>I also noticed some weird things when creating and deleting files when the system crashed.
>Crashes could always be forced by accessing samba services from a PC. Crashes always
>occurred when heavy IO was done, but this could also be evidence of an IRQ problem, I think.
>I do not know. The machine was 'stable' (that is: while the NIC was in the crash-causing
>slot) for a whole night, but whenever our department 'got started' in the morning and heavy
>IO was done, the machine froze. This changed when I swapped the NIC to another slot!!!!
>And now I also have two 400W PSUs.
>
>FreeBSD 5.1-p2 on the TYAN S1867 seems to have many more problems, but I do not know whether I
>should mention them here. truss, for instance, crashes. We use afbackup for backing up, but
>afbackup core dumps on this machine and it does not on a UP machine also running FreeBSD 5.1-p2.
>It also crashes on a UP kernel on this machine.
>
>I tried to 'truss' an afrestore call, but I had to start the tracing three or four times
>because I got this error the first time:
>
> truss: PIOCWAIT: Input/output error
>
>or something like this
>
> root: /usr/local/samba/lib: truss -fae -o /tmp/afrestore afrestore -v -p "/usr/homes/kurs*" -C /
> truss: PIOCWAIT top of loop: Input/output error
> truss: PIOCWAIT top of loop: Input/output error
> truss: PIOCWAIT top of loop: Input/output error
> truss: PIOCWAIT top of loop: Input/output error
>
>or sometimes truss stops because a /proc/PID-XXX/mem file is missing.
>
>But calling it a few more times will 'solve' the problem.
>
>
>While writing this, I crashed the system with the command shown above. This is the
>error message from the kernel when the system froze (I wrote it down from the screen):
>
>
>Fatal trap 12 : page fault while in kernel mode
>cpuid = 1; lapic.id = 00000000
>fault virtual address = 0x24
>fault code = supervisor read, page not present
>instruction pointer = 0x8:0xc01b29db
>stack pointer = 0x10:0xe8ff3b70
>frame pointer = 0x10:0xe8ff3b84
>code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, def 32, gran 1
>processor eflags = interrupt enabled, resume, IOPL = 0
>current process = 27510 (bunzip2)
>trap number = 12
>panic: page fault
>cpuid = 1, lapic.id = 00000000
>boot() called on cpu#1
>syncing disks, buffers remaining ... panic: absolutely cannot call
> smp_ipi_shutdown with interrupts already disabled
>
>cpuid = 1; lapic.id = 00000000
>boot() called on cpu#1
>Uptime 1d20h18m55s
>pfs_vncache_unload(): 6 entried remaining
>
>Fatal double fault:
>eip = 0xc03134ic
>esp = 0xe8ff1ff8
>ebp = 0xe8ff2014
>
>cpuid = 1, lapic.id = 00000000
>panic: double fault
>cpuid = 1, lapic.id = 00000000
>boot() called on cpu#1
>Uptime: 1d20h18m55s
>pfs_vncache_unload(): 6 entries remaining
>
>After this, the machine was dead.
>
>:>Hi,
>:>
>:>I've got nearly the same setup in a Dell 1600SC with a gig of RAM and a PERC4/Sc (LSI MegaRAID) card.
>:>
>:>It has dual 2.4GHz Xeon P4 HT CPUs, and I've discovered I can lock up FreeBSD 5.1-RELEASE-p2 on command
>:>simply by running something that quickly creates and removes a directory, e.g.:
>:>
>:> perl -e 'for(my $i = 0 ; $i < 9999; $i++){ mkdir("abc"); rmdir("abc"); }'
>:>
>:>
>:>Having machdep.cpu_idle_hlt = 0 makes no difference.
>:>
>:>
>:>Kernel:
>:> FreeBSD 5.1-RELEASE-p2 FreeBSD 5.1-RELEASE-p2 #0: Mon Aug 11 21:40:47 MDT 2003 i386
>:>
>:>Raid:
>:> amr0: mem 0xfcd00000-0xfcd0ffff irq 3 at device 2.0 on pci1
>:> amrd0: on amr0
>:> amrd0: 34556MB (70770688 sectors) RAID 5 (optimal)
>:>
>:>
>:>I suspect that your problems and mine are related more to the amr driver and may be exposing
>:>some other problem within the kernel's fs locking. I don't think (as others have suggested) that
>:>your issue is power related, or related to the combination of hardware you're using (other than
>:>the fact that you've got a MegaRAID card).
>:>
>:>The exact crash message I'm seeing is:
>:>
>:>panic: lockmgr: locking against myself
>:>cpuid = 0; lapic.id 00000000
>:>boot() called on cpu#0
>:>
>:>syncing disks, buffers remaining... panic: ffs_copyonwrite: recursive call
>:>cpuid = 0; lapic.id 00000000
>:>boot() called on cpu#0
>:>Uptime: 58s
>:>pfs_vncache_unload(): 7 entries remaining
>:>amr0: flushing cache...done
>:>Terminate ACPI
>:>
>:>
>:>
>:>Hartmann, O. wrote:
>:>
>:>> Dear Sirs.
>:>>
>:>> It seems to me a never-ending story. We run a box with a TYAN Thunder
>:>> 2500 Dual SMP mainboard, 2GB ECC Tyan certified memory, AMI Enterprise
>:>> 1600 RAID adapter and an additional Intel 1000/Pro server type (64 bit)
>:>> GBit LAN NIC. With FreeBSD 4.8 this was stable, but achieving this
>:>> state was really hard! It is a story similar to what happened when
>:>> we moved to FreeBSD 5.1-RELEASE-p2 on this machine.
>:>>
>:>> It seems to depend heavily on which PCI slots the cards are
>:>> attached to, so I will report this here also.
>:>>
>:>> Phenomenon:
>:>>
>:>> After the machine has been running for a while, the SMP kernel reboots
>:>> spontaneously. This happens under heavy IO, when compiling or when, in the
>:>> morning, our department gets going and our staff connects to the samba
>:>> server.
>:>>
>:>> Depending on which devices are switched on or off in the BIOS, the kernel
>:>> freezes at the stage when the amr0 RAID gets recognized. I can avoid this
>:>> by enabling the built-in NIC (fxp0). I can force this by putting the em0
>:>> NIC into another slot, for instance in the one remaining 64BIT/66MHz
>:>> slot (which should be a separate bus).
>:>>
>:>> This 'game' was identical to the one I had with FreeBSD 4.X - 4.8, and I
>:>> found out that putting an additional NIC into PCI slot No. 2 (counted
>:>> from the AGP slot on) cleared things up, but using both NICs together
>:>> (either the additional fxp0 or the new em0) leaves the system completely
>:>> unstable.
>:>>
>:>> In FreeBSD 5.1-RELEASE-p2 and especially in FreeBSD 5.1-CURRENT this
>:>> 'gambling' seems to reach its climax. My kernel is built with
>:>> SCHED_4BSD because SCHED_ULE and ADAPTIVE_MUTEXES crash immediately
>:>> in the same way as described (running a while, then coredumping or freezing
>:>> at the stage after the amr0 RAID shows up in the kernel boot messages,
>:>> see the dmesg output below).
>:>>
>:>> I'm not a hardware expert, but all this weird stuff looks to me like
>:>> an IRQ routing problem. I fiddled around with many hand-assigned IRQ configurations,
>:>> but nothing helped. Either the Intel 1000/Pro or the AMI RAID is causing
>:>> problems in the TYAN Thunder 2500 SMP environment.
>:>>
>:>> We also have an SMP machine with similar hardware, based on an ASUS CV4X-D,
>:>> an AMI Elite 1600 RAID controller and the same Intel em0 1GBit NIC. The OS is
>:>> FreeBSD 4.8 and this system never had any problem!
>:>>
>:>> I feel a little bit helpless at the moment, because I think I tried every trick
>:>> and something seems to be wrong with the combination of TYAN Thunder 2500 and FreeBSD
>:>> 5.X SMP. It is also very curious that a kernel without SMP/IO_APIC freezes after
>:>> booting at the same place (amr0 RAID recognition).
>:>>
>:>> Is there any help out there?
>:>>
>:>> I attach the kernel config file and the dmesg output. Please note: I disabled both
>:>> serial ports, the parallel port, sound and USB to free additional IRQs. But I have to
>:>> enable the built-in NIC to get a bootable, but unstable, FreeBSD 5.1-R box.
>:>>
>:>> ====================================
>:>> DMESG output
>:>> ====================================
>:>>
>:>> Copyright (c) 1992-2003 The FreeBSD Project.
>:>> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>:>>         The Regents of the University of California. All rights reserved.
>:>> FreeBSD 5.1-RELEASE-p2 #14: Wed Aug 13 09:47:00 CEST 2003
>:>>     root@atmos.physik.uni-mainz.de:/usr/obj/usr/src/sys/ATMOS
>:>> Preloaded elf kernel "/boot/kernel/kernel" at 0xc0458000.
>:>> Timecounter "i8254" frequency 1193182 Hz
>:>> Timecounter "TSC" frequency 868644793 Hz
>:>> CPU: Intel Pentium III (868.64-MHz 686-class CPU)
>:>>   Origin = "GenuineIntel" Id = 0x683 Stepping = 3
>:>>   Features=0x387fbff
>:>> real memory = 2147483648 (2048 MB)
>:>> avail memory = 2085625856 (1989 MB)
>:>> Programming 16 pins in IOAPIC #0
>:>> IOAPIC #0 intpin 2 -> irq 0
>:>> Programming 16 pins in IOAPIC #1
>:>> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
>:>> cpu0 (BSP): apic id: 1, version: 0x00040011, at 0xfee00000
>:>> cpu1 (AP): apic id: 0, version: 0x00040011, at 0xfee00000
>:>> io0 (APIC): apic id: 2, version: 0x000f0011, at 0xfec00000
>:>> io1 (APIC): apic id: 3, version: 0x000f0011, at 0xfec01000
>:>> netsmb_dev: loaded
>:>> Pentium Pro MTRR support enabled
>:>> npx0: on motherboard
>:>> npx0: INT 16 interface
>:>> pcibios: BIOS version 2.10
>:>> Using $PIR table, 12 entries at 0xc00fdf00
>:>> pcib0: at pcibus 0 on motherboard
>:>> pci0: on pcib0
>:>> IOAPIC #1 intpin 13 -> irq 2
>:>> IOAPIC #1 intpin 12 -> irq 16
>:>> IOAPIC #1 intpin 2 -> irq 17
>:>> IOAPIC #1 intpin 7 -> irq 18
>:>> pcib1: at device 0.1 on pci0
>:>> pci1: on pcib1
>:>> IOAPIC #1 intpin 1 -> irq 19
>:>> pci1: at device 0.0 (no driver attached)
>:>> sym0: <896> port 0xf800-0xf8ff mem 0xfeafe000-0xfeafffff,0xfeafac00-0xfeafafff irq 2 at device 1.0 on pci0
>:>> sym0: Symbios NVRAM, ID 7, Fast-40, SE, parity checking
>:>> sym0: open drain IRQ line driver, using on-chip SRAM
>:>> sym0: using LOAD/STORE-based firmware.
>:>> sym0: handling phase mismatch from SCRIPTS.
>:>> sym1: <896> port 0xf400-0xf4ff mem 0xfeafc000-0xfeafdfff,0xfeafa800-0xfeafabff irq 16 at device 1.1 on pci0
>:>> sym1: Symbios NVRAM, ID 7, Fast-40, LVD, parity checking
>:>> sym1: open drain IRQ line driver, using on-chip SRAM
>:>> sym1: using LOAD/STORE-based firmware.
>:>> sym1: handling phase mismatch from SCRIPTS.
>:>> em0: port 0xfcc0-0xfcff mem 0xfeac0000-0xfeadffff irq 17 at device 4.0 on pci0
>:>> em0: Speed:1000 Mbps Duplex:Full
>:>> fxp0: port 0xfc40-0xfc7f mem 0xfe900000-0xfe9fffff,0xfeaf9000-0xfeaf9fff irq 18 at device 7.0 on pci0
>:>> fxp0: Ethernet address 00:e0:81:00:f0:d7
>:>> miibus0: on fxp0
>:>> inphy0: on miibus0
>:>> inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
>:>> isab0: port 0x500-0x50f at device 15.0 on pci0
>:>> isa0: on isab0
>:>> pci0: at device 15.1 (no driver attached)
>:>> pcib2: at pcibus 2 on motherboard
>:>> pci2: on pcib2
>:>> pcib3: at device 2.0 on pci2
>:>> pci3: on pcib3
>:>> IOAPIC #1 intpin 11 -> irq 20
>:>> IOAPIC #1 intpin 8 -> irq 21
>:>> pcib4: at device 0.0 on pci3
>:>> pci4: on pcib4
>:>> IOAPIC #1 intpin 10 -> irq 22
>:>> amr0: mem 0xf0000000-0xf3ffffff irq 22 at device 0.0 on pci4
>:>> amr0: Firmware G170, BIOS F316, 64MB RAM
>:>> pci3: at device 1.0 (no driver attached)
>:>> pci3: at device 2.0 (no driver attached)
>:>> orm0: