Date: Thu, 21 Aug 2003 12:22:17 +0200 (CEST)
From: "Hartmann, O."
To: Colin Faber
cc: freebsd-current@freebsd.org
cc: freebsd-smp@freebsd.org
Subject: Re: (2) 5.1-R-p2 crashes on SMP with AMI RAID and Intel 1000/Pro
In-Reply-To: <3F43BB52.5060503@fpsn.net>
Message-ID: <20030821113436.G17320@klima.physik.uni-mainz.de>
References: <20030813103509.Q49991@mail.physik.uni-mainz.de> <3F43BB52.5060503@fpsn.net>

On Wed, 20 Aug 2003, Colin Faber wrote:

Hi.

I first swapped the Intel 1000/PRO server NIC into the next slot, and since
then the machine seems to be 'stable'. Then, two days later, I changed the
PSUs to 400W units. I think it's an IRQ routing problem, since we have had
this problem (spontaneous reboots) from FreeBSD 4.0 on. Changing the slot
for the NIC helped, but this is a very bad state. I cannot remember the
error message I got when the system crashed, but it looked like yours, and
I always saw the amr0 text in that message.
ACPI is not working on the old TYAN Thunder 2500 (S1867) main PCB. I also
set machdep.cpu_idle_hlt = 0, but with no effect. At the moment, I do not
dare to swap the NIC again, because the machine is in a preliminary
production state.

I also noticed some weird things when creating and deleting files when the
system crashed. Crashes could always be forced by accessing samba services
from a PC. Crashes always occurred when heavy IO was done, but this could
also be evidence of an IRQ problem, I think. I do not know. The machine was
'stable' (that is, with the NIC in the crash-causing slot) for a whole
night, but whenever our department 'got started' in the morning and heavy
IO was done, the machine froze. This changed when I swapped the NIC to
another slot! And now I also have two 400W PSUs.

FreeBSD 5.1-p2 on the TYAN S1867 seems to have many more problems, but I do
not know whether I should mention them here. truss, for instance, crashes.
We use afbackup for backups, but afbackup dumps core on this machine, and
it does not on a UP machine also running FreeBSD 5.1-p2. It also crashes
with a UP kernel on this machine. I tried to 'truss' an afrestore call, but
I had to start the tracing three or four times because I got this error the
first time:

  truss: PIOCWAIT: Input/output error

or something like this:

  root: /usr/local/samba/lib: truss -fae -o /tmp/afrestore afrestore -v -p "/usr/homes/kurs*" -C /
  truss: PIOCWAIT top of loop: Input/output error
  truss: PIOCWAIT top of loop: Input/output error
  truss: PIOCWAIT top of loop: Input/output error
  truss: PIOCWAIT top of loop: Input/output error

Or sometimes truss stops for lack of a /proc/PID-XXX/mem file. But calling
it a few more times will 'solve' the problem.
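
The restart-by-hand described above could be scripted. What follows is only
a minimal sketch for a POSIX sh: the retry helper and the attempt count of
5 are hypothetical and not part of truss or FreeBSD, and it assumes that
the traced command exits non-zero when tracing fails, which is not
confirmed here for truss.

```shell
#!/bin/sh
# retry MAX CMD [ARGS...]: run CMD until it exits 0, at most MAX times.
# Hypothetical helper for illustration; not part of truss or FreeBSD.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # Run the command as given; stop at the first success.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
    done
    # Every attempt failed.
    return 1
}

# Possible use with the command line quoted above (attempt count assumed):
# retry 5 truss -fae -o /tmp/afrestore afrestore -v -p "/usr/homes/kurs*" -C /
```

A bounded count is used here on purpose: on a machine where the traced
command can take the kernel down, an endless retry loop would not be a
good idea.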
While writing this, I crashed the system with the command shown above. This
is the error message from the kernel when the system froze (I wrote it down
from the screen):

  Fatal trap 12: page fault while in kernel mode
  cpuid = 1; lapic.id = 00000000
  fault virtual address   = 0x24
  fault code              = supervisor read, page not present
  instruction pointer     = 0x8:0xc01b29db
  stack pointer           = 0x10:0xe8ff3b70
  frame pointer           = 0x10:0xe8ff3b84
  code segment            = base 0x0, limit 0xfffff, type 0x1b
                          = DPL 0, pres 1, def 32, gran 1
  processor eflags        = interrupt enabled, resume, IOPL = 0
  current process         = 27510 (bunzip2)
  trap number             = 12
  panic: page fault
  cpuid = 1, lapic.id = 00000000
  boot() called on cpu#1

  syncing disks, buffers remaining ...
  panic: absolutely cannot call smp_ipi_shutdown with interrupts already disabled
  cpuid = 1; lapic.id = 00000000
  boot() called on cpu#1
  Uptime 1d20h18m55s
  pfs_vncache_unload(): 6 entries remaining

  Fatal double fault:
  eip = 0xc03134ic
  esp = 0xe8ff1ff8
  ebp = 0xe8ff2014
  cpuid = 1, lapic.id = 00000000
  panic: double fault
  cpuid = 1, lapic.id = 00000000
  boot() called on cpu#1
  Uptime: 1d20h18m55s
  pfs_vncache_unload(): 6 entries remaining

After this, the machine was dead.

:>Hi,
:>
:>I've got nearly the same setup in a Dell 1600SC with a gig of ram and a PERC4/Sc (LSI MegaRAID) card.
:>
:>Dual 2.4GHz Xeon P4 HT CPUs and I've discovered I can lock up FreeBSD 5.1-RELEASE-p2 on command
:>simply by running something to quickly create and remove a directory, e.g.:
:>
:>   perl -e 'for(my $i = 0 ; $i < 9999; $i++){ mkdir("abc"); rmdir("abc"); }'
:>
:>
:>Having machdep.cpu_idle_hlt = 0 makes no difference.
:>
:>
:>Kernel:
:>   FreeBSD 5.1-RELEASE-p2 FreeBSD 5.1-RELEASE-p2 #0: Mon Aug 11 21:40:47 MDT 2003 i386
:>
:>Raid:
:>   amr0: mem 0xfcd00000-0xfcd0ffff irq 3 at device 2.0 on pci1
:>   amrd0: on amr0
:>   amrd0: 34556MB (70770688 sectors) RAID 5 (optimal)
:>
:>
:>I suspect that your problems and mine are related to the amr driver and may be exposing
:>some other problem within the kernel's fs locking. I don't think (as others have suggested) that
:>your issue is power related, or related to the combination of hardware you're using. (Other than
:>the fact that you've got a MegaRAID card).
:>
:>The exact crash message I'm seeing is:
:>
:>panic: lockmgr: locking against myself
:>cpuid = 0; lapic.id 00000000
:>boot() called on cpu#0
:>
:>syncing disks, buffers remaining... panic: ffs_copyonwrite: recursive call
:>cpuid = 0; lapic.id 00000000
:>boot() called on cpu#0
:>Uptime: 58s
:>pfs_vncache_unload(): 7 entries remaining
:>amr0: flushing cache...done
:>Terminate ACPI
:>
:>
:>
:>Hartmann, O. wrote:
:>
:>> Dear Sirs.
:>>
:>> It seems to me to be a never-ending story. We run a box with a TYAN Thunder
:>> 2500 Dual SMP mainboard, 2GB ECC Tyan certified memory, AMI Enterprise
:>> 1600 RAID adapter and an additional Intel 1000/Pro server type (64 bit)
:>> GBit LAN NIC. With FreeBSD 4.8 this was stable, but achieving this
:>> state was really hard! It is a story similar to what happened when
:>> we changed to FreeBSD 5.1-RELEASE-p2 on this machine.
:>>
:>> It seems to depend heavily on which PCI slot the cards are
:>> attached to, so I will report this here also.
:>>
:>> Phenomenon:
:>>
:>> After the machine has been running for a while, the SMP kernel reboots
:>> spontaneously. This happens when heavy IO is done, when compiling or when, in the
:>> morning, our department gets up and our staff connects to the samba
:>> server.
:>>
:>> Depending on which devices are switched on or off by the BIOS, the kernel
:>> freezes at the stage when the amr0 RAID got recognized.
:>> I can avoid this by enabling the built in NIC (fxp0). I can force this
:>> by putting the em0 NIC into another slot, for instance in the one
:>> remaining 64BIT/66MHz slot (which should be a separate bus).
:>>
:>> This 'game' was identical to the one I had with FreeBSD 4.X - 4.8, and I
:>> found out that putting an additional NIC into PCI slot No. 2 (counted
:>> from the AGP slot on) cleared things up, but using both NICs together
:>> (either the additional fxp0 or the new em0) leaves the system completely
:>> unstable.
:>>
:>> In FreeBSD 5.1-RELEASE-p2 and especially in FreeBSD 5.1-CURRENT this
:>> 'gambling' seems to reach its climax. My kernel is built with
:>> SCHED_4BSD because SCHED_ULE and ADAPTIVE_MUTEXES crash immediately
:>> in the same way as described (running a while, then dumping core or freezing
:>> at the stage after the amr0 RAID showed up in the kernel boot messages,
:>> see the dmesg output below).
:>>
:>> I'm not a hardware expert, but all this weird stuff looks to me like
:>> an IRQ routing problem. I fiddled around with many hand-assigned IRQ configurations,
:>> but nothing helped. Either the Intel 1000/Pro or the AMI RAID is causing
:>> problems in the TYAN Thunder 2500 SMP environment.
:>>
:>> We also have an SMP machine with similar hardware, based on an ASUS CV4X-D,
:>> AMI Elite 1600 RAID controller and the same Intel em0 1GBit NIC. The OS is
:>> FreeBSD 4.8 and this system never had any problem!
:>>
:>> I feel a little bit helpless at the moment, because I think I tried every trick
:>> and something seems to be wrong with the combination of TYAN Thunder 2500 and FreeBSD
:>> 5.X SMP. It is also very curious that a kernel without SMP/IO_APIC freezes after
:>> booting at the same place (amr0 RAID recognition).
:>>
:>> Is there any help out there?
:>>
:>> I attach the kernel config file and the dmesg output. Please note: I disabled both
:>> serial ports, the parallel port, sound and usb to get additional IRQs.
:>> But I have to enable the built in NIC to get a bootable, but unstable
:>> FreeBSD 5.1-R box.
:>>
:>> ====================================
:>> DMESG output
:>> ====================================
:>>
:>> Copyright (c) 1992-2003 The FreeBSD Project.
:>> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
:>> The Regents of the University of California. All rights reserved.
:>> FreeBSD 5.1-RELEASE-p2 #14: Wed Aug 13 09:47:00 CEST 2003
:>> root@atmos.physik.uni-mainz.de:/usr/obj/usr/src/sys/ATMOS
:>> Preloaded elf kernel "/boot/kernel/kernel" at 0xc0458000.
:>> Timecounter "i8254" frequency 1193182 Hz
:>> Timecounter "TSC" frequency 868644793 Hz
:>> CPU: Intel Pentium III (868.64-MHz 686-class CPU)
:>> Origin = "GenuineIntel" Id = 0x683 Stepping = 3
:>> Features=0x387fbff
:>> real memory = 2147483648 (2048 MB)
:>> avail memory = 2085625856 (1989 MB)
:>> Programming 16 pins in IOAPIC #0
:>> IOAPIC #0 intpin 2 -> irq 0
:>> Programming 16 pins in IOAPIC #1
:>> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
:>> cpu0 (BSP): apic id: 1, version: 0x00040011, at 0xfee00000
:>> cpu1 (AP): apic id: 0, version: 0x00040011, at 0xfee00000
:>> io0 (APIC): apic id: 2, version: 0x000f0011, at 0xfec00000
:>> io1 (APIC): apic id: 3, version: 0x000f0011, at 0xfec01000
:>> netsmb_dev: loaded
:>> Pentium Pro MTRR support enabled
:>> npx0: on motherboard
:>> npx0: INT 16 interface
:>> pcibios: BIOS version 2.10
:>> Using $PIR table, 12 entries at 0xc00fdf00
:>> pcib0: at pcibus 0 on motherboard
:>> pci0: on pcib0
:>> IOAPIC #1 intpin 13 -> irq 2
:>> IOAPIC #1 intpin 12 -> irq 16
:>> IOAPIC #1 intpin 2 -> irq 17
:>> IOAPIC #1 intpin 7 -> irq 18
:>> pcib1: at device 0.1 on pci0
:>> pci1: on pcib1
:>> IOAPIC #1 intpin 1 -> irq 19
:>> pci1: at device 0.0 (no driver attached)
:>> sym0: <896> port 0xf800-0xf8ff mem 0xfeafe000-0xfeafffff,0xfeafac00-0xfeafafff irq 2 at device 1.0 on pci0
:>> sym0: Symbios NVRAM, ID 7, Fast-40, SE, parity checking
:>> sym0: open drain IRQ line driver, using on-chip SRAM
:>> sym0: using LOAD/STORE-based firmware.
:>> sym0: handling phase mismatch from SCRIPTS.
:>> sym1: <896> port 0xf400-0xf4ff mem 0xfeafc000-0xfeafdfff,0xfeafa800-0xfeafabff irq 16 at device 1.1 on pci0
:>> sym1: Symbios NVRAM, ID 7, Fast-40, LVD, parity checking
:>> sym1: open drain IRQ line driver, using on-chip SRAM
:>> sym1: using LOAD/STORE-based firmware.
:>> sym1: handling phase mismatch from SCRIPTS.
:>> em0: port 0xfcc0-0xfcff mem 0xfeac0000-0xfeadffff irq 17 at device 4.0 on pci0
:>> em0: Speed:1000 Mbps Duplex:Full
:>> fxp0: port 0xfc40-0xfc7f mem 0xfe900000-0xfe9fffff,0xfeaf9000-0xfeaf9fff irq 18 at device 7.0 on pci0
:>> fxp0: Ethernet address 00:e0:81:00:f0:d7
:>> miibus0: on fxp0
:>> inphy0: on miibus0
:>> inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
:>> isab0: port 0x500-0x50f at device 15.0 on pci0
:>> isa0: on isab0
:>> pci0: at device 15.1 (no driver attached)
:>> pcib2: at pcibus 2 on motherboard
:>> pci2: on pcib2
:>> pcib3: at device 2.0 on pci2
:>> pci3: on pcib3
:>> IOAPIC #1 intpin 11 -> irq 20
:>> IOAPIC #1 intpin 8 -> irq 21
:>> pcib4: at device 0.0 on pci3
:>> pci4: on pcib4
:>> IOAPIC #1 intpin 10 -> irq 22
:>> amr0: mem 0xf0000000-0xf3ffffff irq 22 at device 0.0 on pci4
:>> amr0: Firmware G170, BIOS F316, 64MB RAM
:>> pci3: at device 1.0 (no driver attached)
:>> pci3: at device 2.0 (no driver attached)
:>> orm0: