Date: Wed, 6 Jul 2005 09:40:20 +0200 (CEST)
From: Blaz Zupan <blaz@si.FreeBSD.org>
To: freebsd-stable@freebsd.org
Cc: Kris Kennaway <kris@obsecurity.org>
Subject: Re: FreeBSD -STABLE servers repeatedly crashing.
Message-ID: <20050706093012.M3376@titanic.medinet.si>
In-Reply-To: <20050701184352.GA177@xor.obsecurity.org>
References: <42BF8815.6090909@atopia.net>
    <20050627081933.GA97832@cell.sick.ru>
    <42C16394.4040904@atopia.net>
    <1119971279.36316.45.camel@buffy.york.ac.uk>
    <42C16C0E.9090002@atopia.net>
    <20050629100535.GC27557@xor.obsecurity.org>
    <20050701184352.GA177@xor.obsecurity.org>

On Fri, 1 Jul 2005, Kris Kennaway wrote:
>> On Tue, Jun 28, 2005 at 11:26:06AM -0400, Matt Juszczak wrote:
>>> After CPUID: 1, the machine locks cold and nothing else is printed to
>>> the screen.
>>
>> Try two things:
>>
>> 1) adding 'options KDB_STOP_NMI' to your kernel config.
>
> I just learned that you also need to set the
> debug.kdb.stop_cpus_with_nmi=1 sysctl (e.g. in sysctl.conf).

I'm experiencing the same crashes as Matt, but on 5.4-RELEASE-p3. The
machine is an HP DL380 G3 and it is heavily loaded (postfix mail server
running amavisd-new with antivirus and antispam, so it has heavy IO and
CPU load). It does not survive more than a couple of hours, while it is
rock stable running 4.11.

We have four machines like this; three of them are now running 4.11
again and we left the fourth one at 5.4. We have two other DL380 servers
working on our outbound mail queue, but they are not SMP and they are
rock stable on 5.4.

Without KDB_STOP_NMI, the machine was basically stuck after a crash. Now
I've finally landed in the kernel debugger and I have a trace from DDB
and have also been able to generate a crashdump with "call doadump".

If a developer is willing to investigate, I have:

- the vmcore file from the crash (its size is 1GB)
- the corresponding kernel, compiled with debug symbols
- a GIF of the console showing the backtrace at the time of the crash
- a dmesg from the box (see below)
- the kernel config file

Please contact me if you want to investigate this further.

Just in case, here is a dmesg from the box:

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD 5.4-RELEASE-p3 #0: Tue Jul 5 18:37:15 CEST 2005
    blaz@bigbrother.amis.net:/usr/obj/usr/src5/sys/DL380
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 3.06GHz (3049.93-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0xf29 Stepping = 9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Hyperthreading: 2 logical CPUs
real memory = 1073717248 (1023 MB)
avail memory = 1045372928 (996 MB)
ACPI APIC Table: <COMPAQ 00000083>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1
 cpu2 (AP): APIC ID: 6
 cpu3 (AP): APIC ID: 7
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 <Version 1.1> irqs 0-15 on motherboard
ioapic1 <Version 1.1> irqs 16-31 on motherboard
ioapic2 <Version 1.1> irqs 32-47 on motherboard
ioapic3 <Version 1.1> irqs 48-63 on motherboard
npx0: <math processor> on motherboard
npx0: INT 16 interface
acpi0: <COMPAQ P29> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 1000
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x920-0x923 on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> on acpi0
pci0: <ACPI PCI bus> on pcib0
pci0: <display, VGA> at device 3.0 (no driver attached)
pci0: <base peripheral> at device 4.0 (no driver attached)
pci0: <base peripheral> at device 4.2 (no driver attached)
isab0: <PCI-ISA bridge> at device 15.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <ServerWorks CSB5 UDMA100 controller> port 0x2000-0x200f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 15.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0
ohci0: <OHCI (generic) USB controller> mem 0xf5ef0000-0xf5ef0fff irq 7 at device 15.2 on pci0
usb0: OHCI version 1.0, legacy support
usb0: SMM does not respond, resetting
usb0: <OHCI (generic) USB controller> on ohci0
usb0: USB revision 1.0
uhub0: (0x1166) OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
pcib1: <ACPI Host-PCI bridge> on acpi0
pci1: <ACPI PCI bus> on pcib1
ciss0: <Compaq Smart Array 5i> port 0x3000-0x30ff mem 0xf7cf0000-0xf7cf3fff,0xf7dc0000-0xf7dfffff irq 30 at device 3.0 on pci1
pcib2: <ACPI Host-PCI bridge> on acpi0
pci2: <ACPI PCI bus> on pcib2
bge0: <Broadcom BCM5703 Gigabit Ethernet, ASIC rev. 0x1002> mem 0xf7ef0000-0xf7efffff irq 29 at device 1.0 on pci2
miibus0: <MII bus> on bge0
brgphy0: <BCM5703 10/100/1000baseTX PHY> on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge0: Ethernet address: 00:0e:7f:20:22:91
bge1: <Broadcom BCM5703 Gigabit Ethernet, ASIC rev. 0x1002> mem 0xf7ee0000-0xf7eeffff irq 31 at device 2.0 on pci2
miibus1: <MII bus> on bge1
brgphy1: <BCM5703 10/100/1000baseTX PHY> on miibus1
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge1: Ethernet address: 00:0e:7f:20:22:90
pcib3: <ACPI Host-PCI bridge> on acpi0
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI Host-PCI bridge> on acpi0
pci6: <ACPI PCI bus> on pcib4
pci6: <base peripheral, PCI hot-plug controller> at device 30.0 (no driver attached)
acpi_tz0: <Thermal Zone> on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
sio0: <Standard PC COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A
fdc0: <floppy drive controller (FDE)> port 0x3f2-0x3f5 irq 6 drq 2 on acpi0
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
orm0: <ISA Option ROMs> at iomem 0xee000-0xeffff,0xcc000-0xcd7ff,0xc8000-0xcbfff,0xc0000-0xc7fff on isa0
pmtimer0 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 10.000 msec
IP Filter: v3.4.35 initialized. Default = pass all, Logging = enabled
acd0: CDROM <COMPAQ CD-ROM SN-124/N104> at ata0-master PIO4
SMP: AP CPU #3 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
da0 at ciss0 bus 0 target 0 lun 0
da0: <COMPAQ RAID 5 VOLUME OK> Fixed Direct Access SCSI-0 device
da0: 135.168MB/s transfers
da0: 69455MB (142245120 512 byte sectors: 255H 32S/T 17432C)
Mounting root from ufs:/dev/da0s1a
WARNING: / was not properly dismounted
WARNING: /usr was not properly dismounted
WARNING: /var was not properly dismounted
WARNING: /spool was not properly dismounted
/spool: mount pending error: blocks 5484 files 14
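
For anyone reproducing the debugging setup from the quoted exchange, the two settings might look roughly like the fragment below. Only 'options KDB_STOP_NMI' and the debug.kdb.stop_cpus_with_nmi=1 sysctl are named in the thread. The KDB, DDB and DEBUG=-g lines are assumptions about what a kernel with the debugger and debug symbols would usually also carry; they are not copied from the DL380 config file mentioned in the message.

```
# Kernel config fragment (illustrative only; not the actual DL380 config)
options 	KDB_STOP_NMI	# from the thread: use an NMI to stop the other CPUs
options 	KDB		# assumption: kernel debugger framework
options 	DDB		# assumption: interactive DDB backend
makeoptions	DEBUG=-g	# assumption: build the kernel with debug symbols

# /etc/sysctl.conf
# from the thread: also required for KDB_STOP_NMI to take effect
debug.kdb.stop_cpus_with_nmi=1
```

The kernel option needs a rebuild and reboot. The sysctl can be placed in sysctl.conf, as suggested in the quoted text, so that it is applied at every boot.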