Date: Wed, 14 Apr 2004 18:19:39 -0700
From: Rick Updegrove <dislists@updegrove.net>
To: freebsd-stable@freebsd.org
Subject: Re: 4.9 SMP Stability?
Message-ID: <407DE32B.8040304@updegrove.net>
In-Reply-To: <20040415000022.GA57253@xor.obsecurity.org>
References: <40770C0A.3000000@updegrove.net> <407979F3.20501@freebsd.org>
    <407C5AED.9040709@updegrove.net> <407C76A6.5080502@users.sourceforge.net>
    <407CA3D6.2090803@updegrove.net> <20040414083216.A45296@server.gisp.dk>
    <407D466E.9060900@updegrove.net> <407DBD39.6020405@updegrove.net>
    <20040414232312.GA56901@xor.obsecurity.org> <407DCB29.8010109@updegrove.net>
    <20040415000022.GA57253@xor.obsecurity.org>

Kris Kennaway wrote:

> First verify that

Obviously, I am going to have to change one thing at a time and wait for
the crash (and let the disks take the beating), or I will have no way to
know what exactly is happening. So, I will start at the top and work my
way down this list.

> * You have an up-to-date BIOS on the system. A lot of systems have
> buggy BIOSes, and this is frequently the cause of "mysterious crashes"
> especially for advanced features like SMP.

I am running HP 4.06.33 PL. At your request I will update to 4.06.43 PL.

I will do this just as soon as I send this reply, which has more
questions I need answered. Besides, I need to run the new BIOS with the
4.10-BETA kernel until it crashes to eliminate the BIOS as a suspect,
right?

> * You have not fiddled with options in the BIOS. Playing with things
> like memory timing and other BIOS features can cause crashes.

I have changed one setting, which stopped the "locking up with no
rebooting". See
http://lists.freebsd.org/pipermail/freebsd-stable/2003-July/002230.html

I got an off-list reply which suggested I do the following:

I went into the BIOS and selected:

Configuration -> PCI Slot Devices -> PCI IRQ Locking -> Routing Algorithm [Smart]

OK, I changed Routing Algorithm [Smart] to [Fixed] and got a scary
warning about data loss etc., but I hit Yes, saved as prompted, and
rebooted.

> * The hardware is all in order, you don't have mismatched components
> like CPUs with different steppings, etc.

This may sound silly, but how do I verify this? (I have attached dmesg -a
at the bottom of this email in case that helps.)

> These three points hold *whether or not an older version of FreeBSD
> works for you*, because different versions of FreeBSD interact in
> different ways with the hardware, and a previously existing problem
> may suddenly leap out at you when you run a different version.

Sorry, but to me the above paragraph is confusing. I don't agree with
what I think it says.
The hardware runs just fine with 4.8-STABLE, so I don't think you can
convince me that my hardware is the cause of this problem.

> * you're not using out-of-date kernel modules, since in general they
> must be rebuilt whenever you update your kernel.

How do I verify this? The procedure I follow, after once doing:

mkdir /root/kernels
cp GENERIC /root/kernels/MYKERNEL

is:

cp -Rp /etc /etc.old
cd /usr
rm -rf src/*
rm -rf obj/*
cd /usr/src
/usr/local/bin/cvsup -g -L 2 /etc/stable-supfile
cd /usr/src/sys/i386/conf
ln -s /root/kernels/MYKERNEL
/usr/sbin/config MYKERNEL
cd ../../compile/MYKERNEL
make depend
cd /usr/src
make -j4 buildworld
cd /usr/src
make buildkernel KERNCONF=MYKERNEL
make installkernel KERNCONF=MYKERNEL
make installworld
cd /dev
/bin/sh MAKEDEV all
cd /usr/src/release/sysinstall
make all install
shutdown -r now

Am I missing anything specific? If you just point me to the handbook, I
will refer back to my question: "Am I missing anything specific?"

> You said the machine panicked.

I said the machine reboots without any warning and without leaving
anything useful in any of the logs.

> When you encounter a panic, the useful
> thing to do is to obtain a debugging traceback, as described in the
> developers handbook.
>
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
>
> Your bug report will be more useful the more relevant details you can
> provide about it. For example, provide a copy of boot -v, and details
> of what you are doing to provoke the problem, what you have tried to
> work around it, and any other partial results you might have.

boot -v
-bash: boot: command not found

Again, I am doing nothing to provoke the problem. I check the uptime
from time to time and I notice that it has rebooted.

So far I have been unable to obtain any useful information by following
the handbook. However, I think I have made some progress in that area.

#/etc/rc.conf
dumpdev=/dev/amrd0s1b
savecore=YES
dumpdir="/var/crash"

So, hopefully, when the machine crashes after the BIOS update, the above
changes to rc.conf and the debugging traceback (if I can obtain one)
will help.

> After all this, there's no guarantee that one of the volunteer
> developers will be able to jump on board to try to solve your problem
> straight away [1]. Debugging this kind of thing typically takes time,
> so if you don't have it to spare then you'll just have to put on a
> happy face and accept that you can't put in the work needed to track
> newer versions of FreeBSD on your machine.

Yep, I know, but I feel like I must try anyway. :)

> Kris
>
> [1] of course, you always have the option to pay an expert to
> investigate the problem.

Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD 4.10-BETA #0: Tue Apr 13 21:49:08 PDT 2004
    root@govmail.ca.gov:/usr/obj/usr/src/sys/SMP
Timecounter "i8254" frequency 1193182 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (499.15-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0x673 Stepping = 3
  Features=0x387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,PN,MMX,FXSR,SSE>
real memory = 536870912 (524288K bytes)
avail memory = 519507968 (507332K bytes)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
FreeBSD/SMP: Multiprocessor motherboard: 2 CPUs
 cpu0 (BSP): apic id: 1, version: 0x00040011, at 0xfee00000
 cpu1 (AP): apic id: 0, version: 0x00040011, at 0xfee00000
 io0 (APIC): apic id: 2, version: 0x00170011, at 0xfec00000
Preloaded elf kernel "kernel" at 0xc0329000.
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 14 entries at 0xc00fdee0
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Intel 82443BX host to PCI bridge (AGP disabled)> on motherboard
IOAPIC #0 intpin 19 -> irq 2
IOAPIC #0 intpin 17 -> irq 16
pci0: <PCI bus> on pcib0
isab0: <Intel 82371AB PCI to ISA bridge> at device 4.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel PIIX4 ATA33 controller> port 0xfcd0-0xfcdf at device 4.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pci0: <Intel 82371AB/EB (PIIX4) USB controller> at 4.2 irq 2
Timecounter "PIIX" frequency 3579545 Hz
chip1: <Intel 82371AB Power management controller> port 0x2180-0x218f at device 4.3 on pci0
pcib1: <PCI to PCI bridge (vendor=8086 device=0960)> at device 7.0 on pci0
IOAPIC #0 intpin 16 -> irq 17
pci1: <PCI bus> on pcib1
ahc0: <Adaptec 2940 Ultra SCSI adapter> port 0xe800-0xe8ff mem 0xfebfe000-0xfebfefff irq 17 at device 4.0 on pci1
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
pci1: <unknown card> (vendor=0x1000, dev=0x000c) at 7.0 irq 18
amr0: <LSILogic MegaRAID> mem 0xf0000000-0xf7ffffff irq 16 at device 7.1 on pci0
amr0: <Integrated HP NetRAID (T5)> Firmware D.02.05, BIOS B.01.04, 16MB RAM
pcib2: <DEC 21152 PCI-PCI bridge> at device 8.0 on pci0
pci2: <PCI bus> on pcib2
fxp0: <Intel 82558 Pro/100 Ethernet> port 0xdce0-0xdcff mem 0xfe900000-0xfe9fffff,0xefffe000-0xefffefff irq 16 at device 2.0 on pci2
fxp0: Ethernet address 00:90:27:b7:09:76
inphy0: <i82555 10/100 media interface> on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
pci0: <unknown card> (vendor=0x103c, dev=0x10c1) at 11.0
pci0: <Cirrus Logic GD5446 SVGA controller> at 13.0
orm0: <Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc87ff,0xc8800-0xc8fff,0xc9000-0xc97ff on isa0
pmtimer0 on isa0
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1: configured irq 3 not in bitmap of probed irqs 0
ppc0: parallel port not found.
APIC_IO: Testing 8254 interrupt delivery
APIC_IO: Broken MP table detected: 8254 is not connected to IOAPIC #0 intpin 2
APIC_IO: routing 8254 via 8259 and IOAPIC #0 intpin 0
ata0-slave: ATAPI identify retries exceeded
acd0: CDROM <CD-532E-B> at ata0-master PIO4
Waiting 15 seconds for SCSI devices to settle
amrd0: <LSILogic MegaRAID logical drive> on amr0
amrd0: 34708MB (71081984 sectors) RAID 5 (optimal)
SMP: AP CPU #1 Launched!
Mounting root from ufs:/dev/amrd0s1a
dumpon: crash dumps to /dev/amrd0s1b (133, 131073)
swapon: adding /dev/amrd0s1b as swap device
Automatic boot in progress...
/dev/amrd0s1a: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/amrd0s1a: clean, 17512 free (232 frags, 2160 blocks, 0.4% fragmentation)
/dev/amrd0s1f: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/amrd0s1f: clean, 108490 free (322 frags, 13521 blocks, 0.2% fragmentation)
/dev/amrd0s1g: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/amrd0s1g: clean, 11804820 free (392020 frags, 1426600 blocks, 2.4% fragmentation)
/dev/amrd0s1e: FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/amrd0s1e: clean, 314499 free (21563 frags, 36617 blocks, 4.2% fragmentation)
Doing initial network setup: hostname .
fxp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    inet 134.186.104.10 netmask 0xffffff00 broadcast 134.186.104.255
    ether 00:90:27:b7:09:76
    media: Ethernet 100baseTX <full-duplex>
    status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
    inet 127.0.0.1 netmask 0xff000000
add net default: gateway 134.186.104.62
Additional routing options: TCP keepalive=YES .
Routing daemons: .
Additional daemons: syslogd .
Checking for core dump: savecore: no core dump
Doing additional network setup: .
Starting final network daemons: .
ELF ldconfig path: /usr/lib /usr/lib/compat /usr/X11R6/lib /usr/local/lib
a.out ldconfig path: /usr/lib/aout /usr/lib/compat/aout /usr/X11R6/lib/aout
Starting standard daemons: cron sshd .
Initial rc.i386 initialization: .
Configuring syscons: blanktime .
Additional ABI support: .
Starting local daemons: starting svscan in /service [1] 96 .
Local package initialization: [Wed Apr 14 17:53:50 2004] [warn] Loaded DSO libexec/apache/libphp4.so uses plain Apache 1.3 API, this module might crash under EAPI! (please recompile it with -DEAPI)
apache Starting clamd mysqld (skipping samba.sh, not executable) Starting spamd sqwebmaild svscan .
Additional TCP options: .
Wed Apr 14 17:53:52 PDT 2004
Apr 14 17:55:41 govmail sshd[385]: error: PAM: Authentication failure
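
The boot messages above already show whether the crash-dump setup took effect: dumpon reports the dump device, and savecore reports whether a dump was found. A minimal sketch of pulling those two facts out of a saved copy of the boot messages follows. The function name dump_status and the idea of saving the output to a file first are assumptions for illustration, not part of any FreeBSD tool; the strings it matches are copied from the log above and may differ on other releases.

```shell
#!/bin/sh
# Hypothetical helper: summarise crash-dump readiness from saved boot messages.
# Usage: dump_status FILE   (FILE is assumed to hold a saved copy of `dmesg -a`)
dump_status() {
    file=$1

    # dumpon prints the dump device at boot, e.g.
    # "dumpon: crash dumps to /dev/amrd0s1b (133, 131073)".
    dev=$(sed -n 's/^dumpon: crash dumps to \([^ ]*\).*/\1/p' "$file" | head -n 1)
    if [ -n "$dev" ]; then
        echo "dump device: $dev"
    else
        echo "dump device: not configured"
    fi

    # In the log above, savecore prints "no core dump" when it found nothing.
    if grep -q 'savecore: no core dump' "$file"; then
        echo "saved dump: none"
    else
        echo "saved dump: no 'no core dump' message; check the dump directory"
    fi
}
```

For example, after `dmesg -a > /tmp/boot.txt`, running `dump_status /tmp/boot.txt` prints one line for the dump device and one for the savecore result. The check only reads the log; it does not confirm that a dump will actually be written at the next crash.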