Date:      Wed, 14 Apr 2004 18:19:39 -0700
From:      Rick Updegrove <dislists@updegrove.net>
To:        freebsd-stable@freebsd.org
Subject:   Re: 4.9 SMP Stability?
Message-ID:  <407DE32B.8040304@updegrove.net>
In-Reply-To: <20040415000022.GA57253@xor.obsecurity.org>
References:  <40770C0A.3000000@updegrove.net> <407979F3.20501@freebsd.org> <407C5AED.9040709@updegrove.net> <407C76A6.5080502@users.sourceforge.net> <407CA3D6.2090803@updegrove.net> <20040414083216.A45296@server.gisp.dk> <407D466E.9060900@updegrove.net> <407DBD39.6020405@updegrove.net> <20040414232312.GA56901@xor.obsecurity.org> <407DCB29.8010109@updegrove.net> <20040415000022.GA57253@xor.obsecurity.org>

Kris Kennaway wrote:

> First verify that

Obviously, I am going to have to change one thing at a time, wait for
the crash (and let the disks take the beating) or I will have no way to
know what exactly is happening.

So, I will start at the top and work my way down this list.

> * You have an up-to-date BIOS on the system.  A lot of systems have
> buggy BIOSes, and this is frequently the cause of "mysterious crashes"
> especially for advanced features like SMP.

I am running HP 4.06.33 PL.
At your request I will update to 4.06.43 PL.

I will do this just as soon as I send this reply, which has more
questions I need answered.  Besides, I need to run the new BIOS with the
4.10-BETA kernel until it crashes to eliminate the BIOS as a suspect, right?

> * You have not fiddled with options in the BIOS.  Playing with things
> like memory timing and other BIOS features can cause crashes.

I have changed one setting, which stopped the "locking up with no
rebooting".

See http://lists.freebsd.org/pipermail/freebsd-stable/2003-July/002230.html

I got an off-list reply which suggested I do the following:

I went into the BIOS and selected:
Configuration
-> PCI Slot Devices
-> PCI IRQ Locking
-> Routing Algorithm [Smart]

OK, I changed Routing Algorithm [Smart] to [Fixed] and got a scary
warning about data loss etc., but I hit Yes, saved as prompted, and
rebooted.

> * The hardware is all in order, you don't have mismatched components
> like CPUs with different steppings, etc.

This may sound silly but how do I verify this?

(I have attached dmesg -a at the bottom of this email in case that helps)

> These three points hold *whether or not an older version of FreeBSD
> works for you*, because different versions of FreeBSD interact in
> different ways with the hardware, and a previously existing problem
> may suddenly leap out at you when you run a different version.

Sorry but to me the above paragraph is confusing.  I don't agree with
what I think it says.

The hardware runs just fine with 4.8-STABLE so I don't think you can 
convince me that my hardware is the cause of this problem.

> * you're not using out-of-date kernel modules, since in general they
> must be rebuilt whenever you update your kernel.

How do I verify this?

The procedure I follow, after having once done:
mkdir /root/kernels
cp GENERIC /root/kernels/MYKERNEL

is:
cp -Rp /etc /etc.old
cd /usr
rm -rf src/*
rm -rf obj/*
cd /usr/src
/usr/local/bin/cvsup -g -L 2 /etc/stable-supfile
cd /usr/src/sys/i386/conf
ln -s /root/kernels/MYKERNEL
/usr/sbin/config MYKERNEL
cd ../../compile/MYKERNEL
make depend
cd /usr/src
make -j4 buildworld
cd /usr/src
make buildkernel KERNCONF=MYKERNEL
make installkernel KERNCONF=MYKERNEL
make installworld
cd /dev
/bin/sh MAKEDEV all
cd /usr/src/release/sysinstall
make all install
shutdown -r now

Am I missing anything specific?
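
As for the modules themselves, one rough check I could run, purely as a
sketch: list any .ko files that are not newer than the installed kernel.
The function name stale_modules is made up for illustration, and /kernel
and /modules are what I assume the 4.x default locations to be.

```shell
#!/bin/sh
# stale_modules KERNEL MODDIR  (hypothetical helper, not a system tool)
# Print every *.ko file under MODDIR whose modification time is not
# newer than KERNEL, i.e. modules that were probably not reinstalled
# together with the current kernel.
stale_modules() {
    kernel=$1
    moddir=$2
    find "$moddir" -type f -name '*.ko' ! -newer "$kernel" -print
}

# Example call, assuming the 4.x default locations:
#   stale_modules /kernel /modules
```

An empty result would only be a hint, not proof, since it compares file
times and nothing else.  kldstat shows which modules are actually loaded.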

If you just point me to the handbook I will refer back to my question:
"Am I missing anything specific?"

> You said the machine panicked.  

I said the machine reboots without any warning and without leaving
anything useful in any of the logs.

> When you encounter a panic, the useful
> thing to do is to obtain a debugging traceback, as described in the
> developers handbook.
> 
>   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html
> 
> Your bug report will be more useful the more relevant details you can
> provide about it.  For example, provide a copy of boot -v, and details
> of what you are doing to provoke the problem, what you have tried to
> work around it, and any other partial results you might have.

boot -v
-bash: boot: command not found
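
(As far as I can tell, "boot -v" is meant to be typed at the boot loader
prompt rather than in a shell, which would explain the error above.  A
sketch of the persistent equivalent, assuming the stock loader reads
/boot/loader.conf at boot:)

```shell
# /boot/loader.conf
# Assumption: the stock FreeBSD loader reads this file at boot.
# Request verbose kernel messages on every boot, like "boot -v".
boot_verbose="YES"
```

After the next reboot, dmesg should then carry the extra detail.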

Again, I am doing nothing to provoke the problem.  I check the uptime
from time to time and notice that it has rebooted.

So far I have been unable to obtain any useful information by following
the handbook.  However, I think I have made some progress in that area.

#/etc/rc.conf
dumpdev=/dev/amrd0s1b
savecore=YES
dumpdir="/var/crash"

So hopefully, when the machine next crashes after the BIOS update, the
above changes to rc.conf and the debugging traceback (if I can obtain
one) will help.
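
As I understand it, a crash dump on 4.x is a full image of physical
memory, so the dump device has to be at least as large as RAM.  A small
sketch of that check follows.  The function name dump_fits is made up,
the memory figure is the "real memory" line from the dmesg below, and
the swap size is only a placeholder, not a measured value.

```shell
#!/bin/sh
# dump_fits MEM_BYTES DUMPDEV_KB  (hypothetical helper)
# Succeed if a dump device of DUMPDEV_KB kilobytes can hold a full
# memory image of MEM_BYTES bytes.
dump_fits() {
    mem_kb=$(( $1 / 1024 ))
    [ "$2" -ge "$mem_kb" ]
}

# 536870912 bytes is the "real memory" value from dmesg;
# 1048576 KB is an assumed swap size used only for illustration.
if dump_fits 536870912 1048576; then
    echo "dump device is large enough"
else
    echo "dump device is too small for a full dump"
fi
```

The real size of the swap partition can be read from swapinfo.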

> After all this, there's no guarantee that one of the volunteer
> developers will be able to jump on board to try to solve your problem
> straight away [1].  Debugging this kind of thing typically takes time,
> so if you don't have it to spare then you'll just have to put on a
> happy face and accept that you can't put in the work needed to track
> newer versions of FreeBSD on your machine.

Yep I know but I feel like I must try anyway. :)

> Kris
> 
> [1] of course, you always have the option to pay an expert to
> investigate the problem.


Copyright (c) 1992-2003 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD 4.10-BETA #0: Tue Apr 13 21:49:08 PDT 2004
     root@govmail.ca.gov:/usr/obj/usr/src/sys/SMP
Timecounter "i8254"  frequency 1193182 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (499.15-MHz 686-class CPU)
   Origin = "GenuineIntel"  Id = 0x673  Stepping = 3
  Features=0x387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,PN,MMX,FXSR,SSE>
real memory  = 536870912 (524288K bytes)
avail memory = 519507968 (507332K bytes)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
FreeBSD/SMP: Multiprocessor motherboard: 2 CPUs
  cpu0 (BSP): apic id:  1, version: 0x00040011, at 0xfee00000
  cpu1 (AP):  apic id:  0, version: 0x00040011, at 0xfee00000
  io0 (APIC): apic id:  2, version: 0x00170011, at 0xfec00000
Preloaded elf kernel "kernel" at 0xc0329000.
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 14 entries at 0xc00fdee0
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Intel 82443BX host to PCI bridge (AGP disabled)> on motherboard
IOAPIC #0 intpin 19 -> irq 2
IOAPIC #0 intpin 17 -> irq 16
pci0: <PCI bus> on pcib0
isab0: <Intel 82371AB PCI to ISA bridge> at device 4.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel PIIX4 ATA33 controller> port 0xfcd0-0xfcdf at device 4.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pci0: <Intel 82371AB/EB (PIIX4) USB controller> at 4.2 irq 2
Timecounter "PIIX"  frequency 3579545 Hz
chip1: <Intel 82371AB Power management controller> port 0x2180-0x218f at device 4.3 on pci0
pcib1: <PCI to PCI bridge (vendor=8086 device=0960)> at device 7.0 on pci0
IOAPIC #0 intpin 16 -> irq 17
pci1: <PCI bus> on pcib1
ahc0: <Adaptec 2940 Ultra SCSI adapter> port 0xe800-0xe8ff mem 0xfebfe000-0xfebfefff irq 17 at device 4.0 on pci1
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
pci1: <unknown card> (vendor=0x1000, dev=0x000c) at 7.0 irq 18
amr0: <LSILogic MegaRAID> mem 0xf0000000-0xf7ffffff irq 16 at device 7.1 on pci0
amr0: <Integrated HP NetRAID (T5)> Firmware D.02.05, BIOS B.01.04, 16MB RAM
pcib2: <DEC 21152 PCI-PCI bridge> at device 8.0 on pci0
pci2: <PCI bus> on pcib2
fxp0: <Intel 82558 Pro/100 Ethernet> port 0xdce0-0xdcff mem 0xfe900000-0xfe9fffff,0xefffe000-0xefffefff irq 16 at device 2.0 on pci2
fxp0: Ethernet address 00:90:27:b7:09:76
inphy0: <i82555 10/100 media interface> on miibus0
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
pci0: <unknown card> (vendor=0x103c, dev=0x10c1) at 11.0
pci0: <Cirrus Logic GD5446 SVGA controller> at 13.0
orm0: <Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc87ff,0xc8800-0xc8fff,0xc9000-0xc97ff on isa0
pmtimer0 on isa0
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model Generic PS/2 mouse, device ID 0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1: configured irq 3 not in bitmap of probed irqs 0
ppc0: parallel port not found.
APIC_IO: Testing 8254 interrupt delivery
APIC_IO: Broken MP table detected: 8254 is not connected to IOAPIC #0 intpin 2
APIC_IO: routing 8254 via 8259 and IOAPIC #0 intpin 0
ata0-slave: ATAPI identify retries exceeded
acd0: CDROM <CD-532E-B> at ata0-master PIO4
Waiting 15 seconds for SCSI devices to settle
amrd0: <LSILogic MegaRAID logical drive> on amr0
amrd0: 34708MB (71081984 sectors) RAID 5 (optimal)
SMP: AP CPU #1 Launched!
Mounting root from ufs:/dev/amrd0s1a
dumpon: crash dumps to /dev/amrd0s1b (133, 131073)
swapon: adding /dev/amrd0s1b as swap device
Automatic boot in progress...
/dev/amrd0s1a:
FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/amrd0s1a:
clean, 17512 free
(232 frags, 2160 blocks, 0.4% fragmentation)
/dev/amrd0s1f:
FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/amrd0s1f:
clean, 108490 free
(322 frags, 13521 blocks, 0.2% fragmentation)
/dev/amrd0s1g:
FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/amrd0s1g:
clean, 11804820 free
(392020 frags, 1426600 blocks, 2.4% fragmentation)
/dev/amrd0s1e:
FILESYSTEM CLEAN; SKIPPING CHECKS
/dev/amrd0s1e:
clean, 314499 free
(21563 frags, 36617 blocks, 4.2% fragmentation)
Doing initial network setup:
  hostname
.
fxp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	inet 134.186.104.10 netmask 0xffffff00 broadcast 134.186.104.255
	ether 00:90:27:b7:09:76
	media: Ethernet 100baseTX <full-duplex>
	status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
	inet 127.0.0.1 netmask 0xff000000
add net default: gateway 134.186.104.62
Additional routing options:
  TCP keepalive=YES
.
Routing daemons:
.
Additional daemons:
  syslogd
.
Checking for core dump:
savecore: no core dump
Doing additional network setup:
.
Starting final network daemons:
.
ELF ldconfig path: /usr/lib /usr/lib/compat /usr/X11R6/lib /usr/local/lib
a.out ldconfig path: /usr/lib/aout /usr/lib/compat/aout /usr/X11R6/lib/aout
Starting standard daemons:
  cron
  sshd
.
Initial rc.i386 initialization:
.
Configuring syscons:
  blanktime
.
Additional ABI support:
.
Starting local daemons:
starting svscan in /service
[1] 96
.
Local package initialization:
[Wed Apr 14 17:53:50 2004] [warn] Loaded DSO libexec/apache/libphp4.so uses plain Apache 1.3 API, this module might crash under EAPI! (please recompile it with -DEAPI)

  apache
Starting clamd
  mysqld
  (skipping samba.sh, not executable)
Starting spamd
  sqwebmaild
  svscan
.
Additional TCP options:
.

Wed Apr 14 17:53:52 PDT 2004
Apr 14 17:55:41 govmail sshd[385]: error: PAM: Authentication failure





