Date:      Wed, 5 Sep 2001 13:30:36 +0200 (CEST)
From:      "Hartmann, O." <ohartman@klima.physik.uni-mainz.de>
To:        <freebsd-smp@freebsd.org>
Cc:        <freebsd-stable@freebsd.org>
Subject:   Spontanous reboot on SMP system FBSD 4.4-RC
Message-ID:  <20010905130056.K26477-100000@klima.physik.uni-mainz.de>

Dear Sirs.

Once again I am seeing a problem on one of our SMP machines that I had not
seen for a long time. We have four servers with dual Intel CPUs here, and one
machine is equipped with a TYAN 2500 (ServerWorks IIIHE chipset) mainboard,
Slot 1, dual 866 PIII, 2GB ECC RAM.

In the early days of FBSD 4.3 I had this problem, too, but it seemed to go
away after a while. At the moment the servers run FBSD 4.4-RC, cvsupped three
days ago (so they are running a recent system).

This server runs a 4-channel AMI Enterprise 1600 RAID controller with over 240
GB of hard disk space. Another server runs the 2-channel version of this controller
in a 32-bit PCI slot without a problem. Earlier responses to my problem pointed
at the AMI controller, but I do not think it is the cause.
The machine is in a big cabinet with two redundant 300W quality power supplies
and a lot of fans for cooling. The internal temperature is never over 38 degrees
Celsius, and the server room is air conditioned. So I am sure that environmental
conditions (e.g. heat) are not the problem.

The kernel of this system is configured 'normally', except that I use the
ISA option
		'options	AUTO_EOI_1'

The other option, AUTO_EOI_2, also works, but only for a while; with it the
server can very likely be forced into a spontaneous reboot.
I use AUTO_EOI_1 because I was told that this option increases performance (?).
On all other systems (one machine is a dual PII 350/GigaByte GA686BXD, one a dual
600MHz KATMAI on an ASUS P2B-D and one is a dual 800EB PIII on an ASUS CUV4X-D) the
option AUTO_EOI_1 works fine and these systems have never had these spontaneous reboots.

The odd thing is that the affected machine never reboots under heavy load or shortly
after being under heavy load. I suspected faulty hardware, but I have never been able
to track down the faulty components.

This machine is used as a computational system for numerical solutions and in addition as
an NFS server. The longest uptime without a reboot was three weeks. The reason the machine
gets rebooted so often is that we do cvsup updates very often, almost every day. For several
campaigns the duty cycle is one week, and the system ran stably over such periods
until yesterday. After one day of uptime it rebooted spontaneously at 10 o'clock in the
morning, a time when it is not under load and no cron jobs are running.

This problem is very serious for me because I cannot rely on this machine
when we start a campaign next month in which we need it for several numerical
simulations (small ones, but they run for a long time).
In the past, as I remember, FreeBSD has quite often suffered from spontaneous reboots on
several hardware platforms, and ServerWorks chipsets seemed to be candidates. But expensive
hardware should be more stable than cheaper hardware, in my opinion (that is the reason why we
spent a lot of money on those systems).

In addition, I am sending you the dmesg output of the running kernel and the mptable output.
I hope someone can help me a little bit.


Many thanks in advance,

Oliver

-------------------------
- dmesg output		-
-------------------------


Copyright (c) 1992-2001 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD 4.4-RC #168: Wed Sep  5 01:22:09 CEST 2001
    root@atmos.physik.uni-mainz.de:/usr/obj/usr/src/sys/ATMOS
Timecounter "i8254"  frequency 1193182 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (868.57-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x683  Stepping = 3
  Features=0x387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,PN,MMX,FXSR,SSE>
real memory  = 2147483648 (2097152K bytes)
avail memory = 2087907328 (2038972K bytes)
Programming 16 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
Programming 16 pins in IOAPIC #1
FreeBSD/SMP: Multiprocessor motherboard
 cpu0 (BSP): apic id:  1, version: 0x00040011, at 0xfee00000
 cpu1 (AP):  apic id:  0, version: 0x00040011, at 0xfee00000
 io0 (APIC): apic id:  2, version: 0x000f0011, at 0xfec00000
 io1 (APIC): apic id:  3, version: 0x000f0011, at 0xfec01000
Preloaded elf kernel "kernel" at 0xc03a0000.
Pentium Pro MTRR support enabled
Using $PIR table, 12 entries at 0xc00fdf00
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
IOAPIC #1 intpin 13 -> irq 2
IOAPIC #1 intpin 12 -> irq 16
IOAPIC #1 intpin 7 -> irq 17
pci0: <PCI bus> on pcib0
pcib3: <PCI to PCI bridge (vendor=1166 device=0005)> at device 0.1 on pci0
IOAPIC #1 intpin 1 -> irq 18
pci1: <PCI bus> on pcib3
pci1: <NVidia Riva TNT2 graphics accelerator> at 0.0 irq 18
sym0: <896> port 0xf800-0xf8ff mem 0xfeafe000-0xfeafffff,0xfeafac00-0xfeafafff irq 2 at device 1.0 on pci0
sym0: Symbios NVRAM, ID 7, Fast-40, LVD, parity checking
sym0: open drain IRQ line driver, using on-chip SRAM
sym0: using LOAD/STORE-based firmware.
sym0: handling phase mismatch from SCRIPTS.
sym1: <896> port 0xf400-0xf4ff mem 0xfeafc000-0xfeafdfff,0xfeafa800-0xfeafabff irq 16 at device 1.1 on pci0
sym1: Symbios NVRAM, ID 7, Fast-40, SE, parity checking
sym1: open drain IRQ line driver, using on-chip SRAM
sym1: using LOAD/STORE-based firmware.
sym1: handling phase mismatch from SCRIPTS.
pcib5: <DEC 21154 PCI-PCI bridge> at device 3.0 on pci0
IOAPIC #1 intpin 2 -> irq 19
pci2: <PCI bus> on pcib5
pcib6: <DEC 21154 PCI-PCI bridge> at device 0.0 on pci2
IOAPIC #1 intpin 0 -> irq 20
pci3: <PCI bus> on pcib6
amr0: <AMI MegaRAID> mem 0xf4000000-0xf7ffffff irq 20 at device 0.0 on pci3
amr0: <Series 471 40 Logical Drive Firmware> Firmware A159, BIOS 3.11, 64MB RAM
pci2: <unknown card> (vendor=0x1077, dev=0x1216) at 1.0 irq 18
pci2: <unknown card> (vendor=0x1077, dev=0x1216) at 2.0 irq 19
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0xfcc0-0xfcff mem 0xfe900000-0xfe9fffff,0xfeaf9000-0xfeaf9fff irq 17 at device 7.0 on pci0
fxp0: Ethernet address 00:e0:81:00:f0:d7
inphy0: <i82555 10/100 media interface> on miibus0
inphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
isab0: <ServerWorks IB6566 PCI to ISA bridge> at device 15.0 on pci0
isa0: <ISA bus> on isab0
pci0: <Unknown PCI ATA controller> at 15.1
pcib1: <ServerWorks NB6536 2.0HE host to PCI bridge> on motherboard
pci4: <PCI bus> on pcib1
pcib2: <ServerWorks host to PCI bridge> on motherboard
pci5: <PCI bus> on pcib2
pcib4: <ServerWorks host to PCI bridge> on motherboard
pci6: <PCI bus> on pcib4
orm0: <Option ROMs> at iomem 0xc0000-0xc9fff,0xca000-0xcdfff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model IntelliMouse, device ID 3
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> on isa0
sc0: VGA <6 virtual consoles, flags=0x200>
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 flags 0x10 on isa0
sio1: type 16550A
ppc0: <Parallel port> at port 0x378-0x37f irq 7 drq 1 flags 0x8 on isa0
ppc0: SMC-like chipset (ECP-only) in ECP mode
ppc0: FIFO with 16/16/8 bytes threshold
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
APIC_IO: Testing 8254 interrupt delivery
APIC_IO: Broken MP table detected: 8254 is not connected to IOAPIC #0 intpin 2
APIC_IO: routing 8254 via 8259 and IOAPIC #0 intpin 0
DUMMYNET initialized (010124)
IP packet filtering initialized, divert enabled, rule-based forwarding enabled, default to deny, unlimited logging
IPsec: Initialized Security Association Processing.
Waiting 4 seconds for SCSI devices to settle
(noperiph:sym0:0:-1:-1): SCSI BUS reset delivered.
(noperiph:sym1:0:-1:-1): SCSI BUS reset delivered.
amrd0: <MegaRAID logical drive> on amr0
amrd0: 245014MB (501788672 sectors) RAID 5 (optimal)
SMP: AP CPU #1 Launched!
sa0 at sym1 bus 0 target 5 lun 0
sa0: <HP C5713A H910> Removable Sequential Access SCSI-2 device
sa0: 40.000MB/s transfers (20.000MHz, offset 31, 16bit)
Mounting root from ufs:/dev/amrd0s1a
ch0 at sym1 bus 0 target 5 lun 1
ch0: <HP C5713A H910> Removable Changer SCSI-2 device
ch0: 40.000MB/s transfers (20.000MHz, offset 31, 16bit)
ch0: 6 slots, 1 drive, 0 pickers, 0 portals
cd0 at sym1 bus 0 target 3 lun 0
cd0: <TEAC CD-ROM CD-532S 1.0A> Removable CD-ROM SCSI-2 device
cd0: 20.000MB/s transfers (20.000MHz, offset 16)
cd0: Attempt to query device size failed: NOT READY, Medium not present
link_elf: symbol splash_register undefined
fxp0: promiscuous mode enabled

-------------------------
- mptable output	-
-------------------------


===============================================================================

MPTable, version 2.0.15

 looking for EBDA pointer @ 0x040e, found, searching EBDA @ 0x0009ec00
 searching CMOS 'top of mem' @ 0x0009e800 (634K)
 searching default 'top of mem' @ 0x0009fc00 (639K)
 searching BIOS @ 0x000f0000

 MP FPS found in BIOS @ physical addr: 0x000f7450

-------------------------------------------------------------------------------

MP Floating Pointer Structure:

  location:			BIOS
  physical address:		0x000f7450
  signature:			'_MP_'
  length:			16 bytes
  version:			1.4
  checksum:			0x4a
  mode:				Virtual Wire

-------------------------------------------------------------------------------

MP Config Table Header:

  physical address:		0x0009ed60
  signature:			'PCMP'
  base table length:		348
  version:			1.4
  checksum:			0xef
  OEM ID:			'INTRGRPH'
  Product ID:			'ZX10        '
  OEM table pointer:		0x00000000
  OEM table size:		0
  entry count:			35
  local APIC address:		0xfee00000
  extended table length:	148
  extended table checksum:	247

-------------------------------------------------------------------------------

MP Config Base Table Entries:

--
Processors:	APIC ID	Version	State		Family	Model	Step	Flags
		 1	 0x11	 BSP, usable	 6	 8	 3	 0x387fbff
		 0	 0x11	 AP, usable	 6	 8	 3	 0x387fbff
--
Bus:		Bus ID	Type
		 0	 PCI
		 1	 PCI
		 2	 PCI
		 3	 PCI
		 4	 PCI
		 5	 ISA
--
I/O APICs:	APIC ID	Version	State		Address
		 2	 0x11	 usable		 0xfec00000
		 3	 0x11	 usable		 0xfec01000
--
I/O Ints:	Type	Polarity    Trigger	Bus ID	 IRQ	APIC ID	PIN#
		ExtINT	active-hi        edge	     5	   0	      2	   0
		INT	active-hi        edge	     5	   1	      2	   1
		INT	active-hi        edge	     5	   0	      2	   2
		INT	active-hi        edge	     5	   3	      2	   3
		INT	active-hi        edge	     5	   4	      2	   4
		INT	active-lo        edge	     5	   5	      2	   5
		INT	active-hi        edge	     5	   6	      2	   6
		INT	active-hi        edge	     5	   7	      2	   7
		INT	active-hi        edge	     5	   8	      2	   8
		INT	active-lo       level	     5	   9	      2	   9
		INT	active-lo        edge	     5	  10	      2	  10
		INT	active-lo        edge	     5	  11	      2	  11
		INT	active-hi        edge	     5	  12	      2	  12
		INT	active-hi        edge	     5	  13	      2	  13
		INT	active-hi        edge	     5	  14	      2	  14
		INT	active-hi        edge	     5	  15	      2	  15
		INT	active-lo       level	     0	 1:A	      3	  13
		INT	active-lo       level	     0	 1:B	      3	  12
		INT	active-lo       level	     3	 0:A	      3	   0
		INT	active-lo       level	     2	 1:A	      3	   1
		INT	active-lo       level	     2	 2:A	      3	   2
		INT	active-lo       level	     0	 7:A	      3	   7
		INT	active-lo       level	     1	 0:A	      3	   1
--
Local Ints:	Type	Polarity    Trigger	Bus ID	 IRQ	APIC ID	PIN#
		ExtINT	active-hi        edge	     5	   0	    255	   0
		NMI	active-hi        edge	     0	 0:A	    255	   1

-------------------------------------------------------------------------------

MP Config Extended Table Entries:

--
System Address Space
 bus ID: 0 address type: I/O address
 address base: 0x0
 address range: 0x10000
--
System Address Space
 bus ID: 0 address type: memory address
 address base: 0x80000000
 address range: 0x74000000
--
System Address Space
 bus ID: 0 address type: prefetch address
 address base: 0xf4000000
 address range: 0x8000000
--
System Address Space
 bus ID: 0 address type: memory address
 address base: 0xfc000000
 address range: 0x2e00000
--
System Address Space
 bus ID: 0 address type: memory address
 address base: 0xfee01000
 address range: 0x11ff000
--
System Address Space
 bus ID: 0 address type: memory address
 address base: 0xa0000
 address range: 0x20000
--
System Address Space
 bus ID: 0 address type: memory address
 address base: 0xd0000
 address range: 0x18000
--
Bus Heirarchy
 bus ID: 5 bus info: 0x01 parent bus ID: 0

===============================================================================

--
Best regards,
O. Hartmann

ohartman@klima.physik.uni-mainz.de
----------------------------------------------------------------
IT-Administration des Institutes fuer Physik der Atmosphaere (IPA)
----------------------------------------------------------------
Johannes Gutenberg Universitaet Mainz
Becherweg 21
55099 Mainz

Tel: +496131/3924662 (machine room)
Tel: +496131/3924144
FAX: +496131/3923532


