Date:      Thu, 21 Mar 96 21:24 WET
From:      uhclem@nemesis.lonestar.org (Frank Durda IV)
To:        hackers@freebsd.org.current@freebsd.org
Cc:        uhclem@nemesis.lonestar.org
Subject:   Crash advice needed (long)
Message-ID:  <m0tzxTA-000CqnC@nemesis.lonestar.org>

Ok, I have run out of ideas.   I welcome suggestions by private EMAIL
(no need to clog the mailing list) on how to proceed next.

I have a 486DX-33 system running FreeBSD 2.1.0.  It crashes, usually
about five times a week.  Some weeks more frequently, some less.
Max and min crashes for a week are around 20 and 3.  The crashes occur
at random, although they seem to cluster somewhat around 7AM, 1PM
and 1AM.  (It isn't a power problem, see below.)
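
One way to test the time-of-day hunch is to count kernel boot banners in
the syslog file by hour.  The sketch below assumes every restart logs a
banner line in the format shown in the probe output later in this
message.  The second sample line is hypothetical and is there only to
exercise the code.  A boot banner marks the restart, not the crash, so
after a hang it reflects when RESET was pressed.

```python
import re
from collections import Counter

# Matches the kernel boot banner as syslog records it, for example:
# "Mar 21 20:03:19 nemesis /kernel: FreeBSD 2.1.0-RELEASE #0: ..."
BOOT_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+/kernel: FreeBSD "
)

def boots_by_hour(lines):
    """Count boot banners per hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        m = BOOT_RE.match(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

SAMPLE = [
    # Real banner line from this system's log.
    "Mar 21 20:03:19 nemesis /kernel: FreeBSD 2.1.0-RELEASE #0: Sun Nov 26 22:35:06 CST 1995",
    # Ordinary probe line: must not be counted as a boot.
    "Mar 21 20:03:19 nemesis /kernel: CPU: i486DX (486-class CPU)",
    # HYPOTHETICAL line, invented for illustration only.
    "Mar 22 07:10:44 nemesis /kernel: FreeBSD 2.1.0-RELEASE #0: Sun Nov 26 22:35:06 CST 1995",
]

if __name__ == "__main__":
    for hour, n in sorted(boots_by_hour(SAMPLE).items()):
        print(f"{hour:02d}:00  {n}")
```

Fed the real log file instead of SAMPLE, a pile-up at hours 7, 13 and 1
would support the pattern; a flat spread would not.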

I had crashes when the system ran 1.1.5.1, but they were more like four
a month.  There was a definite increase when 2.1.0 came on the scene,
and there were no changes in the hardware at that time.

Anyway, I was positive that 2.1 was interacting with flaky hardware
and that was the true cause, rather than an actual problem in 2.1.0.
Now I am not so sure.  Read on.

Here is the precise hardware configuration:
1.	System DEC 1027 486-33DX with 128K 485-Turbocache module
	EISA card cage, but no EISA cards present.
	(It will run with DX2/DX4, but I have removed these to
	 simplify the test.)
2.	Memory card with 12 Meg Parity RAM (Parity required on system).
3.	Primary hard disk 	Western Digital 2540	540Meg IDE
	small DOS partition, rest of drive used for / /usr /usr/src swap.
	System is old and only has IDE, not EIDE.  BIOS doesn't
	understand "big drives", so DOS partition is first, followed by
	root, swap, usr /usr/src.
4.	Adaptec 1540B, latest BIOS/MCODE (upgraded yesterday to
	BD00 BIOS, 3054 MCODE)  adapter is not terminated.
	(Was running MCODE F3F7 and BIOS BC00.)
	SCSI ID #0 is Seagate 2GB Barracuda (terminated)
	SCSI ID #2 is external Archive 2150eS tape drive
	(also terminated).
5.	Sio 0 16550A (StarTech 16550[A]) IRQ4.
	Sio 1 16550A (StarTech 16550[A]) IRQ3.
	Sio 2 16550A (NS16552) IRQ 10.
	Sio 3 16550A (NS16552) IRQ 11.
	Sio0/Sio1 are on a STB dual-serial adapter with socketed UARTS.
	Sio2/Sio3 are the MLB ports.
6.	One parallel port but no printer attached.
7.	NEC N82077A Floppy Disk controller (mis-identified by probe as
	a NEC 72065).  One 2.88Meg floppy (BIOS set to 1.44 Meg because
	the 2.88 setting (type 6) confuses the FreeBSD install floppy).
	Second 1.44 Meg floppy also set to 1.44Meg in BIOS.
8.	Video is a WD90C31-based video card (copy of WD reference design)
9.	A WD8013E network card was removed during the tests.
10.	System and peripherals powered via APC 1400 SmartUPS.
11.	Two Telebit WorldBlazers and a Cardinal V.34 external modem,
	all connected to SmartUPS.  
12.	We are in a drought here so there have been no electrical storms
	of any kind and all telephone lines are underground all the
	way to the CO.

The system activities consist of news (Cnews), UUCP (2.1.0 release),
and mail forwarding/delivery (SMAIL).  X is not run on this system
and I am the only login user, apart from uucico.  There are no CD-ROM
drives present.


The crashes manifest themselves in three ways:

1.	System just reboots - screen clears, BIOS messages seen,
	system restarts.   No panic or other messages.  If you
	are in the room and the monitor is on, you hear the relays in the
	monitor go click-click and you look up and see the BIOS messages.
	Has happened while using the console - was typing a letter
	and the screen simply cleared and the system rebooted.

2.	System hangs.  Does not respond to keystrokes or
	network activity (when network card is present).
	IDE activity light is almost always (but not 100% of the time)
	on SOLID when lock-up is discovered.  DTR is left high on the
	modems.

3.	System panics.  Almost always a Page Not Present error.
	The addresses vary, but a frequently occurring location is in
	and around 0xf0195d3c.
	According to nm /kernel, that is:
		f0195c90 T _Xintr14
		f0195cd8 t Xresume14
		f0195d30 T _Xintr15
	somewhere in here...
		f0195d78 t Xresume15
		f0195dd0 t _doreti
		f0195dd4 t doreti_next
	On this system, IRQ 15 is used by the SCSI adapter.
	IRQ 14 is used by the IDE:	(taken from the most recent reboot)


Mar 21 20:03:19 nemesis /kernel: FreeBSD 2.1.0-RELEASE #0: Sun Nov 26 22:35:06 CST 1995
Mar 21 20:03:19 nemesis /kernel:     root@nemesis.lonestar.org:/usr/src/sys/compile/NEMESIS
Mar 21 20:03:19 nemesis /kernel: CPU: i486DX (486-class CPU)
Mar 21 20:03:19 nemesis /kernel: real memory  = 12910592 (12608K bytes)
Mar 21 20:03:19 nemesis /kernel: avail memory = 11132928 (10872K bytes)
Mar 21 20:03:19 nemesis /kernel: Probing for devices on the ISA bus:
Mar 21 20:03:19 nemesis /kernel: sc0 at 0x60-0x6f irq 1 on motherboard
Mar 21 20:03:19 nemesis /kernel: sc0: VGA color <16 virtual consoles, flags=0x0>
Mar 21 20:03:19 nemesis /kernel: ed0 not found at 0x280
Mar 21 20:03:19 nemesis /kernel: sio0 at 0x3f8-0x3ff irq 4 on isa
Mar 21 20:03:20 nemesis /kernel: sio0: type 16550A
Mar 21 20:03:20 nemesis /kernel: sio1 at 0x2f8-0x2ff irq 3 on isa
Mar 21 20:03:20 nemesis /kernel: sio1: type 16550A
Mar 21 20:03:20 nemesis /kernel: sio2 at 0x3e8-0x3ef irq 10 on isa
Mar 21 20:03:20 nemesis /kernel: sio2: type 16550A
Mar 21 20:03:20 nemesis /kernel: sio3 at 0x2e8-0x2ef irq 11 on isa
Mar 21 20:03:20 nemesis /kernel: sio3: type 16550A
Mar 21 20:03:20 nemesis /kernel: lpt0 at 0x3bc-0x3c3 irq 7 on isa
Mar 21 20:03:20 nemesis /kernel: lpt0: Interrupt-driven port
Mar 21 20:03:20 nemesis /kernel: lp0: TCP/IP capable interface
Mar 21 20:03:20 nemesis /kernel: lpt1 not found at 0xffffffff
Mar 21 20:03:20 nemesis /kernel: psm0 not found at 0x60
Mar 21 20:03:20 nemesis /kernel: fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
Mar 21 20:03:20 nemesis /kernel: fdc0: NEC 72065B
Mar 21 20:03:20 nemesis /kernel: fd0: 1.44MB 3.5in
Mar 21 20:03:20 nemesis /kernel: fd1: 1.44MB 3.5in
Mar 21 20:03:20 nemesis /kernel: wdc0 at 0x1f0-0x1f7 irq 14 on isa
Mar 21 20:03:20 nemesis /kernel: wdc0: unit 0 (wd0): <WDC AC2540H>
Mar 21 20:03:20 nemesis /kernel: wd0: 515MB (1056384 sectors), 1048 cyls, 16 heads, 63 S/T, 512 B/S
Mar 21 20:03:20 nemesis /kernel: aha0 at 0x330-0x333 irq 15 drq 5 on isa
Mar 21 20:03:20 nemesis /kernel: aha0 waiting for scsi devices to settle
Mar 21 20:03:20 nemesis /kernel: (aha0:0:0): "SEAGATE ST12550N 0013" type 0 fixed SCSI 2
Mar 21 20:03:20 nemesis /kernel: sd0(aha0:0:0): Direct-Access 2040MB (4178874 512 byte sectors)
Mar 21 20:03:20 nemesis /kernel: (aha0:2:0): "ARCHIVE VIPER 150  21247 -005" type 1 removable SCSI 1
Mar 21 20:03:21 nemesis /kernel: st0(aha0:2:0): Sequential-Access st0: Archive  Viper 150 is a known rogue
Mar 21 20:03:21 nemesis /kernel: density code 0x0,  drive empty
Mar 21 20:03:21 nemesis /kernel: matcdc0 not found at 0x230
Mar 21 20:03:21 nemesis /kernel: scd0 not found at 0x230
Mar 21 20:03:21 nemesis /kernel: npx0 on motherboard
Mar 21 20:03:21 nemesis /kernel: npx0: INT 16 interface
Mar 21 20:03:21 nemesis /kernel: WARNING: / was not properly dismounted.


	When the system panics, it usually displays the panic,
	then it may hang and not reboot by itself.
	Today's panic was TRAP 12,  virtual addr 0xff1c
	IP 8:0xf0195d2c
	Idle		<----That state is very common in the crashes I see
	Page Fault
	and it hung for ten hours.

	Sometimes after the panic it will say it is syncing disks, and
	start displaying a long string of numbers (a full line of them on
	the screen, sometimes multiple lines) and then hang.
	Sometimes I will find it in this state and the SCSI drive will be
	seeking back and forth endlessly.  If you unplug the SCSI cable from
	the drive, the seeking will stop, so it isn't the drive doing it
	on its own.
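
The lookup from a panic address to the nearest preceding symbol can be
done mechanically.  This is a minimal sketch that uses only the six
nm /kernel lines quoted above; it assumes that excerpt has no symbols
missing in between, which the full nm output would have to confirm.
The helper names (load_symbols, resolve) are made up for this example.

```python
import bisect

# Excerpt of "nm /kernel" output exactly as quoted in this message.
NM_TEXT = """\
f0195c90 T _Xintr14
f0195cd8 t Xresume14
f0195d30 T _Xintr15
f0195d78 t Xresume15
f0195dd0 t _doreti
f0195dd4 t doreti_next
"""

def load_symbols(nm_text):
    """Parse 'address type name' lines into a sorted (address, name) list."""
    syms = []
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            syms.append((int(parts[0], 16), parts[2]))
    syms.sort()
    return syms

def resolve(syms, addr):
    """Return (name, offset) of the last symbol at or below addr, or None."""
    addrs = [a for a, _ in syms]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = syms[i]
    return name, addr - base

if __name__ == "__main__":
    syms = load_symbols(NM_TEXT)
    for addr in (0xF0195D3C, 0xF0195D2C):
        name, off = resolve(syms, addr)
        print(f"{addr:#x} -> {name}+{off:#x}")
```

With this excerpt, 0xf0195d3c resolves to _Xintr15+0xc, while today's
IP of 0xf0195d2c falls just below _Xintr15, at Xresume14+0x54.  If the
excerpt is complete, that puts today's fault on the IRQ 14 side rather
than the IRQ 15 side.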


The crashes where the system does not restart itself are the real
killers, such as today when the system was down for ten hours
before I could get to it to press RESET.

Also note that the kernel is essentially a stock 2.1.0 kernel
(built in November) that has had unneeded drivers removed.
No other code changes/fixes have been added.   (If someone thinks the
stock kernel is worth trying, I'll try it.)


What I have tried:

About two weeks ago, each time I discovered the system had crashed,
I replaced a piece of hardware on the system, hoping to discover the
hardware flaw.  I have three identical computer systems (except peripherals)
so I was able to do this.  When each substitution was made, all previous
substitutions were left in place.  Here are the substitutions that
were done:

1.	Remove network card.  No improvement.
2.	Replace Chassis, backplane & power supply (this also replaces floppy,
	IDE, two of the four serial ports and IDE cable).  (CPU/Cache/Memory
	are not on the backplane in this computer and were not changed.)
	No improvement.
3.	Replace CPU card, including cache module and processor chip.
	No improvement.
4.	Replace memory card with a different card carrying a completely
	different set of 12Meg of RAM SIMMs.  No improvement.
5.	Replace video card.  No improvement.
6.	Replace 1540B SCSI adapter.  No improvement.
7.	Upgrade BIOS/MCODE on 1540B SCSI adapter.  No improvement.
8.	Replace internal SCSI cable.  No improvement.
9.	Lower room temperature by 5F, to 64F.  No improvement and I am getting
	cold.

Now I have a pile of cards, a gutted chassis and the system is crashing about
as often as it was before I started exchanging parts.  At this point, only
the hard drives, tape drive, floppy drives and modems in this system have
not been changed and the crashes persist.  Arggh!

I know that because I have an IDE drive in the system some will 
immediately blame it as the cause, and that doesn't surprise me, particularly
with the IDE activity light being left on, but it doesn't explain the panic
in the IRQ 15 code, or the SCSI drive doing the waltz after a panic, etc.
Do we have some bugs in the IDE hard disk driver that are new to 2.1.0?
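
Since the panic addresses land in interrupt stubs, it helps to have the
IRQ-to-device table in front of you.  A rough sketch that builds it from
the probe lines quoted earlier in this message follows.  The regular
expression is my own guess at the line format and only covers lines of
the form "<dev> at 0x... irq N"; irq_map is a made-up helper name.

```python
import re

# "<dev> at 0x<port>[-0x<port>] irq <n>" as printed by the ISA probe.
PROBE_RE = re.compile(
    r"/kernel: (\w+) at 0x[0-9a-f]+(?:-0x[0-9a-f]+)? irq (\d+)"
)

# Probe lines copied from the boot log in this message.
PROBE_LINES = [
    "Mar 21 20:03:19 nemesis /kernel: sc0 at 0x60-0x6f irq 1 on motherboard",
    "Mar 21 20:03:19 nemesis /kernel: ed0 not found at 0x280",
    "Mar 21 20:03:19 nemesis /kernel: sio0 at 0x3f8-0x3ff irq 4 on isa",
    "Mar 21 20:03:20 nemesis /kernel: sio1 at 0x2f8-0x2ff irq 3 on isa",
    "Mar 21 20:03:20 nemesis /kernel: sio2 at 0x3e8-0x3ef irq 10 on isa",
    "Mar 21 20:03:20 nemesis /kernel: sio3 at 0x2e8-0x2ef irq 11 on isa",
    "Mar 21 20:03:20 nemesis /kernel: lpt0 at 0x3bc-0x3c3 irq 7 on isa",
    "Mar 21 20:03:20 nemesis /kernel: fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa",
    "Mar 21 20:03:20 nemesis /kernel: wdc0 at 0x1f0-0x1f7 irq 14 on isa",
    "Mar 21 20:03:20 nemesis /kernel: aha0 at 0x330-0x333 irq 15 drq 5 on isa",
]

def irq_map(lines):
    """Map IRQ number -> list of device names that probed on it."""
    table = {}
    for line in lines:
        m = PROBE_RE.search(line)
        if m:
            table.setdefault(int(m.group(2)), []).append(m.group(1))
    return table

if __name__ == "__main__":
    table = irq_map(PROBE_LINES)
    for irq in sorted(table):
        shared = "  <-- shared" if len(table[irq]) > 1 else ""
        print(f"irq {irq:2d}: {', '.join(table[irq])}{shared}")
```

On these lines no IRQ is shared, with wdc0 alone on 14 and aha0 alone
on 15, so a simple IRQ conflict between the IDE and SCSI controllers
does not show up in the probe output.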

So, if you can think of anything else to disconnect, change or do
without, let me know.   But I have just about eliminated hardware as
a cause.   Thanks for the review.


Frank Durda IV <uhclem@nemesis.lonestar.org>|"The Knights who say "LETNi"
or uhclem%nemesis@rwsystr.nkn.net           | demand...  A SEGMENT REGISTER!!!"
					    |"A what?"
or ...letni!rwsys!nemesis!uhclem	    |"LETNi! LETNi! LETNi!"  - 1983




