Date:      Fri, 22 Mar 96 23:21 WET
From:      uhclem@nemesis.lonestar.org (Frank Durda IV)
To:        hackers@freebsd.org, current@freebsd.org
Cc:        uhclem@nemesis.lonestar.org
Subject:   Crash advice needed APPENDIX B
Message-ID:  <m0u0LlJ-000DYmC@nemesis.lonestar.org>


Thanks for all the responses to my query.  However, it seems
that some other people's anecdotal experiences got merged with my symptoms
and now people think I have hardware I don't have.   I'll try to
clarify the config and respond to as many of the questions as I can.

1.	The system has a 1540B as stated in the original posting,
	not a 1542C or CF.  It is a 1540B.  (No floppy controller on board.)
	I have been using Rev H boards until tonight when I
	switched to a Rev J to see if that makes any difference.


2.	I was using MCODE F3F7 BIOS BC00 throughout most
	of the tests since changing the SCSI card was one of
	the last things I tried.  
	I then switched to a 1540B rev H with MCODE 3054 BIOS BD00.
	Based on 30 hours of operation, this change increased the failure rate
	dramatically, and ALL of the failures with this second
	board were panics IDENTICAL to this one:
	Fatal trap 12: page fault while in kernel mode
	fault virtual address = 0x10
	fault code = supervisor read, page not present
	instruction pointer = 0x8:0xf01953dc	(_Xintr15+some)
	code segment: base=0x0, limit 0xfffff, type=0x1b
	DPL=0, pres 1, def32 1 gran 1
	processor eflags = resume IOPL=0
	current process = Idle
	Interrupt mask = 	(nothing)
	panic: page fault

	syncing disks...  (which fails to occur)

	While this microcode was in place, uptime fell to about three
	hours per crash and there have been four crashes.  (I am not here
	constantly, so the system just sat at one panic for several
	hours.)

	This sharp increase in panics may indicate some sort of
	incompatibility between FreeBSD and this particular revision of
	Adaptec microcode.  According to Adaptec it is the latest firmware,
	although it is dated two or three years ago.   This SCSI issue
	may need serious investigation.
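
	For readers decoding the trap: a supervisor read fault at a small
	address such as 0x10 is the pattern typically produced when kernel
	code reads a structure member through a NULL pointer, because the
	faulting address is then simply the member's offset.  The sketch
	below illustrates the arithmetic with a purely hypothetical
	structure.  The class and field names are assumptions made up for
	illustration, not taken from the FreeBSD interrupt or SCSI driver
	sources, and the sketch does not identify which pointer was NULL
	in this panic.

```python
import ctypes


# Hypothetical layout, invented for illustration only: four 32-bit
# fields followed by a fifth.  It is NOT a real FreeBSD structure.
class HypotheticalRequest(ctypes.Structure):
    _fields_ = [
        ("next", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
        ("length", ctypes.c_uint32),
        ("buffer", ctypes.c_uint32),
        ("status", ctypes.c_uint32),  # fifth 32-bit field
    ]


def fault_address(base_pointer, member_offset):
    """Address touched when a member is read through base_pointer."""
    return base_pointer + member_offset


# Offset of the fifth field, as laid out by ctypes.
status_offset = HypotheticalRequest.status.offset

# Reading that field through a NULL (zero) pointer touches this address.
null_read = fault_address(0, status_offset)

print(hex(status_offset), hex(null_read))
```

	Any member that sits 16 bytes into a structure would give the same
	fault address, so the value narrows the search but does not settle it.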


3.	This evening, I switched to a 1540B Rev J card with a
	MCODE 3054 BIOS BD00 to see if the failure rate goes down
	at least to previous levels.  This is the third SCSI card tried.
	This config has only been up three hours (it has seven hours of
	catch-up so it is busy) but this isn't long enough for any
	conclusions.


4.	The SCSI cabling consists of an internal ribbon-style cable
	(which has now been replaced three times) no longer than 18".
	The sole internal device is connected to the last connector.
	There are only three connectors TOTAL on the cable.

	The external cable is a round shielded cable with molded CHAMP 50
	connectors on each end which came with the Archive 2150eS.
	This cable is approx 6' long.  I do not have another cable
	to try in this position, but as a next step, I will
	completely remove the tape drive from the system and terminate the
	1540B.   

	The 2150eS uses a CHAMP-50 plug terminator, which is
	installed on the back of the drive cabinet and clamped in
	position.

	So total SCSI length is under 8 feet.
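
	The length estimate can be checked with a little arithmetic.  The
	sketch below adds up only the two cables described above; wiring
	inside the tape cabinet is not counted.  The 6 m and 3 m figures
	are the commonly cited maximum lengths for single-ended SCSI at
	5 MHz and at Fast (10 MHz) rates.  They are an assumption brought
	in for comparison and do not come from the measurements in this
	message.

```python
INCHES_PER_FOOT = 12
METRES_PER_INCH = 0.0254

# Figures from the description above.
internal_ribbon_in = 18                    # "no longer than 18 inches"
external_cable_in = 6 * INCHES_PER_FOOT    # "approx 6 feet"

total_in = internal_ribbon_in + external_cable_in
total_ft = total_in / INCHES_PER_FOOT
total_m = total_in * METRES_PER_INCH

# Assumed limits (commonly cited for single-ended SCSI, not from this post).
LIMIT_5MHZ_M = 6.0
LIMIT_FAST_M = 3.0

print(f"total: {total_in} in = {total_ft} ft = {total_m:.3f} m")
print("within assumed 5 MHz limit:", total_m <= LIMIT_5MHZ_M)
print("within assumed Fast limit: ", total_m <= LIMIT_FAST_M)
```

	On these numbers the two cables come to 7.5 feet, so cable length
	alone looks like an unlikely culprit.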


5.	Termination resistors are NOT installed on the 1540B.

	For those curious about 1540B settings:
		J5 1 8 9 11
		J6 1
		J7 2
		J9 2 6 14
	which translates into:
	J5 1  Synchronous Transfer
	   8  DMA 1 (factory setting)
	   9  IRQ select 0 
	   11 IRQ select 2
	(9+11 and 14 in J9 == IRQ15)
	(DMA transfer speed is set to 5.0MB/s since no jumpers are in place)
	J6 1  BIOS Enable
	J7 2  I/O port 0x330
	J9 2  DRQ 5
	   6  DACK 5
	   14 IRQ 15
	   

6.	Someone asked about the settings on the Barracuda, which is a
	ST12550N (not a NW or ND).   The jumpers are:
	J4-9-10  Delay motor start (10 sec * ID)
	J01- 2-4 Terminator power from pin 26 on the SCSI bus.
	The resistor packs are installed (never removed).
	The drive is selected as controller 0.

	If someone thinks drive power for termination would be better,
	I'll try it.  

	In answer to another question, the above settings prevent the
	Barracuda from supplying terminating power to the bus.

	Oh, and the drive has no fear of overheating.  It is mounted
	in a 5.25" bay with 1/2" slot free below and a full slot free
	above.   Brackets and chassis were drilled to increase air flow.
	The system cover has been off since this exercise began.
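
	Pulling together the termination details from items 4, 5 and 6:
	the usual rule is that a SCSI bus is terminated at its two physical
	ends and nowhere else.  The sketch below checks the layout described
	in this message against that rule.  The function name and the
	(name, terminated) tuples are illustrative assumptions, and the
	check says nothing about terminator power or the electrical quality
	of the terminators.

```python
def termination_problems(bus):
    """Return a list of rule violations for a bus given in physical order.

    bus is a list of (device_name, is_terminated) tuples, one per device,
    ordered from one physical end of the cable to the other.
    """
    problems = []
    last = len(bus) - 1
    for position, (name, terminated) in enumerate(bus):
        at_end = position in (0, last)
        if at_end and not terminated:
            problems.append(f"{name}: at a bus end but not terminated")
        elif terminated and not at_end:
            problems.append(f"{name}: terminated but not at a bus end")
    return problems


# Layout as described in this message.
described_bus = [
    ("ST12550N (internal, last ribbon connector)", True),
    ("1540B host adapter", False),
    ("Archive 2150eS (external, terminator plug)", True),
]

# Planned next step: tape drive removed, 1540B terminated.
tape_removed_bus = [
    ("ST12550N (internal, last ribbon connector)", True),
    ("1540B host adapter", True),
]

print(termination_problems(described_bus))
print(termination_problems(tape_removed_bus))
```

	The same check flags the easy mistake of pulling the tape drive
	without installing the resistors on the 1540B, since the adapter
	then becomes a bus end.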


7.	It has been suggested that I remove the cache.  I'll just
	mention that the cache and the board it is plugged into were
	both replaced earlier and there was no change in failure rate.
	Further, this board/cache has no trouble with SCO UNIX,
	Windows '95 and Windows NT which have all run on it previously
	(for weeks at a time under heavy stress-loads using a program
	called "evildisk" which is used to qualify hard disk drives).
	The CPU also ran 1.1.5.1 and 2.0.5 with some crashes, but not
	nearly the volume.  Why would the cache "detect" 2.1.0 and fail
	on cue?

	Granted, FreeBSD 2.0.5 did not like the cache when booting
	a compressed kernel.  It would always fail during uncompression.
	This was fixed in 2.1.0 and I have seen no other solid
	errors of that type.

	However, if other substitutions and removals do not correct
	the problem, I will pull the cache module for a period of time.

	Note that I also have three types of cache module I can try:
	Intel 64K and 128K parts, and IDT's pin-compatible 128K version. The
	IDT version happens to be the kind I have been using to date, not
	the Intel one.  (Yes, I do have lots of parts around.)


8.	Yes, the 486DX CPU has been replaced.  It was exchanged when
	the cache was exchanged.   The problem remained.


9.	Someone suggested the EISA config might be strange.  Just
	a reminder that the motherboard was replaced and with it the
	non-volatile EISA/CMOS settings.  Also, there are no timing
	adjustments available via System Configuration/CMOS tools
	on this system.  The ISA bus runs at 8.33MHz.


10.	Someone suggested reseating all socketed parts.   In effect,
	all boards with parts in them have been replaced, so ALL parts
	have been replaced, except peripherals.


11.	Someone suggested building a new kernel with "options DIAGNOSTIC",
	but didn't mention what this will do to/for me.  I will
	go ahead and do this and switch to it on the next crash,
	which should be about 1AM or 5AM if we stick to our time table.


So there we are.  Perhaps this will eliminate some of the stranger stuff
people were thinking and help focus things down to a short list of
things to try.   Near term, the list appears to be:
	1.	Switch to DIAGNOSTIC kernel.  Won't fix anything, but
		will hopefully scream louder.
	2.	Switch to old BIOS/MCODE on 1540B SCSI adapter.
	3.	Switch SCSI adapter to a different IRQ as this might
		be a problem with spurious interrupts on the master
		controller and the SCSI driver not dealing with them well.
		(Two people suggested getting it off IRQ 15.)
		IRQ 12 is used by the PS/2 mouse, but IRQ 9 is free
		since the network card was removed.  No other IRQs are
		available on the master interrupt controller.
	4.	Abandon Synchronous transfer capability on SCSI adapter.
	5.	Remove cache.  About a 30% system performance hit,
		but I will try it.
	6.	Disconnect floppies.  Might have to leave one to keep
		BIOS from freaking, and this seems like a long shot since
		the floppy drives aren't being used.
	7.	Switch to ALL SCSI.  Expensive at this point in the
		game, and nobody should have to abandon IDE since it
		works fine on many other operating systems.
	8.	Buy a Pentium-class system from Rod.  Ah, yes.  Well that
		might happen once the Triton IIs are stable, but it can't
		happen today.
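
The reasoning behind step 3 can be made concrete.  In the standard PC/AT
arrangement two 8259A interrupt controllers are cascaded, with IRQs 0-7 on
the first and IRQs 8-15 on the second, and an 8259A reports a spurious
interrupt on its own input 7.  That means spurious interrupts show up as
IRQ 7 or IRQ 15.  The sketch below assumes that standard layout; the
function names are illustrative, and it does not establish that spurious
interrupts are actually the cause of these panics.

```python
def pic_location(irq):
    """Map a PC/AT IRQ number to (controller, input line on that controller)."""
    if not 0 <= irq <= 15:
        raise ValueError(f"IRQ out of range: {irq}")
    if irq < 8:
        return ("first", irq)
    return ("second", irq - 8)


def receives_spurious(irq):
    """True if an 8259A would report its spurious interrupts on this IRQ."""
    _, line = pic_location(irq)
    return line == 7


# IRQs mentioned in this message: 15 (current), 12 (PS/2 mouse), 9 (free).
for irq in (15, 12, 9):
    controller, line = pic_location(irq)
    print(f"IRQ {irq}: {controller} controller, input {line}, "
          f"spurious interrupts land here: {receives_spurious(irq)}")
```

Under that assumption, moving the adapter from IRQ 15 to IRQ 9 keeps it on
the same controller but takes it off the input where spurious interrupts
are reported.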

Again, thanks to all who offered suggestions, even the insane ones!  :-)
No, I can't afford a dual-processor Alpha.  Besides, it doesn't run
FreeBSD today!


Frank Durda IV <uhclem@nemesis.lonestar.org>|"Microsoft Support, can I help
or uhclem%nemesis@rwsystr.nkn.net           | you?"  "Yeah, yuck, yuck, I'd
					    | like to talk to Bill."  "Just
or ...letni!rwsys!nemesis!uhclem	    | a minute..."  




