Date: Fri, 22 Mar 96 23:21 WET
From: uhclem@nemesis.lonestar.org (Frank Durda IV)
To: hackers@freebsd.org, current@freebsd.org
Cc: uhclem@nemesis.lonestar.org
Subject: Crash advice needed APPENDIX B
Message-ID: <m0u0LlJ-000DYmC@nemesis.lonestar.org>

Thanks for all the responses to my query. However, it seems that some other people's anecdotal experiences got merged with my symptoms, and now people think I have hardware I don't have. I'll try to clarify the config and respond to as many of the questions as I can.

1.  The system has a 1540B as stated in the original posting, not a 1542C or CF. It is a 1540B. (No floppy controller on board.) I have been using Rev H boards until tonight, when I switched to a Rev J to see if that makes any difference.

2.  I was using MCODE F3F7 BIOS BC00 throughout most of the tests, since changing the SCSI card was one of the last things I tried. I then switched to a 1540B rev H with MCODE 3054 BIOS BD00. Based on 30 hours, this change increased the failure rate dramatically, and ALL of the failures with this second board were panics IDENTICAL to this one:

        Fatal trap 12: page fault while in kernel mode
        fault virtual address   = 0x10
        fault code              = supervisor read, page not present
        instruction pointer     = 0x8:0xf01953dc (_Xintr15+some)
        code segment: base=0x0, limit 0xfffff, type=0x1b
                      DPL=0, pres 1, def32 1 gran 1
        processor eflags        = resume IOPL=0
        current process         = Idle
        Interrupt mask          = (nothing)
        panic: page fault
        syncing disks... (which fails to occur)

    While this microcode was in place, uptime fell to about three hours per crash, and there have been four crashes. (I am not here constantly, so the system just sat at one panic for several hours.)

    This sharp increase in panics may indicate some sort of incompatibility between FreeBSD and this particular revision of Adaptec microcode. According to Adaptec it is the latest firmware, although it is dated two or three years ago. This SCSI issue may need serious investigation.

3.  This evening, I switched to a 1540B Rev J card with MCODE 3054 BIOS BD00 to see if the failure rate goes down at least to previous levels. This is the third SCSI card tried.
    This config has only been up three hours (it has seven hours of catch-up to do, so it is busy), which isn't long enough for any conclusions.

4.  The SCSI cabling consists of an internal ribbon-style cable (which has now been replaced three times) no longer than 18". The sole internal device is connected to the last connector. There are only three connectors TOTAL on the cable.

    The external cable is a round shielded cable with molded CHAMP-50 connectors on each end, which came with the Archive 2150eS. This cable is approx 6' long. I do not have another cable to try in this position, but as a next step, I will completely remove the tape drive from the system and terminate the 1540B. The 2150eS uses a CHAMP-50 plug terminator, which is installed on the back of the drive cabinet and clamped in position.

    So total SCSI length is under 8 feet.

5.  Termination resistors are NOT installed on the 1540B. For those curious about 1540B settings:

        J5  1 8 9 11
        J6  1
        J7  2
        J9  2 6 14

    which translates into:

        J5   1  Synchronous Transfer
             8  DMA 1 (factory setting)
             9  IRQ select 0
            11  IRQ select 2  (9+11 and 14 in J9 == IRQ15)
                (DMA transfer speed is set to 5.0MB/s since no jumpers are in place)
        J6   1  BIOS Enable
        J7   2  I/O port 0x330
        J9   2  DRQ 5
             6  DACK 5
            14  IRQ 15

6.  Someone asked about the settings on the Barracuda, which is a ST12550N (not a NW or ND). The jumpers are:

        J4-9-10     Delay motor start (10 sec * ID)
        J01- 2-4    Terminator power from pin 26 on the SCSI bus.

    The resistor packs are installed (never removed). The drive is selected as controller 0.

    If someone thinks drive power for termination would be better, I'll try it. In answer to another question, the above settings prevent the Barracuda from supplying terminating power to the bus.

    Oh, and the drive has no fear of overheating. It is mounted in a 5.25" bay with a 1/2" slot free below and a full slot free above. Brackets and chassis were drilled to increase air flow. The system cover has been off since this exercise began.

7.  It has been suggested that I remove the cache. I'll just mention that the cache and the board it is plugged into were both replaced earlier and there was no change in failure rate. Further, this board/cache has no trouble with SCO UNIX, Windows '95 and Windows NT, which have all run on it previously (for weeks at a time under heavy stress-loads using a program called "evildisk", which is used to qualify hard disk drives). The CPU also ran 1.1.5.1 and 2.0.5 with some crashes, but not nearly the volume. Why would the cache "detect" 2.1.0 and fail on cue?

    Granted, FreeBSD 2.0.5 did not like the cache when booting a compressed kernel. It would always fail during uncompression. This was fixed in 2.1.0 and I have seen no other solid errors of that type.

    However, if other substitutions and removals do not correct the problem, I will pull the cache module for a period of time. Note that I also have three types of cache module I can try: Intel 64K and 128K parts, and IDT's pin-compatible 128K version. The IDT version happens to be the kind I have been using to date, not the Intel one. (Yes, I do have lots of parts around.)

8.  Yes, the 486DX CPU has been replaced. It was exchanged when the cache was exchanged. The problem remained.

9.  Someone suggested the EISA config might be strange. Just a reminder that the motherboard was replaced, and with it the non-volatile EISA/CMOS settings. Also, there are no timing adjustments available via System Configuration/CMOS tools on this system. The ISA bus runs at 8.33MHz.

10. Someone suggested reseating all socketed parts. In effect, all boards with parts in them have been replaced, so ALL parts have been replaced, except peripherals.

11. Someone suggested building a new kernel with "options DIAGNOSTIC", but didn't mention what this will do to/for me. I will go ahead and do this and switch to it on the next crash, which should be about 1AM or 5AM if we stick to our timetable.
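
For anyone reading the "fault code" line in the panic from item 2, here is a minimal sketch of how a line like that can be derived from the x86 page-fault error code. The three low bits are the standard x86 ones; the constant names and the function name are only illustrative, and this is not taken from the kernel source.

```python
# Low three bits of the x86 page-fault error code.
PGEX_P = 0x01  # set: protection violation; clear: page not present
PGEX_W = 0x02  # set: write access;         clear: read access
PGEX_U = 0x04  # set: fault in user mode;   clear: supervisor mode


def describe_fault_code(code: int) -> str:
    """Render a page-fault error code in the style of the panic message.

    describe_fault_code is a hypothetical helper for illustration only.
    """
    mode = "user" if code & PGEX_U else "supervisor"
    access = "write" if code & PGEX_W else "read"
    cause = "protection violation" if code & PGEX_P else "page not present"
    return f"{mode} {access}, {cause}"


if __name__ == "__main__":
    # An error code with all three bits clear matches the panic above.
    print(describe_fault_code(0))
```

Under these assumptions, the panic reported above corresponds to an error code with all three bits clear: a kernel-mode read of an unmapped page.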

So there we are. Perhaps this will eliminate some of the stranger stuff people were thinking and help focus things down to a short list of things to try. Near term, the list appears to be:

1.  Switch to DIAGNOSTIC kernel. Won't fix anything, but will hopefully scream louder.

2.  Switch to old BIOS/MCODE on 1540B SCSI adapter.

3.  Switch SCSI adapter to a different IRQ, as this might be a problem with a spurious interrupt on the master controller and the SCSI driver not dealing with it well. (Two people suggested getting it off IRQ 15.) IRQ 12 is used by the PS/2 mouse, but IRQ 9 is free since the network card was removed. No other IRQs are available on the master interrupt controller.

4.  Abandon Synchronous transfer capability on SCSI adapter.

5.  Remove cache. About a 30% system performance hit, but I will try it.

6.  Disconnect floppies. Might have to leave one to keep BIOS from freaking, and this seems like a long shot since the floppy drives aren't being used.

7.  Switch to ALL SCSI. Expensive at this point in the game, and nobody should have to abandon IDE since it works fine on many other operating systems.

8.  Buy Pentium-class system from Rod. Ah, yes. Well, that might happen once the Triton IIs are stable, but it can't happen today.

Again, thanks to all who offered suggestions, even the insane ones! :-) No, I can't afford a dual-processor Alpha. Besides, it doesn't run FreeBSD today!

Frank Durda IV <uhclem@nemesis.lonestar.org>|"Microsoft Support, can I help
or uhclem%nemesis@rwsystr.nkn.net           | you?"  "Yeah, yuck, yuck, I'd
                                            | like to talk to Bill."  "Just
or ...letni!rwsys!nemesis!uhclem            | a minute..."