Date: Thu, 21 Mar 96 21:24 WET
From: uhclem@nemesis.lonestar.org (Frank Durda IV)
To: hackers@freebsd.org.current@freebsd.org
Cc: uhclem@nemesis.lonestar.org
Subject: Crash advice needed (long)
Message-ID: <m0tzxTA-000CqnC@nemesis.lonestar.org>

Ok, I have run out of ideas. I welcome suggestions by private EMAIL (no need to clog the mailing list) on how to proceed next.

I have a 486DX-33 system running FreeBSD 2.1.0. It crashes, usually about five times a week. Some weeks more frequently, some less. Max and min crashes for a week are around 20 and 3. The crashes occur at random, although they seem to cluster around 7AM, 1PM and 1AM. (It isn't a power problem, see below.)

I had crashes when the system ran 1.1.5.1, but they were more like four a month. There was a definite increase when 2.1.0 came on the scene, and there were no changes in the hardware at that time. Anyway, I was positive that 2.1 was interacting with flaky hardware and that was the true cause, rather than an actual problem in 2.1.0. Now I am not so sure. Read on.

Here is the precise hardware configuration:

1.  System DEC 1027 486-33DX with 128K 485-Turbocache module.
    EISA card cage, but no EISA cards present.
    (It will run with DX2/DX4, but I have removed these to simplify the test.)

2.  Memory card with 12 Meg Parity RAM (Parity required on system).

3.  Primary hard disk Western Digital 2540 540Meg IDE.
    Small DOS partition, rest of drive used for / /usr /usr/src swap.
    System is old and only has IDE, not EIDE. BIOS doesn't understand
    "big drives", so DOS partition is first, followed by root, swap,
    /usr, /usr/src.

4.  Adaptec 1540B, latest BIOS/MCODE (upgraded yesterday to BD00 BIOS,
    3054 MCODE). Adapter is not terminated.
    (Was running MCODE F3F7 and BIOS BC00.)
    SCSI Controller #0 is Seagate 2GB Barracuda (terminated).
    SCSI Controller #2 is external Archive 2150eS tape drive (also terminated).

5.  Sio 0 16550A (StarTech 16550[A]) IRQ 4.
    Sio 1 16550A (StarTech 16550[A]) IRQ 3.
    Sio 2 16550A (NS16552) IRQ 11.
    Sio 3 16550A (NS16552) IRQ 10.
    Sio0/Sio1 are on a STB dual-serial adapter with socketed UARTs.
    Sio2/Sio3 are the MLB ports.

6.  One parallel port but no printer attached.

7.  NEC N82077A Floppy Disk controller (mis-identified by probe as a NEC 72065).
    One 2.8Meg floppy (BIOS set to 1.44 Meg because the 2.8 setting (type 6)
    confuses the FreeBSD install floppy).
    Second 1.44 Meg floppy, also set to 1.44Meg in BIOS.

8.  Video is a WD90C31-based video card (copy of WD reference design).

9.  A WD8013E network card was removed during the tests.

10. System and peripherals powered via APC 1400 SmartUPS.

11. Two Telebit WorldBlazers and a Cardinal V.34 external modem,
    all connected to SmartUPS.

12. We are in a drought here so there have been no electrical storms
    of any kind and all telephone lines are underground all the way
    to the CO.

The system activities consist of news (Cnews), UUCP (2.1.0 release), and mail forwarding/delivery (SMAIL). X is not run on this system and I am the only login user, apart from uucico. There are no CD-ROM drives present.

The crashes manifest themselves in three ways:

1.  System just reboots - screen clears, BIOS messages seen, system restarts. No panic or other messages. If you are in the room and the monitor is on, you hear the relays in the monitor go click-click and you look up and see the BIOS messages. This has happened while using the console - I was typing a letter and the screen simply cleared and the system rebooted.

2.  System hangs. Does not respond to keystrokes or network activity (when network card is present). IDE activity light is almost always (but not 100% of the time) on SOLID when the lock-up is discovered. DTR is left high on the modems.

3.  System panics. Almost always a Page Not Present error. The addresses vary, but a frequently occurring location is in and around 0xf0195d3c. According to nm /kernel, that is:

        f0195c90 T _Xintr14
        f0195cd8 t Xresume14
        f0195d30 T _Xintr15        somewhere in here...
        f0195d78 t Xresume15
        f0195dd0 t _doreti
        f0195dd4 t doreti_next

On this system, IRQ 15 is used by the SCSI adapter.
IRQ 14 is used by the IDE (taken from the most recent reboot):

Mar 21 20:03:19 nemesis /kernel: FreeBSD 2.1.0-RELEASE #0: Sun Nov 26 22:35:06 CST 1995
Mar 21 20:03:19 nemesis /kernel: root@nemesis.lonestar.org:/usr/src/sys/compile/NEMESIS
Mar 21 20:03:19 nemesis /kernel: CPU: i486DX (486-class CPU)
Mar 21 20:03:19 nemesis /kernel: real memory = 12910592 (12608K bytes)
Mar 21 20:03:19 nemesis /kernel: avail memory = 11132928 (10872K bytes)
Mar 21 20:03:19 nemesis /kernel: Probing for devices on the ISA bus:
Mar 21 20:03:19 nemesis /kernel: sc0 at 0x60-0x6f irq 1 on motherboard
Mar 21 20:03:19 nemesis /kernel: sc0: VGA color <16 virtual consoles, flags=0x0>
Mar 21 20:03:19 nemesis /kernel: ed0 not found at 0x280
Mar 21 20:03:19 nemesis /kernel: sio0 at 0x3f8-0x3ff irq 4 on isa
Mar 21 20:03:20 nemesis /kernel: sio0: type 16550A
Mar 21 20:03:20 nemesis /kernel: sio1 at 0x2f8-0x2ff irq 3 on isa
Mar 21 20:03:20 nemesis /kernel: sio1: type 16550A
Mar 21 20:03:20 nemesis /kernel: sio2 at 0x3e8-0x3ef irq 10 on isa
Mar 21 20:03:20 nemesis /kernel: sio2: type 16550A
Mar 21 20:03:20 nemesis /kernel: sio3 at 0x2e8-0x2ef irq 11 on isa
Mar 21 20:03:20 nemesis /kernel: sio3: type 16550A
Mar 21 20:03:20 nemesis /kernel: lpt0 at 0x3bc-0x3c3 irq 7 on isa
Mar 21 20:03:20 nemesis /kernel: lpt0: Interrupt-driven port
Mar 21 20:03:20 nemesis /kernel: lp0: TCP/IP capable interface
Mar 21 20:03:20 nemesis /kernel: lpt1 not found at 0xffffffff
Mar 21 20:03:20 nemesis /kernel: psm0 not found at 0x60
Mar 21 20:03:20 nemesis /kernel: fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
Mar 21 20:03:20 nemesis /kernel: fdc0: NEC 72065B
Mar 21 20:03:20 nemesis /kernel: fd0: 1.44MB 3.5in
Mar 21 20:03:20 nemesis /kernel: fd1: 1.44MB 3.5in
Mar 21 20:03:20 nemesis /kernel: wdc0 at 0x1f0-0x1f7 irq 14 on isa
Mar 21 20:03:20 nemesis /kernel: wdc0: unit 0 (wd0): <WDC AC2540H>
Mar 21 20:03:20 nemesis /kernel: wd0: 515MB (1056384 sectors), 1048 cyls, 16 heads, 63 S/T, 512 B/S
Mar 21 20:03:20 nemesis /kernel: aha0 at 0x330-0x333 irq 15 drq 5 on isa
Mar 21 20:03:20 nemesis /kernel: aha0 waiting for scsi devices to settle
Mar 21 20:03:20 nemesis /kernel: (aha0:0:0): "SEAGATE ST12550N 0013" type 0 fixed SCSI 2
Mar 21 20:03:20 nemesis /kernel: sd0(aha0:0:0): Direct-Access 2040MB (4178874 512 byte sectors)
Mar 21 20:03:20 nemesis /kernel: (aha0:2:0): "ARCHIVE VIPER 150 21247 -005" type 1 removable SCSI 1
Mar 21 20:03:21 nemesis /kernel: st0(aha0:2:0): Sequential-Access st0: Archive Viper 150 is a known rogue
Mar 21 20:03:21 nemesis /kernel: density code 0x0, drive empty
Mar 21 20:03:21 nemesis /kernel: matcdc0 not found at 0x230
Mar 21 20:03:21 nemesis /kernel: scd0 not found at 0x230
Mar 21 20:03:21 nemesis /kernel: npx0 on motherboard
Mar 21 20:03:21 nemesis /kernel: npx0: INT 16 interface
Mar 21 20:03:21 nemesis /kernel: WARNING: / was not properly dismounted.

When the system panics, it usually displays the panic, then it may hang and not reboot by itself. Today's panic was

        TRAP 12, virtual addr 0xff1c
        IP 8:0xf0195d2c
        Idle       <---- That state is very common in the crashes I see
        Page Fault

and it hung for ten hours.

Sometimes after the panic it will say it is syncing disks, start displaying a long string of numbers (a full line of them on the screen, sometimes multiple lines) and then hang. Sometimes I will find it in this state and the SCSI drive will be seeking back and forth endlessly. If you unplug the SCSI cable from the drive, the seeking will stop, so it isn't the drive doing it on its own.

The crashes where the system does not restart itself are the real killers, such as today when the system was down for ten hours before I could get to it to press RESET.

Also note that the kernel is essentially a stock 2.1.0 kernel (built in November) that has had unneeded drivers removed. No other code changes/fixes have been added. (If someone thinks the stock driver should be used as something to try, I'll try it.)

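The nm excerpt quoted earlier can be checked mechanically. What follows is a minimal Python sketch, not part of the original report, that maps a faulting address to the nearest preceding symbol. It assumes the six symbols quoted from nm /kernel are contiguous in the kernel text (any symbol missing from the excerpt would change the answer); the names SYMBOLS, ADDRS and resolve are made up for illustration.

```python
from bisect import bisect_right

# Symbol table copied from the nm /kernel excerpt in this message.
# Assumption: no other symbols lie between these entries.
SYMBOLS = [
    (0xF0195C90, "_Xintr14"),
    (0xF0195CD8, "Xresume14"),
    (0xF0195D30, "_Xintr15"),
    (0xF0195D78, "Xresume15"),
    (0xF0195DD0, "_doreti"),
    (0xF0195DD4, "doreti_next"),
]
ADDRS = [addr for addr, _ in SYMBOLS]


def resolve(addr):
    """Return (symbol, offset) for the last listed symbol at or below addr."""
    i = bisect_right(ADDRS, addr) - 1
    if i < 0:
        raise ValueError(f"{addr:#x} is below the first listed symbol")
    base, name = SYMBOLS[i]
    return name, addr - base


# The two addresses mentioned in the report.
for label, addr in (("frequent fault", 0xF0195D3C), ("today's panic IP", 0xF0195D2C)):
    name, offset = resolve(addr)
    print(f"{label}: {addr:#x} = {name}+{offset:#x}")
```

Under that assumption, 0xf0195d3c resolves to _Xintr15+0xc, while the IP from today's panic (0xf0195d2c) resolves to Xresume14+0x54, which sits on the IRQ 14 side of the listing rather than the IRQ 15 side.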
What I have tried:

About two weeks ago, each time I discovered the system had crashed, I replaced a piece of hardware on the system, hoping to discover the hardware flaw. I have three identical computer systems (except peripherals) so I was able to do this. When each substitution was made, all previous substitutions were left in place.

Here are the substitutions that were done:

1.  Remove network card. No improvement.

2.  Replace chassis, backplane & power supply (this also replaces floppy, IDE, two of the four serial ports and IDE cable). (CPU/Cache/Memory are not on the backplane in this computer and were not changed.) No improvement.

3.  Replace CPU card with cache module and processor chip. No improvement.

4.  Replace memory card and switch to a different card with a completely different 12Meg of RAM SIMMs. No improvement.

5.  Replace video card. No improvement.

6.  Replace 1540B SCSI adapter. No improvement.

7.  Upgrade BIOS/MCODE on 1540B SCSI adapter. No improvement.

8.  Replace internal SCSI cable. No improvement.

9.  Lower room temperature 5F to 64F. No improvement and I am getting cold.

Now I have a pile of cards, a gutted chassis and the system is crashing about as often as it was before I started exchanging parts. At this point, only the hard drives, tape drive, floppy drives and modems in this system have not been changed and the crashes persist. Arggh!

I know that because I have an IDE drive in the system some will immediately blame it as the cause, and that doesn't surprise me, particularly with the IDE activity light being left on, but it doesn't explain the panic in the IRQ 15 code, or the SCSI drive doing the waltz after a panic, etc. Do we have some bugs in the IDE hard disk driver that are new to 2.1.0?

So, if you can think of anything else to disconnect, change or do without, let me know. But I have just about eliminated hardware as a cause.

Thanks for the review.

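One way to test the impression that crashes cluster around 7AM, 1PM and 1AM is to tally unclean boots by hour from the system log. The following is a minimal Python sketch, assuming syslog lines in the format quoted in this message and treating the "was not properly dismounted" warning as the marker of a reboot after a crash. The function name and the second sample line are hypothetical. Note that it records when the machine came back up, which for a hang can be hours after the actual crash.

```python
import re
from collections import Counter

# Matches lines such as:
#   Mar 21 20:03:21 nemesis /kernel: WARNING: / was not properly dismounted.
# and captures the hour field of the timestamp.
UNCLEAN = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):\d{2}:\d{2}\s.*was not properly dismounted"
)


def unclean_boots_by_hour(lines):
    """Count unclean-boot warnings per hour of day (0-23)."""
    hours = Counter()
    for line in lines:
        match = UNCLEAN.match(line)
        if match:
            hours[int(match.group(1))] += 1
    return hours


sample = [
    # Quoted from the boot log in this message.
    "Mar 21 20:03:21 nemesis /kernel: WARNING: / was not properly dismounted.",
    "Mar 21 20:03:21 nemesis /kernel: npx0: INT 16 interface",
    # Hypothetical line, made up for illustration only.
    "Mar 19 07:12:40 nemesis /kernel: WARNING: / was not properly dismounted.",
]

for hour, count in sorted(unclean_boots_by_hour(sample).items()):
    print(f"{hour:02d}:00  {count}")
```

In practice the lines would be read from wherever syslog writes kernel messages on the machine (often /var/log/messages, which is an assumption here, not something stated in the report).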
Frank Durda IV <uhclem@nemesis.lonestar.org>|"The Knights who say "LETNi"
or uhclem%nemesis@rwsystr.nkn.net           | demand... A SEGMENT REGISTER!!!"
                                            |"A what?"
or ...letni!rwsys!nemesis!uhclem            |"LETNi! LETNi! LETNi!" - 1983