From owner-freebsd-hackers Sun Feb 19 05:33:42 1995
Return-Path: hackers-owner
Received: (from root@localhost) by freefall.cdrom.com (8.6.9/8.6.6)
	id FAA02192 for hackers-outgoing; Sun, 19 Feb 1995 05:33:42 -0800
Received: from augean.eleceng.adelaide.edu.au
	(daemon@augean.eleceng.adelaide.edu.au [129.127.28.4])
	by freefall.cdrom.com (8.6.9/8.6.6) with SMTP id FAA02184
	for ; Sun, 19 Feb 1995 05:33:38 -0800
Received: by augean (5.61+IDA+MU/4.8.36) id AA04058
	for freebsd-hackers@freebsd.org; Mon, 20 Feb 1995 00:03:00 +1030
Message-Id: <9502191333.AA04058@augean.eleceng.adelaide.edu.au>
Received: by frenzy (4.1/COMMUNICA1.2-950118) id AA04370
	for freebsd-hackers%freebsd.org@augean.oz; Sun, 19 Feb 95 22:13:44 CDT
From: mark@communica.oz.au (Mark Newton)
Subject: 2.0-950210-SNAP hangs
To: freebsd-hackers@FreeBSD.org
Date: Sun, 19 Feb 1995 22:13:44 +1030 (CST)
X-Mailer: ELM [version 2.4 PL21]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
Content-Length: 3798
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

I submitted a bug report a few weeks ago about hangs on my 2.0-RELEASE
system. In an act of desperation, I've upgraded to 2.0-950210-SNAP, and
have been disappointed to see that it's still happening.

I've had my hardware fully checked-out. Memory is ok, disk controllers
are ok, etc, etc -- I really doubt the hangs are hardware related.

First: A configuration summary:

    i486DX2/66 (Intel), 16Mb RAM, ISA+VLbus
    Ethernet:      Generic NE2000
    Serial I/O:    10 (2x16450 on multi-io card + 8x16550 on AST-style
                   multiport card)
    Parallel I/O:  lpt on multi-io card
    Disks:         wd0  Western Digital WDC AC2340H Caviar 340Mb
                   wd1  Conner Peripherals CP30254 240Mb
                   sd0  Maxtor XT4380S 320Mb (slow enough for news
                        spool only :-)
                   sd1  Seagate ST3550N 450Mb
                   sd2  Quantum ELS170S 170Mb
    Tape:          st0  Archive Viper 150
    SCSI:          UltraStor 34F VLbus host adaptor

System is configured with about 100Mb of swap split over the two IDE
spindles and the two fastest SCSI spindles (wd0, wd1, sd1, sd2). Dumps
are meant to occur on sd1, but I've never had one work yet (it hangs on
"Dumping: 16" when it tries).

The system is fairly well loaded: It has five dialup modem lines running
with 38.4kbps DTEs and a 9600bps SLIP connection to the outside world.
It also acts as a secondary nameserver for the apana.org.au zone and its
reverse mappings (about 500-odd hosts), an inn-based NNTP server feeding
full newsfeeds to half a dozen downstream NNTP feeds and partial feeds
to 30-odd UUCP sites, the second-priority MX forwarder for about 100
hosts, a medium-sized anonymous ftp archive, a WWW proxy caching server,
and a POP server. It's also the file server for a Sun 3/60 running as
an X11R6 X terminal. At any given time, it can have up to 20-or-so users
logged in.

Summary: It's a busy box. I'm telling you all this because I suspect
that one of the causes of the problem described below is the load placed
on the system.

Now, the problem: At periods ranging from every 6 hours through to every
2 days, the system hangs for no apparent reason. When it hangs, it goes
completely catatonic: It doesn't respond to pings from other hosts on my
ethernet, the console doesn't work, all disk activity stops; nothing can
get any response out of it.

Now, normally this wouldn't be a problem; debugging things like this is
what kernel debuggers are for, right?

Well, no, not really -- When it hangs, it is obviously splx()'ed to a
priority higher than the console, 'cos I can't jump to the debugger (or,
indeed, get anything else on the console happening). If it is splx()'ed
to a value like that, that would tend to suggest that the root cause is
either something to do with the network or something to do with the
disks (it could be anything with a higher priority than the console, I
know, but those two seem most likely to me).

Since I can't escape to the debugger, I am only able to guess at the
cause of the problem. Now, I said above that I suspected load, but again,
that's only a guess. To be totally truthful, it's a wild guess at that:
The load average on the system rarely gets above 3, and spends most of
its time at values less than 1. systat shows that the CPU usually spends
at least 20% of its time idle.

Before I switched to 2.0, I was getting good uptimes (under 1.1.5.1, I
had 38 days before I had to shut it down to do some recabling). Since I
upgraded to 2.0 a month ago, I haven't had an uptime greater than 3 days.

Has anyone else with a similar configuration and/or load had similar
problems with 2.0 (release or snapshot)? Does anyone have any
suggestions on how to debug a problem like this when there is no
indication of where to start before it manifests itself and no way to
perform a post mortem after it has happened?

   - mark