From owner-freebsd-hackers  Sun Feb 19 05:33:42 1995
Return-Path: hackers-owner
Received: (from root@localhost) by freefall.cdrom.com (8.6.9/8.6.6) id FAA02192 for hackers-outgoing; Sun, 19 Feb 1995 05:33:42 -0800
Received: from augean.eleceng.adelaide.edu.au (daemon@augean.eleceng.adelaide.edu.au [129.127.28.4]) by freefall.cdrom.com (8.6.9/8.6.6) with SMTP id FAA02184 for <freebsd-hackers@freebsd.org>; Sun, 19 Feb 1995 05:33:38 -0800
Received: by augean (5.61+IDA+MU/4.8.36)
	id AA04058 for freebsd-hackers@freebsd.org; Mon, 20 Feb 1995 00:03:00 +1030
Message-Id: <9502191333.AA04058@augean.eleceng.adelaide.edu.au>
Received: by frenzy (4.1/COMMUNICA1.2-950118)
	id AA04370 for freebsd-hackers%freebsd.org@augean.oz; Sun, 19 Feb 95 22:13:44 CDT
From: mark@communica.oz.au (Mark Newton)
Subject: 2.0-950210-SNAP hangs
To: freebsd-hackers@FreeBSD.org
Date: Sun, 19 Feb 1995 22:13:44 +1030 (CST)
X-Mailer: ELM [version 2.4 PL21]
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
Content-Length: 3798      
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

I submitted a bug report a few weeks ago about hangs on my 2.0-RELEASE 
system.  In an act of desperation, I've upgraded to 2.0-950210-SNAP, and
have been disappointed to see that it's still happening.

I've had my hardware fully checked out.  Memory is ok, disk controllers
are ok, etc, etc -- I really doubt the hangs are hardware related.

First:  A configuration summary:

i486DX2/66 (Intel), 16Mb RAM, ISA+VLbus
Ethernet:  Generic NE2000
Serial I/O:  10 (2x16450 on multi-io card + 8x16550 on AST-style
        multiport card)
Parallel I/O:  lpt on multi-io card
Disks:
    wd0    Western Digital WDC AC2340H Caviar 340Mb
    wd1    Conner Peripherals CP30254 240Mb
    sd0    Maxtor XT4380S 320Mb (slow enough for news spool only :-)
    sd1    Seagate ST3550N 450Mb
    sd2    Quantum ELS170S 170Mb
Tape: st0  Archive Viper 150
SCSI: UltraStor 34F VLbus host adaptor

System is configured with about 100Mb of swap split over the two IDE spindles
and the two fastest SCSI spindles (wd0, wd1, sd1, sd2).  Dumps are meant to
occur on sd1, but I've never had one work yet (it hangs on "Dumping: 16"
when it tries).

The system is fairly well loaded:  It has five dialup modem lines running
with 38.4kbps DTEs and a 9600bps SLIP connection to the outside world.  It
also acts as a secondary nameserver for the apana.org.au zone and its reverse
mappings (500-odd hosts), an inn-based NNTP server feeding full newsfeeds
to half a dozen downstream NNTP sites and partial feeds to 30-odd UUCP
sites, the second-priority MX forwarder for about 100 hosts, a
medium-sized anonymous ftp archive, a WWW proxy caching server, and a POP
server.  It's also the file server for a Sun 3/60 running as an X11R6 X
terminal.  At any given time, it can have up to 20-or-so users logged in.

Summary:  It's a busy box.  I'm telling you all this because I suspect that
one of the causes of the problem described below is the load placed on the
system.

Now, the problem:  At intervals ranging from 6 hours to 2 days, the system
hangs for no apparent reason.  When it hangs, it goes
completely catatonic:  It doesn't respond to pings from other hosts on my
ethernet, the console doesn't work, all disk activity stops;  nothing can
get any response out of it.

Now, normally this wouldn't be a problem;  debugging things like this is
what kernel debuggers are for, right?  Well, no, not really -- when it 
hangs, it is obviously splx()'ed to a priority higher than the console,
'cos I can't jump to the debugger (or, indeed, get anything else on the 
console happening).  If it is splx()'ed to a value like that, that would
tend to suggest that the root cause is either something to do with the 
network or something to do with the disks (it could be anything with
a higher priority than the console, I know, but those two seem most
likely to me).
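
To make that reasoning concrete, here is a toy sketch in C of what I mean
by "a priority higher than the console".  It is only a model:  the level
names, their strict ordering, and the toy_* functions are all made up for
illustration and are not the real kernel's spl code or interrupt masks.
The point is simply that once something raises the level above the
console's and never restores it, the console interrupt is never delivered.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only.  These level names and their ordering are illustrative
 * assumptions, NOT the actual FreeBSD 2.0 interrupt masks.
 */
enum ipl { IPL_NONE = 0, IPL_TTY, IPL_NET, IPL_BIO, IPL_HIGH };

/* Current (hypothetical) interrupt priority level of the toy machine. */
static enum ipl cur_ipl = IPL_NONE;

/* Raise the current level (never lowers it); return the previous level. */
enum ipl toy_splraise(enum ipl level)
{
    enum ipl old = cur_ipl;

    if (level > cur_ipl)
        cur_ipl = level;
    return old;
}

/* Restore a level previously returned by toy_splraise(). */
void toy_splx(enum ipl saved)
{
    assert(saved <= IPL_HIGH);
    cur_ipl = saved;
}

/* An interrupt is delivered only if its level is above the current one. */
bool toy_deliverable(enum ipl irq_level)
{
    return irq_level > cur_ipl;
}
```

In this model, after toy_splraise(IPL_BIO) the call toy_deliverable(IPL_TTY)
returns false until the saved level is handed back to toy_splx().  A code
path that loops or sleeps forever between those two calls would look
exactly like what I'm seeing:  no console, no debugger.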

Since I can't escape to the debugger, I am only able to guess at the
cause of the problem.

Now, I said above that I suspected load, but again, that's only a guess.
To be totally truthful, it's a wild guess at that:  The load average on
the system rarely gets above 3, and spends most of its time at values 
less than 1.  systat shows that the CPU usually spends at least 20% of its
time idle.

Before I switched to 2.0, I was getting good uptimes (under 1.1.5.1, I had
38 days before I had to shut it down to do some recabling).  Since I upgraded
to 2.0 a month ago, I haven't had an uptime greater than 3 days.

Has anyone else with a similar configuration and/or load had similar
problems with 2.0 (release or snapshot)?

Does anyone have any suggestions on how to debug a problem like this 
when there is no indication of where to start before it manifests itself
and no way to perform a post mortem after it has happened?


    - mark