From owner-freebsd-hackers  Fri Jan  5 09:35:12 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id JAA18691
          for hackers-outgoing; Fri, 5 Jan 1996 09:35:12 -0800 (PST)
Received: from brasil.moneng.mei.com (brasil.moneng.mei.com [151.186.109.160])
          by freefall.freebsd.org (8.7.3/8.7.3) with ESMTP id JAA18684
          for <hackers@freebsd.org>; Fri, 5 Jan 1996 09:35:08 -0800 (PST)
Received: (from jgreco@localhost) by brasil.moneng.mei.com (8.7.Beta.1/8.7.Beta.1) id LAA02253 for hackers@freebsd.org; Fri, 5 Jan 1996 11:34:37 -0600
From: Joe Greco <jgreco@brasil.moneng.mei.com>
Message-Id: <199601051734.LAA02253@brasil.moneng.mei.com>
Subject: Machine "disappears" off the net..?
To: hackers@freebsd.org
Date: Fri, 5 Jan 1996 11:34:37 -0600 (CST)
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Sender: owner-hackers@freebsd.org
Precedence: bulk

I'm seeing an odd problem on news.sol.net (DX4/100 ASUS SP3G, 48MB, NCR810,
AHA-3940, SMC8216 incorrectly identified as SMC8416).  It started happening
recently, and it has suddenly become very bad this morning.

At first I was fairly sure it was a hardware problem.  Now I'm less sure:
the problems first appeared when I increased the network load on the
machine, and they got worse when I increased the load again yesterday.
There is also something else strange (see below).

The machine appears to "drop off" the network for indefinite (1min-1hr)
periods of time.  Symptoms are consistent with a marginal Ethernet cable
at first glance:

hummin# netstat -I ed0		(*news.sol.net*)
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
ed0   1500  <Link>00.00.c0.1e.84.75      7080742     0  5956928  4503 64262

Note in particular the output error count and the relatively high collision
count (compared to the router, below, which has been up 90+ days):

trantor# netstat -I ed4		(*router*)
Name  Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs  Coll
ed4   1500  <Link>00.40.c7.20.d5.c1     71246705  2522 86820038     0 71608

It looks somewhat odd to me, because I usually see both ierrs and oerrs on
segments with bad cables, yet on hummin I only seem to be seeing errors in
one direction.  It could be a bad card, probably the transmitter on hummin...
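
To put the two netstat -I outputs on a comparable footing, the raw counts
can be turned into rates.  The following is only a rough Python sketch: the
helper names are made up for illustration, and dividing collisions by output
packets is an assumed convention, not something netstat itself reports.

```python
def parse_netstat_row(row):
    """Parse one data row of `netstat -I` link-level output into counters."""
    fields = row.split()
    # The last five columns are Ipkts, Ierrs, Opkts, Oerrs, Coll.
    ipkts, ierrs, opkts, oerrs, coll = (int(f) for f in fields[-5:])
    return {"name": fields[0], "ipkts": ipkts, "ierrs": ierrs,
            "opkts": opkts, "oerrs": oerrs, "coll": coll}


def rates(counters):
    """Return error and collision rates as percentages.

    Assumption: collisions are expressed relative to output packets.
    """
    return {
        "ierr_pct": 100.0 * counters["ierrs"] / counters["ipkts"],
        "oerr_pct": 100.0 * counters["oerrs"] / counters["opkts"],
        "coll_pct": 100.0 * counters["coll"] / counters["opkts"],
    }


# Rows copied from the netstat output quoted above.
hummin = parse_netstat_row(
    "ed0   1500  <Link>00.00.c0.1e.84.75      7080742     0  5956928  4503 64262")
trantor = parse_netstat_row(
    "ed4   1500  <Link>00.40.c7.20.d5.c1     71246705  2522 86820038     0 71608")

for host, counters in (("hummin", hummin), ("trantor", trantor)):
    r = rates(counters)
    print("%-8s ierr %.4f%%  oerr %.4f%%  coll %.4f%%"
          % (host, r["ierr_pct"], r["oerr_pct"], r["coll_pct"]))
```

On these numbers hummin's collision rate comes to roughly 1.08% of output
packets against roughly 0.08% on trantor, and hummin's output error rate to
about 0.076%.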

But I then noticed that trantor continues to receive rwho broadcasts from
hummin during these periods of deadness, even though I am not able to ping
hummin from either trantor or another host on that wire.  And hummin isn't
showing any signs of input errors.

Syslog messages:  none.

Level of head scratching:  severe.

I have not had the good fortune to catch this happening while I am down at
the office, so I can't say what hummin is/isn't seeing during these
periods.

My most recent thinking is that there may be some sort of resource shortage,
so I looked at netstat -m, but I don't see anything obviously wrong:

hummin# netstat -m
184 mbufs in use:
        84 mbufs allocated to data
        26 mbufs allocated to packet headers
        68 mbufs allocated to protocol control blocks
        6 mbufs allocated to socket names and addresses
72/232 mbuf clusters in use
487 Kbytes allocated to network (34% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
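
As a sanity check on the netstat -m figures, the totals can be reproduced by
hand.  This is a sketch under two assumptions that are not stated in the
output itself: an mbuf size of 128 bytes and a cluster size of 2048 bytes
(MSIZE and MCLBYTES are the usual BSD names for these), and that every
allocated mbuf is counted as in use.

```python
MSIZE = 128       # assumed bytes per mbuf
MCLBYTES = 2048   # assumed bytes per mbuf cluster

# Per-type mbuf counts, copied from the netstat -m output above.
mbuf_types = {
    "data": 84,
    "packet headers": 26,
    "protocol control blocks": 68,
    "socket names and addresses": 6,
}
mbufs_in_use = sum(mbuf_types.values())

# "72/232 mbuf clusters in use"
clusters_in_use, clusters_total = 72, 232

total_bytes = mbufs_in_use * MSIZE + clusters_total * MCLBYTES
used_bytes = mbufs_in_use * MSIZE + clusters_in_use * MCLBYTES

total_kbytes = total_bytes // 1024
pct_in_use = 100 * used_bytes // total_bytes

print("%d mbufs in use" % mbufs_in_use)
print("%d Kbytes allocated to network (%d%% in use)"
      % (total_kbytes, pct_in_use))
```

Under those assumptions the sketch reproduces the 184 mbufs, 487 Kbytes and
34% shown above, so the reported totals are at least internally consistent.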

Does anyone have any ideas, theories, or suggestions?  Obviously the
networking hardware is already slated for replacement, but I'm not 
confident that that's the problem here.

... Joe

-------------------------------------------------------------------------------
Joe Greco - Systems Administrator			      jgreco@ns.sol.net
Solaria Public Access UNIX - Milwaukee, WI			   414/342-4847