From owner-freebsd-hackers Fri Jan 5 09:35:12 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3) id JAA18691 for hackers-outgoing; Fri, 5 Jan 1996 09:35:12 -0800 (PST)
Received: from brasil.moneng.mei.com (brasil.moneng.mei.com [151.186.109.160]) by freefall.freebsd.org (8.7.3/8.7.3) with ESMTP id JAA18684 for ; Fri, 5 Jan 1996 09:35:08 -0800 (PST)
Received: (from jgreco@localhost) by brasil.moneng.mei.com (8.7.Beta.1/8.7.Beta.1) id LAA02253 for hackers@freebsd.org; Fri, 5 Jan 1996 11:34:37 -0600
From: Joe Greco
Message-Id: <199601051734.LAA02253@brasil.moneng.mei.com>
Subject: Machine "disappears" off the net..?
To: hackers@freebsd.org
Date: Fri, 5 Jan 1996 11:34:37 -0600 (CST)
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Sender: owner-hackers@freebsd.org
Precedence: bulk

I'm seeing an odd problem on news.sol.net (DX4/100 ASUS SP3G, 48MB, NCR810, AHA-3940, SMC8216 incorrectly identified as SMC8416), that just started happening recently, and has suddenly been very bad this morning.

At first I was fairly sure it was a hardware problem, but then I wasn't so sure, as I increased the network load on the machine and these problems started to appear, and then yesterday increased the network load again and the problems got worse. And there is something else strange (see below).

The machine appears to "drop off" the network for indefinite (1min-1hr) periods of time.
Symptoms are consistent with a marginal Ethernet cable at first glance:

hummin# netstat -I ed0                                        (*news.sol.net*)
Name  Mtu   Network   Address              Ipkts Ierrs    Opkts Oerrs  Coll
ed0   1500            00.00.c0.1e.84.75  7080742     0  5956928  4503 64262

Note in particular the output error rate and relatively high collision count (as compared to the router, below, which has been up 90++ days):

trantor# netstat -I ed4                                       (*router*)
Name  Mtu   Network   Address               Ipkts Ierrs    Opkts Oerrs  Coll
ed4   1500            00.40.c7.20.d5.c1  71246705  2522 86820038     0 71608

It looks somewhat odd to me, because I usually see both ierrs and oerrs on segments with bad cables, yet I only seem to be seeing it in "one" direction. Could be a bad card, probably the tx on hummin...

But I then noticed that trantor continues to receive rwho broadcasts from hummin during these periods of deadness, even though I am not able to ping hummin from either trantor or another host on that wire. And hummin isn't showing any signs of input errors.

Syslog messages: none.

Level of head scratching: severe.

I have not had the good fortune to catch this happening while I am down at the office, so I can't say what hummin is/isn't seeing during these periods.

My most recent thinking is that there may be some sort of resource shortage, so I looked at netstat -m, but I don't see anything obviously wrong:

hummin# netstat -m
184 mbufs in use:
        84 mbufs allocated to data
        26 mbufs allocated to packet headers
        68 mbufs allocated to protocol control blocks
        6 mbufs allocated to socket names and addresses
72/232 mbuf clusters in use
487 Kbytes allocated to network (34% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

Does anyone have any ideas, theories, or suggestions? Obviously the networking hardware is already slated for replacement, but I'm not confident that that's the problem here.

...
Joe
-------------------------------------------------------------------------------
Joe Greco - Systems Administrator                            jgreco@ns.sol.net
Solaria Public Access UNIX - Milwaukee, WI                        414/342-4847
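
To put the "output error rate and relatively high collision count" comparison into numbers, here is a rough sketch that turns the two netstat -I counter sets quoted above into percentages. The counter values are copied from the message; the dictionary layout, the rates() helper, and the choice to express collisions per output packet are assumptions made for illustration, not anything netstat reports itself.

```python
# Counters copied from the two `netstat -I` outputs quoted in the message.
# The dict layout and key names are illustrative, not a netstat format.
counters = {
    "hummin ed0": {"ipkts": 7080742, "ierrs": 0,
                   "opkts": 5956928, "oerrs": 4503, "coll": 64262},
    "trantor ed4": {"ipkts": 71246705, "ierrs": 2522,
                    "opkts": 86820038, "oerrs": 0, "coll": 71608},
}


def rates(c):
    """Return error and collision rates as percentages.

    Assumption: collisions are expressed relative to output packets,
    since they are counted on the transmit side.
    """
    return {
        "ierr_pct": 100.0 * c["ierrs"] / c["ipkts"],
        "oerr_pct": 100.0 * c["oerrs"] / c["opkts"],
        "coll_pct": 100.0 * c["coll"] / c["opkts"],
    }


for name, c in counters.items():
    r = rates(c)
    print(f"{name}: ierr {r['ierr_pct']:.4f}%  "
          f"oerr {r['oerr_pct']:.4f}%  coll {r['coll_pct']:.4f}%")
```

On these figures hummin's collision rate per output packet comes out at a little over 1%, against well under 0.1% for trantor, and only hummin shows any output errors at all, which is the one-directional pattern described in the message.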