From owner-freebsd-current  Fri Mar 28 03:09:42 1997
Return-Path: <owner-current>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id DAA04503
          for current-outgoing; Fri, 28 Mar 1997 03:09:42 -0800 (PST)
Received: from oxmail4.ox.ac.uk (oxmail4.ox.ac.uk [163.1.2.33])
          by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id DAA04495
          for <freebsd-current@freebsd.org>; Fri, 28 Mar 1997 03:09:37 -0800 (PST)
Received: from njl2.materials.ox.ac.uk by oxmail4 with SMTP (PP);
          Fri, 28 Mar 1997 11:09:20 +0000
Received: by njl2.materials.ox.ac.uk (950413.SGI.8.6.12/940406.SGI)	 
          id LAA08977; Fri, 28 Mar 1997 11:09:17 GMT
Date: Fri, 28 Mar 1997 11:09:17 GMT
From: neil.long@materials.oxford.ac.uk (Neil J Long)
Message-Id: <9703281109.ZM8975@njl2.materials.ox.ac.uk>
In-Reply-To: Bill Fenner <fenner@parc.xerox.com>        "Re:  Possible routed problem 2.2" (Mar 22, 10:00am)
References: <97Mar22.100042pst.177486@crevenia.parc.xerox.com>
X-Mailer: Z-Mail-SGI (3.2S.3 08feb96 MediaMail)
To: Bill Fenner <fenner@parc.xerox.com>, freebsd-current@freebsd.org
Subject: 2.2.1 network and amanda dump problem (again)
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

This is a repeat of the earlier problem, which I thought was a routed bug.
The system is 2.2.1, cvsup'd, with world and kernel built as of March 26th.

I don't know how to debug this problem: logging in remotely changes the
network-related allocations, and once it happens I cannot log in any more. It
seems to occur during level 0 dumps, where the quantity of data is higher and
the dump takes longer. I will run netstat -m in a loop and log what I can, but
this looks obscure and is probably limited to the amanda sendbackup program
(which runs suid root).
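
A minimal sketch of such a logging loop, assuming Python 3 is available on the
client. The names snapshot and log_loop, the log path and the interval are
illustrative choices, not part of amanda or FreeBSD. Each sample is appended
and synced to disk straight away, so the last entries should survive if the
machine has to be rebooted.

```python
import os
import subprocess
import time
from datetime import datetime


def snapshot(cmd=("netstat", "-m")):
    """Run cmd once and return its output under a timestamp header."""
    try:
        result = subprocess.run(list(cmd), capture_output=True,
                                text=True, timeout=30)
        body = result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Record the failure instead of stopping the loop.
        body = "could not run %s: %s\n" % (" ".join(cmd), exc)
    stamp = datetime.now().isoformat(timespec="seconds")
    return "=== %s ===\n%s" % (stamp, body)


def log_loop(path, interval=60, count=None, cmd=("netstat", "-m")):
    """Append a snapshot to path every interval seconds.

    count=None loops until interrupted; a number stops after that many samples.
    """
    taken = 0
    while count is None or taken < count:
        # Reopen, flush and fsync on every pass so each sample reaches the disk.
        with open(path, "a") as fh:
            fh.write(snapshot(cmd))
            fh.flush()
            os.fsync(fh.fileno())
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
```

Something like log_loop("/var/tmp/netstat-m.log", interval=30), started from
the console before the dumps begin, would leave a record that can be read after
the network stops responding.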

Original postings follow and then some more info after that.

On Mar 22, 10:00am, Bill Fenner wrote:
> Subject: Re:  Possible routed problem 2.2
> >The machine was not contactable via the net but was still running when I
> >went in to check. These are the errors logged in messages
> >
> >Mar 22 01:15:00 njl sendbackup[446]: error [dump returned 3, compress got signal 13]
> >Mar 22 01:15:22 njl routed[58]: punt RTM_LOSING without gateway
> >Mar 22 01:15:46 njl routed[58]: punt RTM_LOSING without gateway
>
> These mean that TCP to a machine on the local network was having
> trouble.  (Probably just another symptom of what you show below).  It's
> a bug in routed that it logs this fact at such a high level; the TCP
> trouble has nothing to do with routed.
>
> >ping: sendto: No buffer space available
>
> Looks like your machine ran out of mbuf's.  If you can recreate this, try
> "netstat -m" and see if any category has a much higher number than the
> others.
>
>   Bill
>-- End of excerpt from Bill Fenner

This happened again last night - the PC was idle when the amanda dumps started
(/usr/ports/misc/amanda).

After logging in on the console, netstat -m shows:

144 mbufs in use
	86 mbufs allocated to data
	49 mbufs allocated to packet headers
	7 mbufs allocated to protocol control blocks
	2 mbufs allocated to socket names and addresses
10/26 mbuf clusters in use
70 Kbytes allocated to network (54% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
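
Bill's suggestion of looking for a category that is much higher than the others
can be checked mechanically over a series of logged samples. The sketch below
assumes Python 3 and the line formats shown in the output above (they may
differ between releases); parse_netstat_m, largest_category and SAMPLE are
illustrative names.

```python
import re

# One sample, copied from the console output above.
SAMPLE = """\
144 mbufs in use
    86 mbufs allocated to data
    49 mbufs allocated to packet headers
    7 mbufs allocated to protocol control blocks
    2 mbufs allocated to socket names and addresses
10/26 mbuf clusters in use
70 Kbytes allocated to network (54% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
"""


def parse_netstat_m(text):
    """Pull the main counters out of one block of netstat -m output."""
    stats = {"categories": {}}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"(\d+) mbufs allocated to (.+)", line)
        if m:
            stats["categories"][m.group(2)] = int(m.group(1))
            continue
        m = re.match(r"(\d+) mbufs in use", line)
        if m:
            stats["mbufs_in_use"] = int(m.group(1))
            continue
        m = re.match(r"(\d+)/(\d+) mbuf clusters in use", line)
        if m:
            stats["clusters"] = (int(m.group(1)), int(m.group(2)))
            continue
        m = re.match(r"(\d+) requests for memory (denied|delayed)", line)
        if m:
            stats[m.group(2)] = int(m.group(1))
    return stats


def largest_category(stats):
    """Return the (name, count) pair of the biggest mbuf category."""
    return max(stats["categories"].items(), key=lambda kv: kv[1])
```

For the sample above the per-category counts add up to the 144 mbufs in use and
no requests were denied, so this particular snapshot does not show exhaustion.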

On the amanda server the dumps were recorded as

FAILURE AND STRANGE DUMP SUMMARY:
  njl        /usr lev 0 FAILED [data timeout]
  njl        /home lev 0 FAILED [data timeout]
  njl        /var lev 0 FAILED [could not connect to njl]
  njl        / lev 1 FAILED [could not connect to njl]
...

FAILED AND STRANGE DUMP DETAILS:

/-- njl        /usr lev 0 FAILED [data timeout]
sendbackup: start [njl:/usr level 0 datestamp 19970328]
|   DUMP: Date of this level 0 dump: Fri Mar 28 00:56:23 1997
|   DUMP: Date of last level 0 dump: the epoch
|   DUMP: Dumping /dev/rsd0s1e (/usr) to standard output
|   DUMP: mapping (Pass I) [regular files]
|   DUMP: mapping (Pass II) [directories]
|   DUMP: estimated 276523 tape blocks.
|   DUMP: dumping (Pass III) [directories]
|   DUMP: dumping (Pass IV) [regular files]
|   DUMP: 8.31% done, finished in 0:55
|   DUMP: 15.83% done, finished in 0:53
|   DUMP: 22.28% done, finished in 0:52
|   DUMP: 28.84% done, finished in 0:49
|   DUMP: 36.26% done, finished in 0:43
|   DUMP: 44.74% done, finished in 0:37
|   DUMP: 51.38% done, finished in 0:33
|   DUMP: 57.95% done, finished in 0:29
\--------

/-- njl        /home lev 0 FAILED [data timeout]
sendbackup: start [njl:/home level 0 datestamp 19970328]
|   DUMP: Date of this level 0 dump: Fri Mar 28 00:56:12 1997
|   DUMP: Date of last level 0 dump: the epoch
|   DUMP: Dumping /dev/rsd0s4e (/home) to standard output
|   DUMP: mapping (Pass I) [regular files]
|   DUMP: mapping (Pass II) [directories]
|   DUMP: estimated 272941 tape blocks.
|   DUMP: dumping (Pass III) [directories]
|   DUMP: dumping (Pass IV) [regular files]
|   DUMP: 5.83% done, finished in 1:20
|   DUMP: 9.84% done, finished in 1:31
|   DUMP: 17.51% done, finished in 1:10
|   DUMP: 26.87% done, finished in 0:54
|   DUMP: 34.17% done, finished in 0:48
|   DUMP: 43.31% done, finished in 0:39
|   DUMP: 52.10% done, finished in 0:32
|   DUMP: 60.08% done, finished in 0:26
\--------


The PC has a 3Com 3C589 combo card (ep0).

routed reported the same error before sendbackup timed out.


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*  Neil J Long, Department of Materials, University of Oxford
*               Parks Road, Oxford, OX1 3PH, UK
*  EMail:       Neil.Long@materials.oxford.ac.uk  
*  Tel:         +44 (0)1865-273678 Fax: +44 (0)1865-273789