Date: Tue, 5 May 2009 17:59:32 -0300 (ADT)
From: "Marc G. Fournier" <scrappy@hub.org>
To: freebsd-stable@freebsd.org
Message-ID: <20090505174426.M18967@hub.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Subject: server hangs, break to DDB hangs ...


I have two HP ProLiant servers that, until recently, ran very stably 
... but within the past 2 months, the servers have started to hang after 
anywhere from 10 hours to 19 days (one just hung this afternoon) ...

vmstat, about the time it hangs, shows:

# cat 16/vmstat.out
  procs      memory      page                    disks     faults cpu
  r b w     avm    fre   flt  re  pi  po    fr  sr da0 pa0   in   sy   cs us sy id
109 156 1 17035752   62152   803  19   5   3  1907 1785   0   0  437  294 853 50 28 22
  2 332 5 17109460   23056 147346 4319 2061 3139 44030 6539423 1029   0 4027 398263 38616 40 58  2
  0 32 8 17110588   23052   626 4216  35 203   344 745 572   0  597 16414 5741  4 10 86
  0 35 14 17110592   23084   446 5102   2 410   210 1596 540   0  516 31616 4461  4 10 85
  0 25 20 17110588   23032   196 7734   2 280    22 1179 445   0  434 34992 3543  5  7 88

By the time I was able to reboot it, the final vmstat was showing:

# cat 46/vmstat.out
  procs      memory      page                    disks     faults cpu
  r b w     avm    fre   flt  re  pi  po    fr  sr da0 pa0   in   sy   cs us sy id
  1 492 1595 24292424   99564   809  20   5   4  1909 1896   0   0  437 737  863 50 28 22
  1 399 1596 24285028   90708  6195 152 393  76  3185 1061 414   0  683 54948 32062  8  9 82
  2 231 1595 24276684   85164  4709  94 219 152  3729 642 554   0  420 39442 20612  7 12 80
  1 174 1595 24259144   71288  8204 143 314 158  3379 1314 605   0  547 36228 21219 11 18 71
  2 199 1593 24242500   72116  4637  52 251 195  3957 1609 496   0  383 32305 20225  6 12 82

When I try to break to DDB, all I get on the screen is:

===
KDB: enter: Break sequence on conec
===

And then it hangs there ...

I have ps listings that go back for just over an hour before I rebooted 
(the script runs every 5 minutes, or is supposed to):

# ls -lt */ps*
-rw-r--r--  1 root  wheel  509908 May  5 16:47 46/ps.out
-rw-r--r--  1 root  wheel  450704 May  5 16:35 35/ps.out
-rw-r--r--  1 root  wheel  424047 May  5 16:32 26/ps.out
-rw-r--r--  1 root  wheel  329105 May  5 16:21 21/ps.out
-rw-r--r--  1 root  wheel  278189 May  5 16:17 16/ps.out
-rw-r--r--  1 root  wheel  246726 May  5 15:55 55/ps.out
-rw-r--r--  1 root  wheel  231937 May  5 15:50 50/ps.out
-rw-r--r--  1 root  wheel  240260 May  5 15:45 45/ps.out
-rw-r--r--  1 root  wheel  234731 May  5 15:40 40/ps.out
-rw-r--r--  1 root  wheel  233719 May  5 15:30 30/ps.out
-rw-r--r--  1 root  wheel  222749 May  5 15:25 25/ps.out
-rw-r--r--  1 root  wheel  231617 May  5 15:20 20/ps.out


Looking at swap usage over that period, it's obvious that something is 
sucking back the RAM reasonably fast:

neptune# cat 46/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216 13789464  2987752    82%
neptune# cat 35/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216 12482312  4294904    74%
neptune# cat 26/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216 12351920  4425296    74%
neptune# cat 21/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216  7807240  8969976    47%
neptune# cat 16/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216  5752832 11024384    34%
neptune# cat 55/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216  4398928 12378288    26%
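
As a rough sketch, each swap.out snapshot can be reduced to a single number 
so the growth is easier to compare. The helper names swap_used_mib and 
swap_growth_mib are hypothetical, and the code assumes a single swap device 
reported in 512-byte blocks, as in the listings above. Between the 55 and 46 
snapshots, Used grows from 4398928 to 13789464 blocks, which is about 4.5 GiB.

```shell
# Hypothetical helper: print the "Used" column of `pstat -s` output in MiB.
# Assumes one swap device (data on line 2) and 512-byte blocks,
# so 2048 blocks make up one MiB. Reads the named file, or stdin if none.
swap_used_mib() {
    awk 'FNR == 2 { printf "%d\n", $3 / 2048 }' "$@"
}

# Hypothetical helper: MiB of swap growth between an older and a newer
# snapshot file, e.g.  swap_growth_mib 55/swap.out 46/swap.out
swap_growth_mib() {
    old=$(swap_used_mib "$1")
    new=$(swap_used_mib "$2")
    echo "$((new - old))"
}
```

Dividing the result by the minutes between two snapshots gives an 
approximate MiB/minute rate, which could be logged next to each swap.out.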

But I'm not sure what to look at in the ps output to determine what is 
going awry here ...
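
One possible starting point, as a sketch only: total the memory columns per 
command name and see which total grows from one snapshot to the next. This 
assumes an extra, simpler ps invocation than the one in my monitor script, 
namely `ps -axo jid,pid,vsz,rss,comm`, so that the columns sit in fixed 
positions; the function name top_mem_by_comm and the TOP variable are 
hypothetical, and command names are assumed to contain no spaces.

```shell
# Hypothetical helper: read `ps -axo jid,pid,vsz,rss,comm` style output
# (one header line, then JID PID VSZ RSS COMMAND) from files or stdin,
# sum VSZ and RSS (in KiB) per command name, and print the commands with
# the largest total VSZ first as: total_vsz total_rss process_count command
top_mem_by_comm() {
    awk 'NR > 1 {
             vsz[$5] += $3      # total virtual size per command
             rss[$5] += $4      # total resident size per command
             n[$5]++            # number of processes with that name
         }
         END {
             for (c in n)
                 printf "%d %d %d %s\n", vsz[c], rss[c], n[c], c
         }' "$@" | sort -rn | head -n "${TOP:-10}"
}
```

Running this against two snapshots taken a few minutes apart and comparing 
the top entries should show which command is growing. Keying the arrays on 
$1 instead of $5 would group the totals by jail ID rather than by command.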

I'm running

      7.1-STABLE FreeBSD 7.1-STABLE #14: Sat Mar 28 00:05:19 ADT 2009

on the server that just hung, so I will upgrade to the latest 7.2-RELEASE 
next, but in the meantime, can someone give me pointers on what else I 
should be checking, or on what I should be looking for in the ps listings?  
My monitor script is currently doing:

/usr/sbin/jls > jaillist.out
/bin/ps -aucxHl -O jid > ps.out
/usr/sbin/pstat -s > swap.out
/usr/bin/vmstat 1 5 > vmstat.out
/usr/bin/awk '{print $15}' /proc/*/status | /usr/bin/sort | /usr/bin/uniq -c > vps_dist.out

Any pointers appreciated ...

Thx

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email . scrappy@hub.org                              MSN . scrappy@hub.org
Yahoo . yscrappy               Skype: hub.org        ICQ . 7615664