Date: Tue, 5 May 2009 17:59:32 -0300 (ADT)
From: "Marc G. Fournier"
To: freebsd-stable@freebsd.org
Message-ID: <20090505174426.M18967@hub.org>
Subject: server hangs, break to DDB hangs ...
List-Id: Production branch of FreeBSD source code

I have two HP ProLiant servers that, until recently, ran very reliably. Within the past two months, they have started hanging after anywhere from 10 hours to 19 days (one just hung this afternoon).

vmstat, at about the time the hang starts, shows:

# cat 16/vmstat.out
procs        memory         page                                disks    faults            cpu
  r   b    w      avm   fre    flt   re   pi   po    fr      sr  da0 pa0   in     sy    cs us sy id
109 156    1 17035752 62152    803   19    5    3  1907    1785    0   0  437    294   853 50 28 22
  2 332    5 17109460 23056 147346 4319 2061 3139 44030 6539423 1029   0 4027 398263 38616 40 58  2
  0  32    8 17110588 23052    626 4216   35  203   344     745  572   0  597  16414  5741  4 10 86
  0  35   14 17110592 23084    446 5102    2  410   210    1596  540   0  516  31616  4461  4 10 85
  0  25   20 17110588 23032    196 7734    2  280    22    1179  445   0  434  34992  3543  5  7 88

By the time I was able to reboot it, the final vmstat was showing:

# cat 46/vmstat.out
procs        memory         page                                disks    faults            cpu
  r   b    w      avm   fre    flt   re   pi   po    fr      sr  da0 pa0   in     sy    cs us sy id
  1 492 1595 24292424 99564    809   20    5    4  1909    1896    0   0  437    737   863 50 28 22
  1 399 1596 24285028 90708   6195  152  393   76  3185    1061  414   0  683  54948 32062  8  9 82
  2 231 1595 24276684 85164   4709   94  219  152  3729     642  554   0  420  39442 20612  7 12 80
  1 174 1595 24259144 71288   8204  143  314  158  3379    1314  605   0  547  36228 21219 11 18 71
  2 199 1593 24242500 72116   4637   52  251  195  3957    1609  496   0  383  32305 20225  6 12 82

When I try to break into DDB, all I get on the screen is:

===
KDB: enter: Break sequence on conec
===

And then it hangs there ...

I have ps listings that go back just over an hour before I rebooted (the script runs every 5 minutes, or is supposed to):

# ls -lt */ps*
-rw-r--r--  1 root  wheel  509908 May  5 16:47 46/ps.out
-rw-r--r--  1 root  wheel  450704 May  5 16:35 35/ps.out
-rw-r--r--  1 root  wheel  424047 May  5 16:32 26/ps.out
-rw-r--r--  1 root  wheel  329105 May  5 16:21 21/ps.out
-rw-r--r--  1 root  wheel  278189 May  5 16:17 16/ps.out
-rw-r--r--  1 root  wheel  246726 May  5 15:55 55/ps.out
-rw-r--r--  1 root  wheel  231937 May  5 15:50 50/ps.out
-rw-r--r--  1 root  wheel  240260 May  5 15:45 45/ps.out
-rw-r--r--  1 root  wheel  234731 May  5 15:40 40/ps.out
-rw-r--r--  1 root  wheel  233719 May  5 15:30 30/ps.out
-rw-r--r--  1 root  wheel  222749 May  5 15:25 25/ps.out
-rw-r--r--  1 root  wheel  231617 May  5 15:20 20/ps.out

Looking at swap usage over that period, it's obvious that something is eating memory reasonably fast:

neptune# cat 46/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216 13789464  2987752    82%
neptune# cat 35/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216 12482312  4294904    74%
neptune# cat 26/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216 12351920  4425296    74%
neptune# cat 21/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216  7807240  8969976    47%
neptune# cat 16/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216  5752832 11024384    34%
neptune# cat 55/swap.out
Device          512-blocks     Used    Avail Capacity
/dev/da0s1b       16777216  4398928 12378288    26%

But I'm not sure what to look at in the ps output to determine what is going awry here ...

The server that just hung is running 7.1-STABLE:

FreeBSD 7.1-STABLE #14: Sat Mar 28 00:05:19 ADT 2009

so I will upgrade it to the latest 7.2-RELEASE next, but ... can someone give me pointers on what else I should be checking, or what I should be looking for in the ps listings?
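
One way to narrow this down might be to turn the snapshots into numbers that can be compared between runs. The sketch below is only an illustration: the function names `swap_used_mib` and `rss_by_jail` are hypothetical, the first assumes the default `pstat -s` layout in 512-byte blocks as shown above, and the second assumes input produced with explicit `ps -o` columns (jid, pid, rss, vsz, comm) rather than the layout of the existing ps.out files.

```shell
#!/bin/sh
# Hypothetical helpers for reading monitor snapshots. Names and input
# layouts are assumptions, not part of the original monitor script.

# swap_used_mib: read `pstat -s` output on stdin and print
# "<device> <used MiB>" for each /dev entry.
# Assumes columns: Device, 512-blocks, Used, Avail, Capacity.
swap_used_mib() {
    awk 'NR > 1 && $1 ~ /^\/dev\// {
        printf "%s %.0f\n", $1, $3 * 512 / 1048576
    }'
}

# rss_by_jail: read lines of "JID PID RSS VSZ COMMAND" on stdin
# (RSS in KB) and print "<jid> <total RSS KB> <process count>",
# largest total first. Non-numeric first fields (the header) are skipped.
rss_by_jail() {
    awk '$1 ~ /^[0-9]+$/ { rss[$1] += $3; n[$1]++ }
         END { for (j in rss) printf "%s %d %d\n", j, rss[j], n[j] }' |
    sort -k2,2nr
}
```

Possible usage: `swap_used_mib < 46/swap.out`, and `ps -axo jid,pid,rss,vsz,comm | rss_by_jail | head` as an extra line in the monitor script (without -H, so that threads are not counted more than once). Applied to the figures above, 4398928 blocks in 55/swap.out is roughly 2148 MiB and 13789464 blocks in 46/swap.out is roughly 6733 MiB. If the swap.out files were written at about the same times as the matching ps.out files (15:55 and 16:47), that is about 4585 MiB in 52 minutes, or on the order of 88 MiB per minute. Comparing the per-jail totals between two snapshots should show whether a single jail accounts for that growth.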

My monitor script is currently doing:

/usr/sbin/jls > jaillist.out
/bin/ps -aucxHl -O jid > ps.out
/usr/sbin/pstat -s > swap.out
/usr/bin/vmstat 1 5 > vmstat.out
/usr/bin/awk '{print $15}' /proc/*/status | /usr/bin/sort | /usr/bin/uniq -c > vps_dist.out

Any pointers appreciated ...

Thx

----
Marc G. Fournier        Hub.Org Networking Services (http://www.hub.org)
Email . scrappy@hub.org                              MSN . scrappy@hub.org
Yahoo . yscrappy               Skype: hub.org        ICQ . 7615664