From owner-freebsd-stable@FreeBSD.ORG  Wed May 13 17:44:56 2009
Date: Wed, 13 May 2009 14:44:55 -0300 (ADT)
From: "Marc G. Fournier" <scrappy@hub.org>
To: John Baldwin <jhb@freebsd.org>
In-Reply-To: <200905131252.15171.jhb@freebsd.org>
Message-ID: <20090513142806.V17646@hub.org>
References: <20090513040719.D17646@hub.org>
	<200905131009.00403.jhb@freebsd.org>
	<20090513133143.M17646@hub.org> <200905131252.15171.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-stable@freebsd.org
Subject: Re: More data on 7.2-RELEASE "hangs"
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>

On Wed, 13 May 2009, John Baldwin wrote:

> Well, you had a whole lot of page faults and other VM activity, plus 500k
> syscalls.  The 'w' is a count of swapped processes, so basically your box is
> swapping a whole lot it seems.  I think your box is just overloaded.

I knew I was going to regret posting that :(

What I posted was what vmstat 5 shows after the issue *starts*, not what
it normally looks like. Right now, after 10 hours of uptime and with all
the same processes running, it looks like:

io# vmstat 5 (10 hours uptime now)
  procs      memory      page                    disks     faults         cpu
  r b w     avm    fre   flt  re  pi  po    fr  sr da0 pa0   in   sy   cs us sy id
  0 1 0  10477M   301M  3503  13   1   2  3620 286   0   0  331 45491 4566 26  8 66
  0 1 0  10430M   305M   278   7   0   0   550   0  18   0  186 19243 2917 4  3 93
  1 1 0  10474M   295M   511   0   0   0   359   0  91   0  253 11632 3516 7  3 90
  0 1 0  10447M   310M   819   3   0   0  1473   0  14   0  143 29575 2486 8  3 89
  0 1 0  10558M   295M  5008  18  13   5  4128   0 121   0  345 24212 4215 16  7 77

Right now, io is running ~775 processes; at the time of the vmstat I
provided earlier, it was up to 1400 processes. Since there are only 5
minutes between script runs, something is causing it to go from zero swap
to high swap within a very short period of time, but since things get
badly locked up when it happens, I can't isolate where.
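
One way to narrow down the moment it tips over would be to log vmstat at a
shorter interval and flag the first samples that show swap pressure. A
minimal sketch, assuming the 19-column layout shown in the vmstat output
above; the function names and both thresholds are made up for illustration
and would need tuning:

```python
# Column names follow the vmstat header shown above. The header has two
# "sy" columns, so the CPU one is renamed cpu_sy to keep names unique.
COLUMNS = ["r", "b", "w", "avm", "fre", "flt", "re", "pi", "po", "fr", "sr",
           "da0", "pa0", "in", "sy", "cs", "us", "cpu_sy", "id"]

# avm and fre carry a unit suffix (e.g. 10477M), so they stay as strings.
TEXT_COLUMNS = {"avm", "fre"}


def parse_vmstat_line(line):
    """Return a dict for one vmstat data row, or None for anything else."""
    fields = line.split()
    if len(fields) != len(COLUMNS) or not fields[0].isdigit():
        return None  # header line, blank line, or columns that ran together
    row = {}
    for name, value in zip(COLUMNS, fields):
        if name in TEXT_COLUMNS:
            row[name] = value
            continue
        try:
            row[name] = int(value)
        except ValueError:
            return None
    return row


def paging_samples(lines, max_swapped=0, max_page_io=50):
    """Yield (line_number, row) for samples that look like swap pressure.

    max_swapped and max_page_io are hypothetical thresholds: a sample is
    flagged when 'w' exceeds the first or 'pi' + 'po' exceeds the second.
    """
    for number, line in enumerate(lines, start=1):
        row = parse_vmstat_line(line)
        if row is None:
            continue
        if row["w"] > max_swapped or row["pi"] + row["po"] > max_page_io:
            yield number, row
```

Fed the lines of a log written by something like `vmstat 1`, the first
flagged line number gives a rough timestamp to compare against the ps
snapshots.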

I've got the following two ps outputs at the time of the high paging:

/bin/ps -aucxHl -O jid > ps-long.out
/bin/ps -aux -O jid > ps-short.out

Is there anything in there that I could look at to see what is putting
things over the edge?
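
As a starting point for reading ps-short.out, the resident memory could be
summed per jail to see which one is growing. A rough sketch, assuming the
file has a header row containing JID and RSS columns and that no column
before COMMAND contains spaces; the function name is hypothetical and the
exact column order is read from the header rather than hard-coded:

```python
from collections import defaultdict


def rss_by_jail(lines):
    """Sum RSS per jail ID from ps output that starts with a header row.

    Returns a list of (jid, process_count, total_rss) tuples, largest
    total first. RSS is left in whatever unit ps printed it in.
    Raises ValueError if the header has no JID or RSS column.
    """
    lines = iter(lines)
    header = next(lines, "").split()
    jid_col = header.index("JID")
    rss_col = header.index("RSS")

    totals = defaultdict(lambda: [0, 0])  # jid -> [process count, rss sum]
    for line in lines:
        # Split into at most len(header) fields so that a COMMAND with
        # spaces in it stays in one piece at the end of the row.
        fields = line.split(None, len(header) - 1)
        if len(fields) <= max(jid_col, rss_col):
            continue  # short or blank line
        try:
            rss = int(fields[rss_col])
        except ValueError:
            continue  # repeated header or malformed row
        entry = totals[fields[jid_col]]
        entry[0] += 1
        entry[1] += rss

    rows = [(jid, count, rss) for jid, (count, rss) in totals.items()]
    return sorted(rows, key=lambda item: item[2], reverse=True)
```

Running it over `open("ps-short.out")` from a normal period and from a
high-paging period, then comparing the two lists, would show whether one
jail accounts for the jump or whether it is spread across all of them.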

====

As to the 'overloaded server', here is another server, with more running
on it, but the exact same configuration:

neptune# vmstat 5 (3 days, 18 hours uptime now)
  procs      memory      page                    disks     faults         cpu
  r b w     avm    fre   flt  re  pi  po    fr  sr da0 pa0   in   sy   cs us sy id
  0 0 0  12521M   303M  3969  15   5   3  2271 1603   0   0  444 6491 5165 37 19 44
  0 0 0  12464M   309M  3009   1   0  15  2833   0 104   0  296 9378 3689  7  5 88
 23 0 0  12476M   297M  3845   3   0   0  2627   0  31   0  279 10545 2986 14  5 81
  0 1 0  12530M   266M  5259   0   1   0  2551   0 145   0  432 18070 4133 45  8 47
  1 0 0  12587M   237M  7049   0   1   0  4484   0 171   0  357 15953 4715 29  7 64

So, normally these servers purr ... and are highly responsive ...

In fact, here is an older 32-bit server, with less RAM, running about 50%
more processes than neptune:

mercury# vmstat 5
  procs      memory      page                    disks     faults         cpu
  r b w     avm    fre  flt  re  pi  po  fr  sr da0 pa0   in   sy  cs us sy id
  3 14 1   6817M   114M  641   7   3   1 1036 386   0   0 1109  464 157  5  5 90
  0 8 0   6817M   224M  596  33   0   5 5667 3850  86   0 1303 5768 3885  6 7 87
  1 10 0   6824M   220M 4332  32   2   0 3228   0  17   0  755 9689 3057  8 7 85
  0 9 0   6798M   219M  430   0   0   0 712   0  12   0 1274 4276 3877  2  2 95
  0 11 0   6830M   205M 1026   4   1   3 481   0  84   0 1503 5586 4370  6 4 89


----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email . scrappy@hub.org                              MSN . scrappy@hub.org
Yahoo . yscrappy               Skype: hub.org        ICQ . 7615664