From owner-freebsd-hackers@FreeBSD.ORG  Tue Feb 16 18:11:06 2010
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3D53A1065698;
	Tue, 16 Feb 2010 18:11:06 +0000 (UTC)
	(envelope-from sobomax@FreeBSD.org)
Received: from sippysoft.com (gk1.360sip.com [72.236.70.240])
	by mx1.freebsd.org (Postfix) with ESMTP id B1C918FC13;
	Tue, 16 Feb 2010 18:11:05 +0000 (UTC)
Received: from [192.168.1.38] ([70.71.167.197]) (authenticated bits=0)
	by sippysoft.com (8.14.3/8.14.3) with ESMTP id o1GIB2Zj022403
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 16 Feb 2010 10:11:02 -0800 (PST)
	(envelope-from sobomax@FreeBSD.org)
Message-ID: <4B7ADFC6.7020202@FreeBSD.org>
Date: Tue, 16 Feb 2010 10:11:18 -0800
From: Maxim Sobolev <sobomax@FreeBSD.org>
Organization: Sippy Software, Inc.
User-Agent: Thunderbird 2.0.0.23 (Windows/20090812)
MIME-Version: 1.0
To: Sergey Babkin <babkin@verizon.net>
References: <4B79297D.9080403@FreeBSD.org> <4B79205B.619A0A1A@verizon.net>
In-Reply-To: <4B79205B.619A0A1A@verizon.net>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Alfred Perlstein <alfred@FreeBSD.org>, freebsd-net@FreeBSD.org,
	"David G. Lawrence" <dg@dglawrence.com>, Jack Vogel <jfvogel@gmail.com>,
	FreeBSD Hackers <freebsd-hackers@FreeBSD.org>
Subject: Re: Sudden mbuf demand increase and shortage under the load (igb
 issue?)
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 16 Feb 2010 18:11:06 -0000

OK, here is some new data that I think rules out any issues with the
applications. Following Alfred's suggestion, I wrote a script that runs
every second and outputs some system statistics:

date
netstat -m
vmstat -i
ps -axl
pstat -T
vmstat -z
sysctl -a
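
For anyone who wants to collect the same data, below is a minimal sketch
of such a sampler. It is not the script that was actually used: the
function names, the SAMPLE_CMDS variable and the epoch-timestamp header
are all assumptions made for illustration. Only the command list comes
from this message.

```shell
#!/bin/sh
# Sketch of a once-per-second statistics sampler (all names hypothetical).
# SAMPLE_CMDS holds one command per line; the default is the list above.
: "${SAMPLE_CMDS:=date
netstat -m
vmstat -i
ps -axl
pstat -T
vmstat -z
sysctl -a}"

# Run every command once. Each sample starts with an epoch timestamp
# (an assumption, added so that gaps between samples are easy to find).
take_sample() {
    echo "=== sample $(date +%s) ==="
    printf '%s\n' "$SAMPLE_CMDS" | while IFS= read -r cmd; do
        [ -n "$cmd" ] || continue
        echo "--- $cmd"
        # stdin is redirected so a command cannot eat the command list
        sh -c "$cmd" </dev/null 2>&1
    done
}

# Take COUNT samples, sleeping INTERVAL seconds (default 1) between them.
sample_loop() {
    count=$1
    interval=${2:-1}
    i=0
    while [ "$i" -lt "$count" ]; do
        take_sample
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Something like "sample_loop 3600 1 >> stats.log" would then record about
an hour of samples (the log file name is just an example).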

The problem hit us again several times today, and upon investigating
the log I found that the increase in mbuf usage happened in one step,
going from the normal 10% to 100% between two script runs. What is more
interesting is that those two consecutive runs were about 2 minutes
apart (instead of 1 second, as they should be), and when inspecting the
cron logs I noticed the same time gap there. I ruled out VM starvation
as the cause of the delay because the system has plenty of free memory.
The incoming network traffic was not sufficient to starve the VM so
quickly either: it was about 7MB/sec at that time, so even if all
receivers had stopped draining their buffers it should have taken at
least 1-2 seconds to fill up the mbuf cache and create demand for
additional kernel memory. The failure would likely have been more
gradual, and I should have seen it build up in the debug log.
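
A gap like this can be found mechanically rather than by reading the log.
The sketch below is only an illustration: it assumes the timestamps have
already been extracted from the log as epoch seconds (for example from
"date +%s"), one per line, and the find_gaps name is made up.

```shell
# find_gaps THRESHOLD: read one epoch timestamp per line on stdin and
# report every pair of consecutive samples more than THRESHOLD seconds
# apart. Samples are assumed to be in chronological order.
find_gaps() {
    awk -v max="$1" '
        NR > 1 && $1 - prev > max {
            printf "gap of %d s between %d and %d\n", $1 - prev, prev, $1
        }
        { prev = $1 }
    '
}
```

With samples taken every second, a threshold of a few seconds (say,
"find_gaps 5 < timestamps.txt") should be enough to flag a 2-minute
stall while ignoring ordinary scheduling jitter.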

So it looks like a kernel issue of some sort, which causes all userland
activity to cease for 2 minutes when the system reaches a certain load.
The mbuf build-up is only a by-product of this, not really a cause.
igb(4) is the primary suspect now, since we have other machines under
more load that do not have this problem, and none of them use this
driver. The chip is the following:

igb0@pci0:5:0:0:        class=0x020000 card=0x323f103c chip=0x10c98086 rev=0x01 hdr=0x00
     vendor     = 'Intel Corporation'
     class      = network
     subclass   = ethernet
igb1@pci0:5:0:1:        class=0x020000 card=0x323f103c chip=0x10c98086 rev=0x01 hdr=0x00
     vendor     = 'Intel Corporation'
     class      = network
     subclass   = ethernet

The hardware in question is a new HP DL160G6. I have also checked the
IPMI logs and sensors and have not found any issues there either. No
sensors reported out-of-range values, and the chassis temperature is
within normal limits.

I am not sure how to debug this problem further. We are now looking
into installing an external non-igb card in the server to see if that
solves the issue.

I have the whole log if anyone wants to take a closer peek.

Regards,
-- 
Maksym Sobolyev
Sippy Software, Inc.
Internet Telephony (VoIP) Experts
T/F: +1-646-651-1110
Web: http://www.sippysoft.com
MSN: sales@sippysoft.com
Skype: SippySoft