From: Maxim Sobolev
Organization: Sippy Software, Inc.
Date: Tue, 16 Feb 2010 10:11:18 -0800
To: Sergey Babkin
Cc: Alfred Perlstein, freebsd-net@FreeBSD.org, "David G. Lawrence", Jack Vogel, FreeBSD Hackers
Subject: Re: Sudden mbuf demand increase and shortage under the load (igb issue?)

OK, here is some new data that I think rules out any issues with the applications.
Following Alfred's suggestion, I made a script that runs every second and outputs some system statistics:

    date
    netstat -m
    vmstat -i
    ps -axl
    pstat -T
    vmstat -z
    sysctl -a

The problem hit us again several times today, and upon investigating the log I found that the increase in mbuf usage happened in one step, going from the normal 10% to 100% between two script runs. What is more interesting is that the timestamps of those two consecutive runs were about 2 minutes apart (instead of 1 second, as they should be), and when inspecting the cron logs I noticed the same time gap there.

I ruled out VM starvation as a cause of the delay because the system has plenty of free memory. The incoming network traffic was not sufficient to starve the VM so quickly either - it was about 7MB/sec at that time, so even if all receivers had stopped draining their buffers, it should have taken at least 1-2 seconds to fill up the mbuf cache and create demand for additional kernel memory. The failure would likely have been more gradual, and I should have seen it build up in the debug log.

So it looks like a kernel issue of some sort, which causes all userland activity to cease for 2 minutes when the system reaches a certain load. The mbuf build-up is only a by-product of this, not really the cause.

igb(4) is the primary suspect now, since we have other machines under more load that do not have this problem, and none of them use this driver. The chip is the following:

    igb0@pci0:5:0:0: class=0x020000 card=0x323f103c chip=0x10c98086 rev=0x01 hdr=0x00
        vendor     = 'Intel Corporation'
        class      = network
        subclass   = ethernet
    igb1@pci0:5:0:1: class=0x020000 card=0x323f103c chip=0x10c98086 rev=0x01 hdr=0x00
        vendor     = 'Intel Corporation'
        class      = network
        subclass   = ethernet

The hardware in question is a new HP DL160G6. I have also checked the IPMI logs and sensors and have not found any issue there either. No sensors reported out-of-range values, and the chassis temperature is within normal limits.
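For anyone who wants to check their own logs for the same kind of stall, here is a minimal sketch of how the gap between consecutive script runs could be found. It is not the script that was actually used. It assumes each run starts with a line of default `date` output (for example "Tue Feb 16 10:11:02 PST 2010"), that all samples are in the same time zone, and that month names are in the C locale. The names `sample_times`, `find_gaps` and `max_gap_seconds` are made up for illustration.

```python
import re
from datetime import datetime

# Matches default `date` output such as "Tue Feb 16 10:11:02 PST 2010".
# The weekday and time zone tokens are matched but ignored: this sketch
# assumes every sample in the log was taken in the same zone.
DATE_LINE = re.compile(
    r"^\w{3} (\w{3}) +(\d{1,2}) (\d{2}:\d{2}:\d{2}) \w+ (\d{4})$"
)


def sample_times(lines):
    """Yield a datetime for every line that looks like `date` output."""
    for line in lines:
        m = DATE_LINE.match(line.strip())
        if m:
            month, day, clock, year = m.groups()
            yield datetime.strptime(
                f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y"
            )


def find_gaps(lines, max_gap_seconds=5.0):
    """Return (previous, current, gap_seconds) for each gap above the limit."""
    gaps = []
    previous = None
    for current in sample_times(lines):
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > max_gap_seconds:
                gaps.append((previous, current, gap))
        previous = current
    return gaps
```

Calling `find_gaps` on the lines of the collected log returns one tuple per pair of consecutive runs that are further apart than the threshold. The 5 second default is an arbitrary choice that tolerates a slow run of the script itself while still flagging a stall of minutes.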
I am not sure how to debug this problem further. We are now looking into installing an external non-igb card in the server to see if that solves the issue. I have the whole log if anyone wants to take a closer look.

Regards,
-- 
Maksym Sobolyev
Sippy Software, Inc.
Internet Telephony (VoIP) Experts
T/F: +1-646-651-1110
Web: http://www.sippysoft.com
MSN: sales@sippysoft.com
Skype: SippySoft