From owner-freebsd-stable@FreeBSD.ORG Mon Feb 15 12:47:21 2010
Message-ID: <4B793D1D.1000108@FreeBSD.org>
Date: Mon, 15 Feb 2010 04:25:01 -0800
From: Maxim Sobolev <sobomax@FreeBSD.org>
Organization: Sippy Software, Inc.
To: FreeBSD Hackers
Subject: Sudden mbuf demand increase and shortage under the load
List-Id: Production branch of FreeBSD source code

Hi,

Our company has a FreeBSD-based product that consists of numerous interconnected processes and does some high-PPS UDP processing (30-50K PPS is not uncommon). On several such systems we are seeing strange periodic failures under load, which usually manifest as IPC (even over unix domain sockets) suddenly either breaking down or stalling, and recovering only some time later (5-10 minutes).

The only sign of trouble I have managed to find is an increase in the "requests for mbufs denied" counter in netstat -m, along with the total number of mbuf clusters rising up to the nmbclusters limit. I have tried raising some network-related limits (most notably maxusers and nmbclusters; see the loader.conf sketch below), but it has not helped - the problem still hits us from time to time.

At the end of this message you can find netstat -m output taken a few minutes after one such shortage period - you can see that the system has somehow allocated a huge amount of memory to the network (700MB), with only a tiny fraction of it actually in use. This is with kern.ipc.nmbclusters=302400. Eventually the system reclaims all that memory and goes back to its normal usage of 30-70MB.

This problem is killing us, so any suggestions are greatly appreciated.

My current hypothesis is that due to some issue either in the network driver or in the network subsystem itself, the system goes insane and eats up all mbufs, right up to the nmbclusters limit. And since mbufs are shared between the network stack and local IPC, IPC goes down as well. We observe this issue on systems using both the em(4) and the igb(4) drivers. I believe both drivers share the same design, but I am not sure whether this is some kind of design flaw in the drivers or part of a larger problem in the network subsystem.

This happens on amd64 7.2-RELEASE and 7.3-PRERELEASE alike, with 8GB of memory. I have not tried upgrading to 8.0 - this is a production system, so upgrading will not be easy. I don't see any differences that would let us hope the problem will go away after an upgrade, but I can try it as a last resort.
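For completeness, the tunables we raised are set in /boot/loader.conf along these lines. The nmbclusters value is the actual one from the box below; the maxusers value here is only an example of the kind of setting we tried, not a recommendation:

    # /boot/loader.conf
    # kern.maxusers can only be set at boot time;
    # kern.ipc.nmbclusters can also be changed at runtime via sysctl(8).
    kern.maxusers="1024"            # example value only
    kern.ipc.nmbclusters="302400"   # matches the netstat -m output below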
As I said, this is a very critical issue for us, so I can provide any additional debug information upon request. We are ready to go as far as paying somebody a reasonable amount of money for tracking down and resolving the issue.

Regards,
-- 
Maksym Sobolyev
Sippy Software, Inc.
Internet Telephony (VoIP) Experts
T/F: +1-646-651-1110
Web: http://www.sippysoft.com
MSN: sales@sippysoft.com
Skype: SippySoft

[ssp-root@ds-467 /usr/src]$ netstat -m
17061/417669/434730 mbufs in use (current/cache/total)
10420/291980/302400/302400 mbuf clusters in use (current/cache/total/max)
10420/0 mbuf+clusters out of packet secondary zone in use (current/cache)
19/1262/1281/51200 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/25600 9k jumbo clusters in use (current/cache/total/max)
0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
25181K/693425K/718606K bytes allocated to network (current/cache/total)
1246681/129567494/67681640 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

[FEW MINUTES LATER]

[ssp-root@ds-467 /usr/src]$ netstat -m
10001/84574/94575 mbufs in use (current/cache/total)
6899/6931/13830/302400 mbuf clusters in use (current/cache/total/max)
6899/6267 mbuf+clusters out of packet secondary zone in use (current/cache)
2/1151/1153/51200 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/25600 9k jumbo clusters in use (current/cache/total/max)
0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
16306K/39609K/55915K bytes allocated to network (current/cache/total)
1246681/129567494/67681640 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
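If it helps anyone trying to correlate these counters with the IPC stalls, this is roughly the kind of sampling loop we can leave running to capture the allocation spike as it happens (a minimal sh sketch; the log path and the 10-second interval are arbitrary choices, not part of our setup):

    #!/bin/sh
    # Append timestamped netstat -m output plus the mbuf-related UMA
    # zone stats every 10 seconds, so the memory spike can be lined up
    # against the IPC outage window afterwards.
    LOG=/var/log/mbuf-watch.log
    while :; do
            date >> "$LOG"
            netstat -m >> "$LOG"
            vmstat -z | egrep 'ITEM|mbuf' >> "$LOG"
            echo "" >> "$LOG"
            sleep 10
    done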