From: Maxim Sobolev
Date: Mon, 15 Feb 2010 22:19:33 -0800
To: Sergey Babkin
Cc: Alfred Perlstein, freebsd-net@FreeBSD.org, "David G. Lawrence",
    Jack Vogel, FreeBSD Hackers
Subject: Re: Sudden mbuf demand increase and shortage under the load
Message-ID: <4B7A38F5.3090404@FreeBSD.org>

Sergey Babkin wrote:
> Maxim Sobolev wrote:
>> Hi,
>>
>> Our company have a FreeBSD based product that consists of the numerous
>> interconnected processes and it does some high-PPS UDP processing
>> (30-50K PPS is not uncommon).
>> We are seeing some strange periodic
>> failures under the load in several such systems, which usually evidences
>> itself in IPC (even through unix domain sockets) suddenly either
>> breaking down or pausing and restoring only some time later (like 5-10
>> minutes). The only sign of failure I managed to find was the increase of
>> the "requests for mbufs denied" in the netstat -m and number of total
>> mbuf clusters (nmbclusters) raising up to the limit.
>
> As a simple idea: UDP is not flow-controlled. So potentially
> nothing stops an application from sending the packets as fast
> as it can. If it's faster than the network card can process,
> they would start collecting. So this might be worth a try
> as a way to reproduce the problem and see if the system has
> a safeguard against it or not.
>
> Another possibility: what happens if a process is bound to
> an UDP socket but doesn't actually read the data from it?
> FreeBSD used to be pretty good at it, just throwing away
> the data beyond a certain limit, SVR4 was running out of
> network memory. But it might have changed, so might be
> worth a look too.

Thanks. Yes, the latter could actually be the case. The former is less
likely, since the system doesn't generate much traffic by itself but
rather relays what it receives from the network at pretty much a 1:1
ratio. It could happen, though, if the output path somehow stalled.
However, netstat -I igb0 shows zero Oerrs, which I guess means that we
can rule that out too, unless there is some bug in the driver.

So we are looking for potential issues that can cause a UDP forwarding
application to stall and not dequeue packets in time. So far we have
identified some culprits in the application logic that can cause such
stalls in the unlikely event of gettimeofday() time going backwards.
I've seen some messages from ntpd around the time of the problem,
although it's unclear whether those are a result of the mbuf shortage
or could indicate the root issue.
We've also added some debug output to catch any abnormalities in the
processing times.

In any case, I am a little surprised at how easily FreeBSD lets mbuf
storage overflow. I'd expect it to be more aggressive in dropping
things received from the network once one application stalls. Combined
with the fact that we apparently use shared storage for different kinds
of network activity and perhaps IPC too, this gives an easy opportunity
for DoS attacks. To me, separate limits for separate protocols or even
classes of traffic (e.g. local/remote) would make a lot of sense.

Thanks to everybody for the useful tips and suggestions. I will do more
research along these lines and let you know once we either resolve the
case or have more diagnostic output.

Regards,
-- 
Maksym Sobolyev
Sippy Software, Inc.
Internet Telephony (VoIP) Experts
T/F: +1-646-651-1110
Web: http://www.sippysoft.com
MSN: sales@sippysoft.com
Skype: SippySoft