From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Lawrence Stewart
Cc: Andre Oppermann, Steven Hartland, freebsd-net@freebsd.org
Date: Fri, 5 Aug 2011 10:57:43 +0400
Subject: Re: tcp failing to recover from a packet loss under 8.2-RELEASE?
Message-ID: <20110805065743.GC94016@zxy.spb.ru>
In-Reply-To: <4E3AA66A.6060605@freebsd.org>
References: <1F95A4C2D54E4F369830143CBDB5FF86@multiplay.co.uk> <4E37C0F2.4080004@freebsd.org> <2B063B6D95AA4C27B004C50D96393F91@multiplay.co.uk> <4E3AA66A.6060605@freebsd.org>

On Fri, Aug 05, 2011 at 12:02:18AM +1000, Lawrence Stewart wrote:

> > Setting net.inet.tcp.reass.maxsegments=8148 and rerunning the
> > tests appears to result in a solid 14MB/s, it's still running a
> > full soak test but looking very promising :)
>
> This is exactly the tuning required to drive high BDP links
> successfully. The unfortunate problem with my reassembly change was
> that by removing the global count of reassembly segments and using the
> uma zone to enforce the restrictions on memory use, we wouldn't
> necessarily have room for the last segment (particularly if a single
> flow has a BDP larger than the max size of the reassembly queue -
> which is the case for you and Slawa).
>
> This is bad, as Andre explained in his message, because we could stall
> connections. I hadn't even considered the idea of allocating on the
> stack as Andre has suggested in his patch, which I believe is an
> appropriate solution to the stalling problem, assuming the function
> will never return with the stack-allocated tqe still in the reassembly
> queue. My longer term goal is discussed below.
>
> > So I suppose the question is should maxsegments be larger by
> > default due to the recent changes e.g.
> > - V_tcp_reass_maxseg = nmbclusters / 16;
> > + V_tcp_reass_maxseg = nmbclusters / 8;
> >
> > or is the correct fix something more involved?
>
> I'm not sure if bumping the value is appropriate - we have always
> expected users to tune their network stack to perform well when used
> in "unusual" scenarios - a large BDP fibre path still being in the
> "unusual" category.
>
> The real fix, which is somewhere down on my todo list, is to make all
> these memory constraints elastic and respond to VM pressure, thus
> negating the need for a hard limit at all.
> This would solve many if not most of the TCP tuning problems we
> currently have with one fell swoop and would greatly reduce the need
> for tuning in many situations that currently fall into the "needs
> manual tuning" basket.

Autotuning without limits is a bad idea - it opens the door to DoS.
Perhaps this problem could be solved by preallocating a "hidden" tqe
element for the segment that arrives in order and is ready to be
delivered to the application? I.e. when we create the reassembly queue
for a TCP connection, we also allocate one queue element (with room for
the data payload) that is used only when data is ready for the
application. Allocating it as part of the queue, rather than embedding
it in struct tcpcb, avoids breaking the ABI. A rough sketch of the idea
is at the end of this mail.

> Andre and Steven, I'm a bit too sleepy to properly review your
> combined proposed changes right now and will follow up in the next
> few days instead.
>
> Cheers,
> Lawrence
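
By the way, on the maxsegments sizing question: the reassembly queue has
to be able to hold about one full window of the largest flow, i.e.
roughly BDP / MSS entries. With an RTT of 100 ms (my assumption, just to
illustrate) and the 14 MB/s Steven sees, that is

    14,000,000 B/s * 0.1 s / 1448 B  ~=  970 segments

and at 500 ms RTT it is already ~4800, so the nmbclusters/16 default can
easily be too small for a single fast flow.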
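
And here is the rough sketch I mentioned above. It only illustrates the
idea in userland C - the struct and function names are made up and this
is not the real tcp_reass.c code; in the kernel the allocations would be
uma_zalloc()/uma_zfree() on the reassembly zone and the payload would
live in an mbuf:

/*
 * Sketch: every reassembly queue carries one preallocated ("hidden")
 * entry, so the segment that fills the hole can always be queued even
 * when the regular (limited) allocator has nothing left.
 */
#include <stdlib.h>
#include <stdbool.h>
#include <sys/queue.h>

struct tseg_qent {
	LIST_ENTRY(tseg_qent) tqe_q;	/* reassembly queue linkage */
	int	tqe_len;		/* segment length */
	void	*tqe_data;		/* would be an mbuf in the kernel */
};
LIST_HEAD(tsegqe_head, tseg_qent);

struct reass_queue {
	struct tsegqe_head q;		/* per-connection reassembly queue */
	struct tseg_qent spare;		/* preallocated "hidden" element */
	bool spare_used;
};

/* Get an entry: try the limited allocator first, fall back to the spare. */
static struct tseg_qent *
tqe_get(struct reass_queue *rq)
{
	struct tseg_qent *te;

	te = malloc(sizeof(*te));	/* kernel: uma_zalloc(..., M_NOWAIT) */
	if (te == NULL && !rq->spare_used) {
		rq->spare_used = true;	/* last-resort entry, always there */
		te = &rq->spare;
	}
	return (te);
}

/* Return an entry once its data has been delivered to the application. */
static void
tqe_put(struct reass_queue *rq, struct tseg_qent *te)
{
	if (te == &rq->spare)
		rq->spare_used = false;
	else
		free(te);		/* kernel: uma_zfree() */
}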