From owner-freebsd-net@freebsd.org  Fri Jan 22 22:02:19 2016
From: Matthew Grooms <mgrooms@shrew.net>
To: freebsd-net@freebsd.org
Subject: Re: pf state disappearing [ adaptive timeout bug ]
Date: Fri, 22 Jan 2016 16:02:02 -0600
Message-ID: <56A2A6DA.1040304@shrew.net>
References: <56A003B8.9090104@shrew.net> <56A13531.8090209@shrew.net>
List-Id: Networking and TCP/IP with FreeBSD

On 1/22/2016 3:35 PM, Nick Rogers wrote:
> On Thu, Jan 21, 2016 at 11:44 AM, Matthew Grooms wrote:
>
>> # pfctl -si
>> Status: Enabled for 0 days 02:25:41           Debug: Urgent
>>
>> State Table                          Total             Rate
>>   current entries                    77759
>>   searches                       483831701        55352.0/s
>>   inserts                           825821           94.5/s
>>   removals                          748060           85.6/s
>> Counters
>>   match                           27118754         3102.5/s
>>   bad-offset                             0            0.0/s
>>   fragment                               0            0.0/s
>>   short                                  0            0.0/s
>>   normalize                              0            0.0/s
>>   memory                                 0            0.0/s
>>   bad-timestamp                          0            0.0/s
>>   congestion                             0            0.0/s
>>   ip-option                           6655            0.8/s
>>   proto-cksum                            0            0.0/s
>>   state-mismatch                         0            0.0/s
>>   state-insert                           0            0.0/s
>>   state-limit                            0            0.0/s
>>   src-limit                              0            0.0/s
>>   synproxy                               0            0.0/s
>>
>> # pfctl -st
>> tcp.first                   120s
>> tcp.opening                  30s
>> tcp.established           86400s
>> tcp.closing                 900s
>> tcp.finwait                  45s
>> tcp.closed                   90s
>> tcp.tsdiff                   30s
>> udp.first                   600s
>> udp.single                  600s
>> udp.multiple                900s
>> icmp.first                   20s
>> icmp.error                   10s
>> other.first                  60s
>> other.single                 30s
>> other.multiple               60s
>> frag                         30s
>> interval                     10s
>> adaptive.start           90000 states
>> adaptive.end            120000 states
>> src.track                     0s
>>
>> I think there may be a problem with the code that calculates adaptive
>> timeout values that is making it way too aggressive. If, by default, it
>> is supposed to decrease timeouts linearly between 60% and 120% of the
>> state table max, I shouldn't be losing TCP connections that are idle for
>> only a few minutes when the state table is < 70% full. Unfortunately,
>> that appears to be the case. At most this should have decreased the
>> 86400s timeout by 17%, to 72000s, for established TCP connections.
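[ Spelling out the arithmetic I'm assuming above, since it's the crux of my
complaint: as I read pf.conf(5), once the state count passes adaptive.start,
every timeout is scaled linearly by the factor

    (adaptive.end - states) / (adaptive.end - adaptive.start)

With the default start/end of 60% and 120% of the state limit and a table
roughly 70% full, that works out to

    factor  = (120 - 70) / (120 - 60) = 50 / 60 ~ 0.83
    timeout = 86400s * 0.83 ~ 72000s  (about 20 hours)

so even with adaptive scaling active, an idle established TCP state should
survive for something like 20 hours, not a few minutes. ]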
> That doesn't make sense to me either. Even if the math is off by a factor
> of 10, the state should live for about 24 minutes.
>
>> I've tested this for a few hours now and all my idle SSH sessions have
>> been rock solid. If anyone else is scratching their head over a problem
>> like this, I would suggest disabling the adaptive timeout feature or
>> increasing it to a much higher value. Maybe one of the pf maintainers
>> can chime in and shed some light on why this is happening. If not, I'm
>> going to file a bug report, as this certainly feels like one.
>>
> Did you go with making adaptive timeout less aggressive, or did you
> disable it entirely? I would think that if adaptive timeout were really
> that broken, more people would notice this problem, especially myself,
> since I have many servers running a very short tcp.established timeout.
> But the fact that you are noticing this kind of weirdness has me
> concerned about how the adaptive setting is affecting my environment.

I increased the value to 90K for the 10K limit. Yes, it's concerning.

Today I set up a test environment at about 1/10th the connections to see
if I could reproduce the issue on a smaller scale, but had no luck. I'm
trying to find a command-line test program that will generate enough TCP
connections to reproduce it at a scale similar to my production
environment. So far I haven't found anything that will do the trick, so I
may end up rolling my own; see the P.S. below for the sort of thing I have
in mind. I'll reply back to the list if I can find a way to reproduce this.

Thanks again,

-Matthew
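P.S. For anyone who wants to try to reproduce this, the tool I have in mind
is nothing fancy. Below is a rough, untested sketch of the sort of thing I
mean (the name conngen.c and everything in it is just illustrative): it
opens <count> TCP connections to <host>:<port> and then leaves them idle so
the states sit in the table and age out. Run it from a machine behind the
pf box against any listener on the far side, raise the open-file limit
first, and watch pfctl -si / pfctl -ss while it runs.

/*
 * conngen.c - quick sketch of an idle TCP connection generator.
 *
 * Build:  cc -o conngen conngen.c
 * Usage:  ./conngen <host> <port> <count>
 *
 * Deliberately simple: one process, blocking connects, no error
 * recovery. Raise the open-file limit (ulimit -n / kern.maxfiles)
 * before asking for a large count.
 */
#include <sys/types.h>
#include <sys/socket.h>

#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	struct addrinfo hints, *res;
	int *fds, error, i, count;

	if (argc != 4) {
		fprintf(stderr, "usage: %s host port count\n", argv[0]);
		return (1);
	}
	count = atoi(argv[3]);
	if (count <= 0) {
		fprintf(stderr, "bad count: %s\n", argv[3]);
		return (1);
	}

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;
	error = getaddrinfo(argv[1], argv[2], &hints, &res);
	if (error != 0) {
		fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(error));
		return (1);
	}

	fds = calloc(count, sizeof(int));
	if (fds == NULL) {
		perror("calloc");
		return (1);
	}

	/* Open the connections and keep the descriptors around. */
	for (i = 0; i < count; i++) {
		fds[i] = socket(res->ai_family, res->ai_socktype,
		    res->ai_protocol);
		if (fds[i] < 0) {
			perror("socket");
			break;
		}
		if (connect(fds[i], res->ai_addr, res->ai_addrlen) < 0) {
			perror("connect");
			close(fds[i]);
			break;
		}
		if ((i + 1) % 1000 == 0)
			printf("%d connections open\n", i + 1);
	}
	printf("opened %d connections, sleeping; watch the state table\n", i);

	/* Hold everything idle so the states can age out (or vanish). */
	for (;;)
		sleep(60);

	/* NOTREACHED */
}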