From owner-freebsd-net@freebsd.org  Fri Jan 22 22:02:19 2016
From: Matthew Grooms <mgrooms@shrew.net>
To: freebsd-net@freebsd.org
Subject: Re: pf state disappearing [ adaptive timeout bug ]
Date: Fri, 22 Jan 2016 16:02:02 -0600
Message-ID: <56A2A6DA.1040304@shrew.net>
References: <56A003B8.9090104@shrew.net> <56A13531.8090209@shrew.net>
List-Id: Networking and TCP/IP with FreeBSD

On 1/22/2016 3:35 PM, Nick Rogers wrote:
> On Thu, Jan 21, 2016 at 11:44 AM, Matthew Grooms wrote:
>
>> # pfctl -si
>> Status: Enabled for 0 days 02:25:41           Debug: Urgent
>>
>> State Table                          Total             Rate
>>   current entries                    77759
>>   searches                       483831701        55352.0/s
>>   inserts                           825821           94.5/s
>>   removals                          748060           85.6/s
>> Counters
>>   match                           27118754         3102.5/s
>>   bad-offset                             0            0.0/s
>>   fragment                               0            0.0/s
>>   short                                  0            0.0/s
>>   normalize                              0            0.0/s
>>   memory                                 0            0.0/s
>>   bad-timestamp                          0            0.0/s
>>   congestion                             0            0.0/s
>>   ip-option                           6655            0.8/s
>>   proto-cksum                            0            0.0/s
>>   state-mismatch                         0            0.0/s
>>   state-insert                           0            0.0/s
>>   state-limit                            0            0.0/s
>>   src-limit                              0            0.0/s
>>   synproxy                               0            0.0/s
>>
>> # pfctl -st
>> tcp.first                   120s
>> tcp.opening                  30s
>> tcp.established           86400s
>> tcp.closing                 900s
>> tcp.finwait                  45s
>> tcp.closed                   90s
>> tcp.tsdiff                   30s
>> udp.first                   600s
>> udp.single                  600s
>> udp.multiple                900s
>> icmp.first                   20s
>> icmp.error                   10s
>> other.first                  60s
>> other.single                 30s
>> other.multiple               60s
>> frag                         30s
>> interval                     10s
>> adaptive.start           90000 states
>> adaptive.end            120000 states
>> src.track                     0s
>>
>> I think there may be a problem with the code that calculates adaptive
>> timeout values that is making it way too aggressive. If, by default, it
>> is supposed to decrease timeouts linearly between 60% and 120% of the
>> state table max, I shouldn't be losing TCP connections that are idle for
>> only a few minutes when the state table is < 70% full. Unfortunately,
>> that appears to be the case. At most this should have decreased the
>> 86400s timeout by 17%, to 72000s, for established TCP connections.
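[ Spelling out the arithmetic I'm assuming above, since it's the crux of my
complaint: as I read pf.conf(5), once the state count passes adaptive.start,
every timeout is scaled linearly by the factor

    (adaptive.end - states) / (adaptive.end - adaptive.start)

With the default start/end of 60% and 120% of the state limit and a table
roughly 70% full, that works out to

    factor  = (120 - 70) / (120 - 60) = 50 / 60 ~ 0.83
    timeout = 86400s * 0.83 ~ 72000s  (about 20 hours)

so even with adaptive scaling active, an idle established TCP state should
survive for something like 20 hours, not a few minutes. ]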
> That doesn't make sense to me either. Even if the math is off by a factor
> of 10, the state should live for about 24 minutes.
>
>> I've tested this for a few hours now and all my idle SSH sessions have
>> been rock solid. If anyone else is scratching their head over a problem
>> like this, I would suggest disabling the adaptive timeout feature or
>> increasing it to a much higher value. Maybe one of the pf maintainers
>> can chime in and shed some light on why this is happening. If not, I'm
>> going to file a bug report, as this certainly feels like one.
>>
> Did you go with making adaptive timeout less aggressive, or did you
> disable it entirely? I would think that if adaptive timeout were really
> that broken, more people would notice this problem, especially myself,
> since I have many servers running a very short tcp.established timeout.
> But the fact that you are noticing this kind of weirdness has me
> concerned about how the adaptive setting is affecting my environment.

I increased the value to 90K for the 10K limit. Yes, it's concerning.

Today I set up a test environment at about 1/10th the connections to see
if I could reproduce the issue on a smaller scale, but had no luck. I'm
trying to find a command-line test program that will generate enough TCP
connections to reproduce it at a scale similar to my production
environment. So far I haven't found anything that will do the trick, so I
may end up rolling my own; see the P.S. below for the sort of thing I have
in mind. I'll reply back to the list if I can find a way to reproduce this.

Thanks again,

-Matthew
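P.S. For anyone who wants to try to reproduce this, the tool I have in mind
is nothing fancy. Below is a rough, untested sketch of the sort of thing I
mean (the name conngen.c and everything in it is just illustrative): it
opens <count> TCP connections to <host>:<port> and then leaves them idle so
the states sit in the table and age out. Run it from a machine behind the
pf box against any listener on the far side, raise the open-file limit
first, and watch pfctl -si / pfctl -ss while it runs.

/*
 * conngen.c - quick sketch of an idle TCP connection generator.
 *
 * Build:  cc -o conngen conngen.c
 * Usage:  ./conngen <host> <port> <count>
 *
 * Deliberately simple: one process, blocking connects, no error
 * recovery. Raise the open-file limit (ulimit -n / kern.maxfiles)
 * before asking for a large count.
 */
#include <sys/types.h>
#include <sys/socket.h>

#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	struct addrinfo hints, *res;
	int *fds, error, i, count;

	if (argc != 4) {
		fprintf(stderr, "usage: %s host port count\n", argv[0]);
		return (1);
	}
	count = atoi(argv[3]);
	if (count <= 0) {
		fprintf(stderr, "bad count: %s\n", argv[3]);
		return (1);
	}

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;
	error = getaddrinfo(argv[1], argv[2], &hints, &res);
	if (error != 0) {
		fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(error));
		return (1);
	}

	fds = calloc(count, sizeof(int));
	if (fds == NULL) {
		perror("calloc");
		return (1);
	}

	/* Open the connections and keep the descriptors around. */
	for (i = 0; i < count; i++) {
		fds[i] = socket(res->ai_family, res->ai_socktype,
		    res->ai_protocol);
		if (fds[i] < 0) {
			perror("socket");
			break;
		}
		if (connect(fds[i], res->ai_addr, res->ai_addrlen) < 0) {
			perror("connect");
			close(fds[i]);
			break;
		}
		if ((i + 1) % 1000 == 0)
			printf("%d connections open\n", i + 1);
	}
	printf("opened %d connections, sleeping; watch the state table\n", i);

	/* Hold everything idle so the states can age out (or vanish). */
	for (;;)
		sleep(60);

	/* NOTREACHED */
}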