From: Patrick Kelsey <pkelsey@gmail.com>
Date: Wed, 18 May 2016 01:35:13 -0400
Subject: Re: Abrupt reset sent instead of retransmitting a lost packet
To: hiren panchasara
Cc: transport@freebsd.org, Gleb Smirnoff, Lawrence Stewart, killing@multiplay.co.uk
List-Id: Discussions of transport level network protocols in FreeBSD

On Wed, May 18, 2016 at 1:09 AM, Patrick Kelsey wrote:
>
> On Tue, May 17, 2016 at 7:36 PM, hiren panchasara
> <hiren@strugglingcoder.info> wrote:
>
>> On 05/13/16 at 10:36P, hiren panchasara wrote:
>> > https://people.freebsd.org/~hiren/pcaps/tcp_weird_reset.txt
>> > Something we saw in the wild on 10.2ish systems (server and client
>> > both).
>> >
>> > The most interesting thing can be seen at the end of the file.
>> >
>> > 3298737767:3298739215 gets lost, and the client tells us about it via a
>> > bunch of dupacks with SACK info. It SACKs all the outstanding data but
>> > this one missing packet. We (the server) never retransmit that missing
>> > packet but rather decide to send a reset after 0.312582ms, which
>> > somehow causes the client to pause for 75 secs. (That might be another
>> > issue and is not particularly important for this discussion.)
>> >
>> > What could cause this behavior of sending a reset instead of
>> > retransmitting a lost packet?
>>
>> Turns out I am finding a lot of "discarded due to memory problems" in
>> 'netstat -sp tcp', and net.inet.tcp.reass.overflows is also rapidly
>> increasing.
>>
>> This is happening in a very low-RTT environment (in the range of
>> 0.20ms) at about 1G of b/w.
>>
>> So it seems the following code is where the reass queue is overflowing
>> (I've also confirmed with tcp debug that I am seeing this message):
>>
>> In tcp_reass():
>>
>>         if ((th->th_seq != tp->rcv_nxt ||
>>             !TCPS_HAVEESTABLISHED(tp->t_state)) &&
>>             tp->t_segqlen >= (so->so_rcv.sb_hiwat / tp->t_maxseg) + 1) {
>>                 V_tcp_reass_overflows++;
>>                 TCPSTAT_INC(tcps_rcvmemdrop);
>>                 m_freem(m);
>>                 *tlenp = 0;
>>                 if ((s = tcp_log_addrs(&tp->t_inpcb->inp_inc, th, NULL,
>>                     NULL))) {
>>                         log(LOG_DEBUG, "%s; %s: queue limit reached, "
>>                             "segment dropped\n", s, __func__);
>>                         free(s, M_TCPLOG);
>>                 }
>>                 return (0);
>>         }
>>
>> I know this is somewhat older (stable/10) code, but I think the problem
>> still remains.
>>
>> This is the gist of the issue:
>> tp->t_segqlen >= (so->so_rcv.sb_hiwat / tp->t_maxseg) + 1 evaluates
>> to true, which makes us drop packets on the floor.
>>
>> I've tried to restore default behavior with:
>> net.inet.tcp.recvbuf_max: 131072
>> net.inet.tcp.recvbuf_inc: 16384
>> net.inet.tcp.sendbuf_max: 131072
>> net.inet.tcp.sendbuf_inc: 16384
>> net.inet.tcp.sendbuf_auto: 1
>> net.inet.tcp.sendspace: 65536
>> net.inet.tcp.recvspace: 65536
>>
>> net.inet.tcp.reass.overflows: 156440623
>> net.inet.tcp.reass.cursegments: 91
>> net.inet.tcp.reass.maxsegments: 557900
>>
>> And the app is *not* setting SO_SNDBUF or SO_RCVBUF, to keep SB_AUTOSIZE
>> in effect.
>>
>> I was assuming the usual auto-sizing would kick in and do the right
>> thing so that we don't run into this issue, but something is amiss.
>>
>> I am seeing a bunch of connections to inter-colo hosts with high Recv-Q
>> (close to recvbuf_max) in 'netstat -an'.
>>
>> I found an old issue which seems similar:
>> https://lists.freebsd.org/pipermail/freebsd-net/2011-August/029491.html
>>
>> I am cc'ing a few folks who've touched this code or may have some idea.
>> Any help is appreciated.
>>
>
> So far, I mainly just have an initial comment regarding the expectation
> that receive buffer auto-sizing would kick in. For an increment to
> happen, allegedly (I haven't fully audited this code path yet), greater
> than 7/8 of the current receive buffer size has to be received in one
> RTT.
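
[For reference, the increment check described above is the receive-buffer
auto-scaling block in tcp_input(). Paraphrasing from the stable/10 code,
so treat this as a sketch of the logic rather than the exact source:]

        if (V_tcp_do_autorcvbuf &&
            (to.to_flags & TOF_TS) && to.to_tsecr &&
            (so->so_rcv.sb_flags & SB_AUTOSIZE)) {
                if (TSTMP_GT(to.to_tsecr, tp->rfbuf_ts) &&
                    to.to_tsecr - tp->rfbuf_ts < hz) {
                        /*
                         * The ecr has moved past the base timestamp, so
                         * one "RTT" as this code measures it has elapsed;
                         * check whether more than 7/8 of the current
                         * buffer was counted during that interval.
                         */
                        if (tp->rfbuf_cnt > (so->so_rcv.sb_hiwat / 8 * 7) &&
                            so->so_rcv.sb_hiwat < V_tcp_autorcvbuf_max) {
                                newsize = min(so->so_rcv.sb_hiwat +
                                    V_tcp_autorcvbuf_inc,
                                    V_tcp_autorcvbuf_max);
                        }
                        /* Start over with the next RTT. */
                        tp->rfbuf_ts = 0;
                        tp->rfbuf_cnt = 0;
                } else
                        tp->rfbuf_cnt += tlen;  /* bytes in this "RTT" */
        }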
> Given your stated bw of ~1G (assuming that means 1 Gbps, on a single
> connection) and an RTT of 0.2ms, you have ~200 kbits per RTT. When the
> receive buffer is at the initial 64KiB, the threshold for the first
> increment is ~459 kbits within one RTT. To get that first increment, the
> traffic would have to be sufficiently bursty.
>
> In the trace you sent, I'm seeing at most 23 or so segments of length
> 1448 with the same ecr, which is a peak of around ~250 kbits per RTT.
>

Here is the point where I betrayed that I had read the scaling code a bit
more closely, but was still thinking mostly in terms of what the comments
said was going on. I had gone and looked at the trace file for how many
segments are transmitted with the same ecr and called that the RTT,
because that is what the scaling code does - it evaluates the scaling
logic when there is an increment in the ecr above the base timestamp -
but I didn't fully account for the fact that this new reference frame
affects the math.

Looking for a change in the ecr value in order to trigger the 'RTT-based'
scaling logic effectively puts a floor of about one timer tick on the time
interval over which the scaling code counts bytes. So when the real RTT is
less than a timer tick or so, the scaling code is counting bytes roughly
over a timer tick (plus the smaller RTT), and when the real RTT is greater
than a timer tick or so, the scaling code is counting bytes over an
interval closer to the real RTT.

In this case, the timer tick is probably 1ms, so the scaling code is not
counting bytes over the real RTT of 0.2ms, it's counting them over
something like 1.2ms. That means at 1 Gbps, there will be 1.2 Mbits per
'scaling-logic-RTT', and we would expect to see that initial receive
buffer increment (as well as a number of others after that one).
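
To make those numbers concrete, here is a small standalone program (a
back-of-the-envelope sketch, not kernel code; the 1 Gbps rate, 0.2ms RTT,
1ms tick, and 1448-byte segment size are the values assumed in this
thread):

        #include <stdio.h>

        int
        main(void)
        {
                double bw_bps = 1e9;    /* assumed: 1 Gbps, one connection */
                double rtt_s = 0.0002;  /* real RTT: 0.2 ms */
                double tick_s = 0.001;  /* assumed timestamp tick: 1 ms */
                int sb_hiwat = 65536;   /* initial receive buffer, bytes */
                int maxseg = 1448;      /* segment size seen in the trace */

                /* Segment limit enforced by the tcp_reass() check above. */
                printf("reass queue limit: %d segments\n",
                    sb_hiwat / maxseg + 1);

                /* 7/8-of-buffer threshold for an increment, in bits. */
                printf("increment threshold: %d bits\n",
                    sb_hiwat / 8 * 7 * 8);

                /* Bytes counted per real RTT vs. per tick-floored "RTT". */
                printf("bits per real RTT: %.0f\n", bw_bps * rtt_s);
                printf("bits per scaling-logic RTT: %.0f\n",
                    bw_bps * (tick_s + rtt_s));
                return (0);
        }

It prints a reass queue limit of 46 segments, an increment threshold of
458752 bits (~459 kbits), 200000 bits per real RTT, and 1200000 bits per
scaling-logic RTT, which lines up with the numbers above.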