Content-Type: text/plain; charset=iso-8859-1
Mime-Version: 1.0 (Mac OS X Mail 6.6 \(1510\))
Subject: Re: Long-haul problems - connections stuck in slow start
From: Michael Tuexen <Michael.Tuexen@lurchi.franken.de>
In-Reply-To: <52C85537.7080307@wemm.org>
Date: Sat, 4 Jan 2014 22:40:59 +0100
Content-Transfer-Encoding: quoted-printable
Message-Id: <90E0038B-7ED8-49B8-A947-86F8F33438D9@lurchi.franken.de>
References: <52C85537.7080307@wemm.org>
To: Peter Wemm <peter@wemm.org>
X-Mailer: Apple Mail (2.1510)
Cc: freebsd-net@freebsd.org, Gavin Atkinson <gavin@FreeBSD.org>,
 andre@freebsd.org, Peter Wemm <peter@freebsd.org>

On Jan 4, 2014, at 7:38 PM, Peter Wemm <peter@wemm.org> wrote:

> We're seeing some unfortunate misbehavior with tcp over an intercontinental
> link.
>
> eg: fetching a 30GB http file from various package mirrors by a remote:
> us-west(ISC) -> london(BME)
> bd93e71c-cae4-44fd-943c-d1a88dbf6c6d.tar  0% of   29 GB  961 kBps 09h03m^C
> us-east(NYI) -> london(BME)
> bd93e71c-cae4-44fd-943c-d1a88dbf6c6d.tar  0% of   29 GB 1070 kBps 08h08m^C
> us-west(YSV) -> london(BME)
> bd93e71c-cae4-44fd-943c-d1a88dbf6c6d.tar  0% of   29 GB   14 kBps 590h22m^C
>
> Spot the one we're concerned about...
>
> Ping times for the three (in order):
> round-trip min/avg/max/std-dev = 144.330/144.532/144.797/0.157 ms
> round-trip min/avg/max/std-dev = 79.650/79.965/80.488/0.287 ms
> round-trip min/avg/max/std-dev = 148.588/153.292/155.688/2.903 ms
>
> The problem pair is worth showing some detail on:
> 16 bytes from ..:206a::1001:10, icmp_seq=4 hlim=55 time=148.588 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=5 hlim=55 time=155.140 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=6 hlim=55 time=149.443 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=7 hlim=55 time=155.688 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=8 hlim=55 time=148.630 ms
> 16 bytes from ..:206a::1001:10, icmp_seq=9 hlim=55 time=155.486 ms
> It appears that there are two packet paths between the endpoints that have
> either ~148ms or ~155ms.  I've done some longer samples and they're fairly
> consistent clusters.
>
> All four machines talk to each other.
>
> Here's where it gets interesting.  On the sender at us-west(YSV), I see this:
> net.inet.tcp.hostcache.list:
> IP address    SSTRESH    RTT RTTVAR     CWND HITS
> us-west(ISC)    59521    5ms    1ms    16845 15055031
> eu-west(BME)     7343  150ms    2ms    13501 3433775
> us-east(NYI)   530489  100ms   37ms    16681 43043786
>
> The ssthresh is very low for the problematic ysv<->bme pair.
>
> When I do a tcpdump, I see the sender fire off 7343 bytes of data, then stop
> and wait for acks.  It's completely ignoring the receiver's window state.
> It appears stuck in slowstart mode.
>
> Some other data:
> Proto Recv-Q Send-Q Local Address  Foreign Address        (state)
> tcp6       0 1047852 2001:19:2.443   2001:41c8:.24490 ESTABLISHED
>
> (netstat -x, sorry about the wrap)
> Proto Recv-Q Send-Q Local Address          Foreign Address        R-MBUF
> S-MBUF R-CLUS S-CLUS R-HIWA S-HIWA R-LOWA S-LOWA R-BCNT S-BCNT R-BMAX S-BMAX
>  rexmt persist    keep    2msl  delack rcvtime
> tcp6       0 1048152 2001:1900:2254:2.443   2001:41c8:112:83.24490      0
> 374      0    373  65688 1049580      1   2048      0 1420800 525504
> 8396640    0.43    0.00 7199.93    0.00    0.00    0.06
>
> The "interesting" parts of -x:
> rexmt persist    keep    2msl  delack rcvtime
> 0.43    0.00 7199.93    0.00    0.00    0.06
>
> -T
> Proto Rexmit OOORcv 0-win  Local Address   Foreign Address
> tcp6   54161      0      0 2001:192.443   2001:41:83.24490
> note retransmits(!)
>
> Some tcpcb fields that caught my eye for the connection:
>  snd_wnd = 1048576,
>  snd_cwnd = 5712,
>  t_srtt = 6391,
>  t_rttvar = 903,
>  t_rxtshift = 0,
>  t_rttmin = 30,
>  t_rttbest = 4903,
>  t_rttupdated = 220095,
>  max_sndwnd = 1048576,
>  snd_cwnd_prev = 4284,
>  snd_ssthresh_prev = 2856,
>  snd_recover_prev = 1397053524,
>  t_sndzerowin = 0,
>  t_badrxtwin = 584273259,
>  snd_limited = 0 '\0',
>  t_rttlow = 150,
> I've stored some dumps of the tcpcb at
>  http://people.freebsd.org/~peter/tcpcb.txt
> Note that some in the tcpcb.txt file also have
>  snd_limited = 2 '\002',
>
> Over the last few days I've tried things like turning off sack, tso, the
> various rfc knobs etc.  I believe they're all back to normal now.
>
> There's a small ~15 second tcpdump sample of the sender side and the receiver
> side at: http://people.freebsd.org/~peter/send.cap.gz and
> http://people.freebsd.org/~peter/recv.cap.gz
> Both ends were ntp synced.  The dumps have no sensitive data.
>
> For amusement, I just tried this, with roughly 1 second in between:
> peter@bme:~ %	scp pkg-ysv:k.gz /tmp
> k.gz              100%   25MB   5.0MB/s   00:05
> peter@bme:~ %	scp pkg-ysv:k.gz /tmp
> k.gz                0%  960KB  20.3KB/s   41:29 ETA^C
>
> There was no pre-existing hostcache state between those two endpoints for
> the first run.  At the end, this was created in the hostcache:
> IP address   SSTRESH   RTT  RTTVAR BANDWIDTH     CWND
> 213.138..       5952 165ms    21ms         0     8688
> All connections went slow after that.  Note that the ssh test was over ipv4
> - the rest above is on ipv6.  However, we're seeing the same weird stuff
> with http over ipv4 as well between the same two endpoints.
>
> It was pointed out to me that this has come up before, eg: misc/173859
> I know we've seen this at work as well.
>
> A few days earlier we were pushing ~45MB/sec (bytes, not bits) between these
> endpoints. Out of the blue it crashed to ~10KB/sec.  Why can't it get out of
> slow-start?  Is it even stuck in slow-start like I think?  Is the 148-155ms
> bimodal rtt the problem?
>
> Any insight would be greatly appreciated.  (please don't drop me from cc:)
The receiver-side trace file shows that some packets are being lost,
and this limits the throughput. Do you also see packet loss when using
ping?
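
To illustrate why even a small loss rate matters on a ~150 ms path, here
is a rough sketch based on the well-known Mathis et al. approximation,
throughput ~= C * MSS / (RTT * sqrt(p)). It is only a back-of-the-envelope
model, not a description of what the FreeBSD stack does. The MSS of 1428
bytes is an assumption (the snd_cwnd values in the tcpcb dump are
multiples of 1428), the RTT is taken from the hostcache entry, the loss
rates are hypothetical rather than measured, and the function names are
made up for this example.

```python
import math


def mathis_throughput(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3.0 / 2.0)):
    """Approximate loss-limited TCP throughput in bytes/s (Mathis model)."""
    if not 0.0 < loss_rate <= 1.0:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    return c * mss_bytes / (rtt_s * math.sqrt(loss_rate))


def window_limited_throughput(window_bytes, rtt_s):
    """Upper bound in bytes/s when only window_bytes are in flight per RTT."""
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    return window_bytes / rtt_s


if __name__ == "__main__":
    MSS = 1428   # assumed, inferred from the tcpcb dump
    RTT = 0.150  # seconds, from the hostcache entry for the slow pair

    # Hypothetical loss rates, only to show the trend.
    for p in (0.0001, 0.001, 0.01, 0.1):
        rate = mathis_throughput(MSS, RTT, p)
        print(f"loss {p:7.4f}: about {rate / 1000.0:8.1f} kB/s")

    # Ceiling if the sender never has more than ssthresh bytes in flight.
    cap = window_limited_throughput(7343, RTT)
    print(f"7343-byte window: about {cap / 1000.0:.1f} kB/s")
```

The second function gives the ceiling that applies when the sender keeps
only a fixed window (for example the 7343 bytes from the hostcache
ssthresh) in flight per round trip, independent of the loss model.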

Best regards
Michael
> -- 
> Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV
> UTF-8: for when a ' just won\342\200\231t do.
>