FreeBSD Mail Archives

Date:      Thu, 5 Dec 2013 11:57:46 +0100
From:      Michael Tuexen <Michael.Tuexen@lurchi.franken.de>
To:        Valentin Nechayev <netch@netch.kiev.ua>
Cc:        freebsd-net@freebsd.org
Subject:   Re: SCTP huge connect delays (at amd64) and yet another question
Message-ID:  <45DB7B10-68DE-41F2-A5E9-22AFFC65999E@lurchi.franken.de>
In-Reply-To: <11932BA9-A734-4D4F-BCBB-6A0D926A22A9@lurchi.franken.de>
References:  <20131205084142.GA31113@netch.kiev.ua> <11932BA9-A734-4D4F-BCBB-6A0D926A22A9@lurchi.franken.de>

More thinking and testing.

Without SCTP_NODELAY the following can also happen:

  > INIT
  < INIT-ACK
  < COOKIE-ECHO
  > COOKIE-ACK
  < DATA(abc)
  > SACK
  < DATA(def) possibly more...
200 ms delay
  > SACK
  < all remaining DATA chunks
  > SHUTDOWN
  < SHUTDOWN-ACK
  > SHUTDOWN-COMPLETE

Timing comes into the game. The question is if all send() calls have =
been completed
before the first SACK is received. Not sure this depends in i386 vs. =
amd64, but
timing is important. On a Raspberry Pi I saw in a reproducable way

  > INIT
  < INIT-ACK
  < COOKIE-ECHO
  > COOKIE-ACK
  < DATA(abc)
  > SACK
  < DATA(def)
200 ms delay
  > SACK
  < DATA(ghi);DATA(jkl);DATA(mno);DATA(pqr);
  > SHUTDOWN
  < SHUTDOWN-ACK
  > SHUTDOWN-COMPLETE

Best regards
Michael
On Dec 5, 2013, at 11:32 AM, Michael Tuexen =
<Michael.Tuexen@lurchi.franken.de> wrote:

> On Dec 5, 2013, at 9:41 AM, Valentin Nechayev <netch@netch.kiev.ua> =
wrote:
>=20
>> Hi,
>>=20
>> I've got some test results which are surprising and I would get
>> a clarification.
>>=20
>> A simple connection is created between two one-to-one SCTP sockets
>> (AF_INET, SOCK_STREAM, IPPROTO_SCTP) at loopback (127.0.0.1). The
>> server side sends 6 3-byte messages to client side and optionally
>> designates writing shutdown. Client receives all them and measures
>> a time before each receiving.
>> Code is showed at the end of this message.
>> Tested systems are:
>> * FreeBSD 9.2-release/amd64
>> * FreeBSD 9.1-release/amd64
>> * FreeBSD 9.1-release/i386
>> * Linux OpenSuSE 12.2, kernel 3.4.63-2.44-default, x86_64
>> * Linux RHEL 6.3, kernel 2.6.32-279.22.1.38.0.el6.x86_64
>>=20
>> The first discrepancy found is specific for FreeBSD on amd64 and not
>> for i386 version; it's that connection setup lasts 2-4 seconds (!!)
>> Tcpdump shows indication that could be parsed as message miss:
> Hi Valentin,
>=20
> could you send me the .pcap file instead of the tcpdump output.
> I would like to see the addresses listed in the INIT and INIT-ACK.
>=20
> You can send that file to tuexen@freebsd.org.
>>=20
>> tcpdump: listening on lo0, link-type NULL (BSD loopback), capture =
size 65535 byt
>> es
>> 08:18:34.639422 IP (tos 0x0, ttl 64, id 65094, offset 0, flags =
[none], proto SCT
>> P (132), length 188, bad cksum 0 (->f274)!)
>>   10.0.0.2.50025 > 127.0.0.1.2500: sctp
> I'm wondering why 10.0.0.2 is the source address and not 127.0.0.1
>>       1) [INIT] [init tag: 3943463987] [rwnd: 1864135] [OS: 10] [MIS: =
2048] [i
>> nit TSN: 3475830004]
>> 08:18:34.639450 IP (tos 0x0, ttl 64, id 42621, offset 0, flags =
[none], proto SCT
>> P (132), length 524, bad cksum 0 (->48ee)!)
>>   127.0.0.1.2500 > 10.0.0.2.50025: sctp
>>       1) [INIT ACK] [init tag: 59811639] [rwnd: 1864135] [OS: 10] =
[MIS: 2048]
>> [init TSN: 466863335]
>> 08:18:34.639467 IP (tos 0x0, ttl 64, id 52783, offset 0, flags =
[none], proto SCT
>> P (132), length 424, bad cksum 0 (->21a0)!)
>>   10.0.0.2.50025 > 127.0.0.1.2500: sctp
>>       1) [COOKIE ECHO]
>> 08:18:35.639618 IP (tos 0x0, ttl 64, id 12109, offset 0, flags [DF], =
proto SCTP
>> (132), length 424, bad cksum 0 (->8082)!)
>>   10.0.0.2.50025 > 127.0.0.1.2500: sctp
>>       1) [COOKIE ECHO]
>> 08:18:36.692628 IP (tos 0x0, ttl 64, id 48682, offset 0, flags [DF], =
proto SCTP
>> (132), length 76, bad cksum 0 (->7e01)!)
>>   127.0.0.1.2500 > 127.0.0.1.50025: sctp
> The retransmission goes from 127.0.0.1. Hmm. Not sure why.
>>       1) [HB REQ]
>> 08:18:36.692668 IP (tos 0x0, ttl 64, id 10809, offset 0, flags [DF], =
proto SCTP (132), length 76, bad cksum 0 (->86f2)!)
>>   10.0.0.2.50025 > 127.0.0.1.2500: sctp
>>       1) [HB ACK]=20
>> 08:18:36.692707 IP (tos 0x2,ECT(0), ttl 64, id 16588, offset 0, flags =
[DF], proto SCTP (132), length 52, bad cksum 0 (->fb75)!)
>>   127.0.0.1.2500 > 127.0.0.1.50025: sctp
>>       1) [DATA] (B)(E) [TSN: 466863335] [SID: 0] [SSEQ 0] [PPID 0x0] =
[Payload:
>>       0x0000:  6162 63                                  abc
>> [...]
>>=20
>> At 08:18:34.639467, cookie echo was sent but likely ignored. One
>> second later it was resent. Then, yet another strange timeout was
>> invented before HB REQ.
>>=20
>> Test series show this can spend more than 4 seconds, average value
>> is about 3 seconds. Two 20-times run summary times are 58 to 63
>> seconds, so, I've got 2.9...3.15 average connect time.
>>=20
>> Neither Linux nor 32-bit FreeBSD shows this.
> FreeBSD should neither... Do you see this on FreeBSD 9.2 amd64?
>>=20
>> The second discrepancy is well known case of so-called "Nagle"
>> algorithm adapted for SCTP but details are confusing. If
>> SCTP_NODELAY isn't turned on on server side, tcpdump shows that the
>> second packet is sent from sender side without delay, but receiver's
>> SACK is delayed for 200 ms by default. These results are identical =
for
>> FreeBSD (32 bit) and Linux, but not amd64 FreeBSD (see below). But
>> why? A common sense suggests that, if client receives all =
immediately,
>> and server has already prepared its data, no additional delay shall =
be
>> invented. In analogue to TCP, I would expect that, until acknoledge
>> for "abc" is got, "def" isn't sent, but then the latter is sent
>> immediately.
>>=20
>> 09:28:11.374335 IP (tos 0x2,ECT(0), ttl 64, id 24204, offset 0, flags =
[DF], prot
>> o SCTP (132), length 52, bad cksum 0 (->ddb5)!)
>>   127.0.0.1.2500 > 127.0.0.1.41007: sctp
>>       1) [DATA] (B)(E) [TSN: 183313025] [SID: 0] [SSEQ 0] [PPID 0x0] =
[Payload:
>>       0x0000:  6162 63                                  abc
>> 09:28:11.374349 IP (tos 0x0, ttl 64, id 522, offset 0, flags [none], =
proto SCTP=20
>> (132), length 48, bad cksum 0 (->7a3e)!)
>>   127.0.0.1.41007 > 127.0.0.1.2500: sctp
>>       1) [SACK] [cum ack 183313025] [a_rwnd 1863876] [#gap acks 0] =
[#dup tsns=20
>> 0]=20
>> 09:28:11.374368 IP (tos 0x2,ECT(0), ttl 64, id 64629, offset 0, flags =
[DF], prot
>> o SCTP (132), length 52, bad cksum 0 (->3fcc)!)
>>   127.0.0.1.2500 > 127.0.0.1.41007: sctp
>>       1) [DATA] (B)(E) [TSN: 183313026] [SID: 0] [SSEQ 1] [PPID 0x0] =
[Payload:
>>       0x0000:  6465 66                                  def
>> 09:28:11.573780 IP (tos 0x0, ttl 64, id 12179, offset 0, flags =
[none], proto SCT
>> P (132), length 48, bad cksum 0 (->4cb5)!)
>>   127.0.0.1.41007 > 127.0.0.1.2500: sctp
>>       1) [SACK] [cum ack 183313026] [a_rwnd 1864135] [#gap acks 0] =
[#dup tsns=20
>> 0]=20
>>=20
> Please note, that the first SACK is returned without the 200ms delay. =
This is
> required by the RFC and the above trace seems to show that.
>> But, if server shuts its writing side down ("s" in argv[]), this
>> laziness disappears. Again, the logic is too opaque and confusing.
> What do you mean by this?
>>=20
>> 64-bit (amd64) FreeBSD shows another behavior (both 9.1 and 9.2): in
>> addition to setup delay (see above), the delay between 2nd and 3rd
>> received packet (case SCTP_NODELAY isn't activated) could be longer
>> than minimally needed one and spreads between a few hundreds of
>> microseconds up to full 0.2 second delay shown on other platforms.
>> In average, 1/8 of runs show this delay:
>>=20
>> $ fgrep ghi ll | sort -rn -k2,2 -t=3D | uniq -c
>>  1 got: ghi (with MSG_EOR) tdiff=3D200835
>>  1 got: ghi (with MSG_EOR) tdiff=3D200829
>>  1 got: ghi (with MSG_EOR) tdiff=3D200826
>>  1 got: ghi (with MSG_EOR) tdiff=3D200822
>>  1 got: ghi (with MSG_EOR) tdiff=3D200819
>>  1 got: ghi (with MSG_EOR) tdiff=3D200800
>>  1 got: ghi (with MSG_EOR) tdiff=3D200792
>>  1 got: ghi (with MSG_EOR) tdiff=3D199885
>>  1 got: ghi (with MSG_EOR) tdiff=3D163816
>>  1 got: ghi (with MSG_EOR) tdiff=3D55849
>>  1 got: ghi (with MSG_EOR) tdiff=3D1825
>> 21 got: ghi (with MSG_EOR) tdiff=3D2
>> 38 got: ghi (with MSG_EOR) tdiff=3D1
>>=20
>> It's definitely better than delay each run, as on other platforms
>> (but the initial delay annoys roughly).
> Without SCTP_NODELAY bundling can happen or not, it depends on timing.
> It would be great, if you can provide a .pcap file for a transfer you
> think shows some buggy behaviour. Then we can figure out what is going =
on.
>>=20
>> The testing code:
>> =3D=3D=3D
>> #include <sys/types.h>
>> #include <sys/socket.h>
>> #include <sys/poll.h>
>> #include <sys/time.h>
>> #include <netinet/in.h>
>> #include <netinet/sctp.h>
>> #include <arpa/inet.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <err.h>
>>=20
>> #define PORT 2500
>>=20
>> int main(int argc, char *argv[])
>> {
>> int s_li, s_ac, s_cl;
>> struct sockaddr_in sia;
>> struct iovec iov[1];
>> struct msghdr msg;
>> socklen_t slen;
>> struct timeval tv0, tv1;
>> int tdiff;
>> int i;
>>=20
>> s_li =3D socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
>> if (s_li < 0)
>>   err(1, "socket");
>> memset(&sia, 0, sizeof(sia));
>> sia.sin_family =3D AF_INET;
>> sia.sin_addr.s_addr =3D htonl(0x7F000001);
>> sia.sin_port =3D htons(PORT);
>> if (bind(s_li, (struct sockaddr*)&sia, sizeof(sia)) < 0)
>>   err(1, "bind");
>> if (listen(s_li, 1) < 0)
>>   err(1, "listen");
>> s_cl =3D socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
>> if (s_cl < 0)
>>   err(1, "socket");
>> if (connect(s_cl, (struct sockaddr*)&sia, sizeof(sia)) < 0)
>>   err(1, "connect");
>> slen =3D sizeof(sia);
>> s_ac =3D accept(s_li, (struct sockaddr*) &sia, &slen);
>> if (s_ac < 0)
>>   err(1, "accept");
>> for (i =3D 1; i < argc; ++i) {
>>   if (!strcmp(argv[i], "nn")) {
>>     const int one =3D 1;
>>     if (setsockopt(s_ac, IPPROTO_SCTP, SCTP_NODELAY, &one, =
sizeof(one)) < 0)
>>       warn("setsockopt(SCTP_NODELAY)");
>>   }
>> }
>> if (send(s_ac, "abc", 3, 0) !=3D 3)
>>   err(1, "send");
>> if (send(s_ac, "def", 3, MSG_EOR) !=3D 3)
> MSG_EOR is nothing you provide at a send() call. The flag is only
> returned by the recvmsg() call.
>>   err(1, "send");
>> if (send(s_ac, "ghi", 3, 0) !=3D 3)
>>   err(1, "send");
>> if (send(s_ac, "jkl", 3, MSG_EOR) !=3D 3)
>>   err(1, "send");
>> if (send(s_ac, "mno", 3, 0) !=3D 3)
>>   err(1, "send");
>> if (send(s_ac, "pqr", 3, MSG_EOR) !=3D 3)
>>   err(1, "send");
>> for (i =3D 1; i < argc; ++i) {
>>   if (!strcmp(argv[i], "s"))
>>     shutdown(s_ac, SHUT_WR);
>> }
>> for(;;) {
>>   char buf[1024];
>>   memset(&msg, 0, sizeof(msg));
>>   iov[0].iov_base =3D buf; iov[0].iov_len =3D sizeof(buf) - 1;
>>   msg.msg_iov =3D iov; msg.msg_iovlen =3D 1;
>>   gettimeofday(&tv0, NULL);
>>   ssize_t got =3D recvmsg(s_cl, &msg, 0);
>>   gettimeofday(&tv1, NULL);
>>   tdiff =3D (int)tv1.tv_usec - (int)tv0.tv_usec;
>>   if (tdiff < 0)
>>     tdiff +=3D 1000000;
>>   if (got =3D=3D 0)
>>     break;
>>   if (got =3D=3D -1) {
>>     perror("recvmsg");
>>     break;
>>   }
>>   buf[got] =3D 0;
>>   printf("got: %s (%s MSG_EOR) tdiff=3D%d\n",
>>       buf,
>>       (msg.msg_flags & MSG_EOR) ? "with" : "without",
>>       tdiff);
>>   if (!strncmp(buf, "pqr", 3))
>>     break;
>> }
>> return 0;
>> }
> OK. Here is what I would expect on the wire:
>=20
> Without SCTP_NODELAY:
>=20
>> INIT
> < INIT_ACK
>> COOKIE_ECHO
> < COOKIE_ACK
> < DATA(abc)
>> SACK
> < DATA(def);DATA(ghi);DATA(jkl);DATA(mno);DATA(pqr)
>> SACK
>> SHUTDOWN
> < SHUTDOWN_ACK
>> SHUTDOWN_COMPLETE
>=20
> There should be no substantial delay between any messages above.
>=20
> With SCTP_NODELAY
>> INIT
> < INIT_ACK
>> COOKIE_ECHO
> < COOKIE_ACK
> < DATA(abc)
> < DATA(def)
> < DATA(ghi)
> < DATA(mno)
> < DATA(pqr)
>> SHUTDOWN
> < SHUTDOWN_ACK
>> SHUTDOWN_COMPLETE
>=20
> There will be three SACK somewhere between the DATA chunks depending =
on
> the timing.
>=20
> There should be no substantial delay between any messages above.
>=20
> I think if you see anything else, there is a bug. So do you see a =
different
> behavior on FreeBSD 9.2 (i386/amd64)? If yes, can you provide a .pcap =
file?
>=20
>=20
> Here is what I see on a 9.2 amd64 system:
>=20
> tuexen@bsd9:~ % uname -a
> FreeBSD bsd9.fh-muenster.de 9.2-RELEASE FreeBSD 9.2-RELEASE #0 =
r255898: Thu Sep 26 22:50:31 UTC 2013     =
root@bake.isc.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
> tuexen@bsd9:~ % ./valentin=20
> got: abc (with MSG_EOR) tdiff=3D3
> got: def (with MSG_EOR) tdiff=3D1
> got: ghi (with MSG_EOR) tdiff=3D1
> got: jkl (with MSG_EOR) tdiff=3D1
> got: mno (with MSG_EOR) tdiff=3D1
> got: pqr (with MSG_EOR) tdiff=3D0
> tuexen@bsd9:~ % ./valentin nn
> got: abc (with MSG_EOR) tdiff=3D4
> got: def (with MSG_EOR) tdiff=3D2
> got: ghi (with MSG_EOR) tdiff=3D1
> got: jkl (with MSG_EOR) tdiff=3D1
> got: mno (with MSG_EOR) tdiff=3D1
> got: pqr (with MSG_EOR) tdiff=3D1
>=20
> Do you have any special routing setup?
>=20
> Best regards
> Michael
>> // vim:ts=3D2:sts=3D2:sw=3D2:et:si:
>> =3D=3D=3D
>>=20
>>=20
>> -netch-
>> _______________________________________________
>> freebsd-net@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-net
>> To unsubscribe, send any mail to =
"freebsd-net-unsubscribe@freebsd.org"
>>=20
>=20
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>=20

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?45DB7B10-68DE-41F2-A5E9-22AFFC65999E>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation