Date: Thu, 5 Dec 2013 11:57:46 +0100 From: Michael Tuexen <Michael.Tuexen@lurchi.franken.de> To: Valentin Nechayev <netch@netch.kiev.ua> Cc: freebsd-net@freebsd.org Subject: Re: SCTP huge connect delays (at amd64) and yet another question Message-ID: <45DB7B10-68DE-41F2-A5E9-22AFFC65999E@lurchi.franken.de> In-Reply-To: <11932BA9-A734-4D4F-BCBB-6A0D926A22A9@lurchi.franken.de> References: <20131205084142.GA31113@netch.kiev.ua> <11932BA9-A734-4D4F-BCBB-6A0D926A22A9@lurchi.franken.de>
next in thread | previous in thread | raw e-mail | index | archive | help
More thinking and testing. Without SCTP_NODELAY the following can also happen: > INIT < INIT-ACK < COOKIE-ECHO > COOKIE-ACK < DATA(abc) > SACK < DATA(def) possibly more... 200 ms delay > SACK < all remaining DATA chunks > SHUTDOWN < SHUTDOWN-ACK > SHUTDOWN-COMPLETE Timing comes into the game. The question is if all send() calls have = been completed before the first SACK is received. Not sure this depends in i386 vs. = amd64, but timing is important. On a Raspberry Pi I saw in a reproducable way > INIT < INIT-ACK < COOKIE-ECHO > COOKIE-ACK < DATA(abc) > SACK < DATA(def) 200 ms delay > SACK < DATA(ghi);DATA(jkl);DATA(mno);DATA(pqr); > SHUTDOWN < SHUTDOWN-ACK > SHUTDOWN-COMPLETE Best regards Michael On Dec 5, 2013, at 11:32 AM, Michael Tuexen = <Michael.Tuexen@lurchi.franken.de> wrote: > On Dec 5, 2013, at 9:41 AM, Valentin Nechayev <netch@netch.kiev.ua> = wrote: >=20 >> Hi, >>=20 >> I've got some test results which are surprising and I would get >> a clarification. >>=20 >> A simple connection is created between two one-to-one SCTP sockets >> (AF_INET, SOCK_STREAM, IPPROTO_SCTP) at loopback (127.0.0.1). The >> server side sends 6 3-byte messages to client side and optionally >> designates writing shutdown. Client receives all them and measures >> a time before each receiving. >> Code is showed at the end of this message. >> Tested systems are: >> * FreeBSD 9.2-release/amd64 >> * FreeBSD 9.1-release/amd64 >> * FreeBSD 9.1-release/i386 >> * Linux OpenSuSE 12.2, kernel 3.4.63-2.44-default, x86_64 >> * Linux RHEL 6.3, kernel 2.6.32-279.22.1.38.0.el6.x86_64 >>=20 >> The first discrepancy found is specific for FreeBSD on amd64 and not >> for i386 version; it's that connection setup lasts 2-4 seconds (!!) >> Tcpdump shows indication that could be parsed as message miss: > Hi Valentin, >=20 > could you send me the .pcap file instead of the tcpdump output. > I would like to see the addresses listed in the INIT and INIT-ACK. >=20 > You can send that file to tuexen@freebsd.org. >>=20 >> tcpdump: listening on lo0, link-type NULL (BSD loopback), capture = size 65535 byt >> es >> 08:18:34.639422 IP (tos 0x0, ttl 64, id 65094, offset 0, flags = [none], proto SCT >> P (132), length 188, bad cksum 0 (->f274)!) >> 10.0.0.2.50025 > 127.0.0.1.2500: sctp > I'm wondering why 10.0.0.2 is the source address and not 127.0.0.1 >> 1) [INIT] [init tag: 3943463987] [rwnd: 1864135] [OS: 10] [MIS: = 2048] [i >> nit TSN: 3475830004] >> 08:18:34.639450 IP (tos 0x0, ttl 64, id 42621, offset 0, flags = [none], proto SCT >> P (132), length 524, bad cksum 0 (->48ee)!) >> 127.0.0.1.2500 > 10.0.0.2.50025: sctp >> 1) [INIT ACK] [init tag: 59811639] [rwnd: 1864135] [OS: 10] = [MIS: 2048] >> [init TSN: 466863335] >> 08:18:34.639467 IP (tos 0x0, ttl 64, id 52783, offset 0, flags = [none], proto SCT >> P (132), length 424, bad cksum 0 (->21a0)!) >> 10.0.0.2.50025 > 127.0.0.1.2500: sctp >> 1) [COOKIE ECHO] >> 08:18:35.639618 IP (tos 0x0, ttl 64, id 12109, offset 0, flags [DF], = proto SCTP >> (132), length 424, bad cksum 0 (->8082)!) >> 10.0.0.2.50025 > 127.0.0.1.2500: sctp >> 1) [COOKIE ECHO] >> 08:18:36.692628 IP (tos 0x0, ttl 64, id 48682, offset 0, flags [DF], = proto SCTP >> (132), length 76, bad cksum 0 (->7e01)!) >> 127.0.0.1.2500 > 127.0.0.1.50025: sctp > The retransmission goes from 127.0.0.1. Hmm. Not sure why. >> 1) [HB REQ] >> 08:18:36.692668 IP (tos 0x0, ttl 64, id 10809, offset 0, flags [DF], = proto SCTP (132), length 76, bad cksum 0 (->86f2)!) >> 10.0.0.2.50025 > 127.0.0.1.2500: sctp >> 1) [HB ACK]=20 >> 08:18:36.692707 IP (tos 0x2,ECT(0), ttl 64, id 16588, offset 0, flags = [DF], proto SCTP (132), length 52, bad cksum 0 (->fb75)!) >> 127.0.0.1.2500 > 127.0.0.1.50025: sctp >> 1) [DATA] (B)(E) [TSN: 466863335] [SID: 0] [SSEQ 0] [PPID 0x0] = [Payload: >> 0x0000: 6162 63 abc >> [...] >>=20 >> At 08:18:34.639467, cookie echo was sent but likely ignored. One >> second later it was resent. Then, yet another strange timeout was >> invented before HB REQ. >>=20 >> Test series show this can spend more than 4 seconds, average value >> is about 3 seconds. Two 20-times run summary times are 58 to 63 >> seconds, so, I've got 2.9...3.15 average connect time. >>=20 >> Neither Linux nor 32-bit FreeBSD shows this. > FreeBSD should neither... Do you see this on FreeBSD 9.2 amd64? >>=20 >> The second discrepancy is well known case of so-called "Nagle" >> algorithm adapted for SCTP but details are confusing. If >> SCTP_NODELAY isn't turned on on server side, tcpdump shows that the >> second packet is sent from sender side without delay, but receiver's >> SACK is delayed for 200 ms by default. These results are identical = for >> FreeBSD (32 bit) and Linux, but not amd64 FreeBSD (see below). But >> why? A common sense suggests that, if client receives all = immediately, >> and server has already prepared its data, no additional delay shall = be >> invented. In analogue to TCP, I would expect that, until acknoledge >> for "abc" is got, "def" isn't sent, but then the latter is sent >> immediately. >>=20 >> 09:28:11.374335 IP (tos 0x2,ECT(0), ttl 64, id 24204, offset 0, flags = [DF], prot >> o SCTP (132), length 52, bad cksum 0 (->ddb5)!) >> 127.0.0.1.2500 > 127.0.0.1.41007: sctp >> 1) [DATA] (B)(E) [TSN: 183313025] [SID: 0] [SSEQ 0] [PPID 0x0] = [Payload: >> 0x0000: 6162 63 abc >> 09:28:11.374349 IP (tos 0x0, ttl 64, id 522, offset 0, flags [none], = proto SCTP=20 >> (132), length 48, bad cksum 0 (->7a3e)!) >> 127.0.0.1.41007 > 127.0.0.1.2500: sctp >> 1) [SACK] [cum ack 183313025] [a_rwnd 1863876] [#gap acks 0] = [#dup tsns=20 >> 0]=20 >> 09:28:11.374368 IP (tos 0x2,ECT(0), ttl 64, id 64629, offset 0, flags = [DF], prot >> o SCTP (132), length 52, bad cksum 0 (->3fcc)!) >> 127.0.0.1.2500 > 127.0.0.1.41007: sctp >> 1) [DATA] (B)(E) [TSN: 183313026] [SID: 0] [SSEQ 1] [PPID 0x0] = [Payload: >> 0x0000: 6465 66 def >> 09:28:11.573780 IP (tos 0x0, ttl 64, id 12179, offset 0, flags = [none], proto SCT >> P (132), length 48, bad cksum 0 (->4cb5)!) >> 127.0.0.1.41007 > 127.0.0.1.2500: sctp >> 1) [SACK] [cum ack 183313026] [a_rwnd 1864135] [#gap acks 0] = [#dup tsns=20 >> 0]=20 >>=20 > Please note, that the first SACK is returned without the 200ms delay. = This is > required by the RFC and the above trace seems to show that. >> But, if server shuts its writing side down ("s" in argv[]), this >> laziness disappears. Again, the logic is too opaque and confusing. > What do you mean by this? >>=20 >> 64-bit (amd64) FreeBSD shows another behavior (both 9.1 and 9.2): in >> addition to setup delay (see above), the delay between 2nd and 3rd >> received packet (case SCTP_NODELAY isn't activated) could be longer >> than minimally needed one and spreads between a few hundreds of >> microseconds up to full 0.2 second delay shown on other platforms. >> In average, 1/8 of runs show this delay: >>=20 >> $ fgrep ghi ll | sort -rn -k2,2 -t=3D | uniq -c >> 1 got: ghi (with MSG_EOR) tdiff=3D200835 >> 1 got: ghi (with MSG_EOR) tdiff=3D200829 >> 1 got: ghi (with MSG_EOR) tdiff=3D200826 >> 1 got: ghi (with MSG_EOR) tdiff=3D200822 >> 1 got: ghi (with MSG_EOR) tdiff=3D200819 >> 1 got: ghi (with MSG_EOR) tdiff=3D200800 >> 1 got: ghi (with MSG_EOR) tdiff=3D200792 >> 1 got: ghi (with MSG_EOR) tdiff=3D199885 >> 1 got: ghi (with MSG_EOR) tdiff=3D163816 >> 1 got: ghi (with MSG_EOR) tdiff=3D55849 >> 1 got: ghi (with MSG_EOR) tdiff=3D1825 >> 21 got: ghi (with MSG_EOR) tdiff=3D2 >> 38 got: ghi (with MSG_EOR) tdiff=3D1 >>=20 >> It's definitely better than delay each run, as on other platforms >> (but the initial delay annoys roughly). > Without SCTP_NODELAY bundling can happen or not, it depends on timing. > It would be great, if you can provide a .pcap file for a transfer you > think shows some buggy behaviour. Then we can figure out what is going = on. >>=20 >> The testing code: >> =3D=3D=3D >> #include <sys/types.h> >> #include <sys/socket.h> >> #include <sys/poll.h> >> #include <sys/time.h> >> #include <netinet/in.h> >> #include <netinet/sctp.h> >> #include <arpa/inet.h> >> #include <fcntl.h> >> #include <unistd.h> >> #include <stdio.h> >> #include <string.h> >> #include <err.h> >>=20 >> #define PORT 2500 >>=20 >> int main(int argc, char *argv[]) >> { >> int s_li, s_ac, s_cl; >> struct sockaddr_in sia; >> struct iovec iov[1]; >> struct msghdr msg; >> socklen_t slen; >> struct timeval tv0, tv1; >> int tdiff; >> int i; >>=20 >> s_li =3D socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP); >> if (s_li < 0) >> err(1, "socket"); >> memset(&sia, 0, sizeof(sia)); >> sia.sin_family =3D AF_INET; >> sia.sin_addr.s_addr =3D htonl(0x7F000001); >> sia.sin_port =3D htons(PORT); >> if (bind(s_li, (struct sockaddr*)&sia, sizeof(sia)) < 0) >> err(1, "bind"); >> if (listen(s_li, 1) < 0) >> err(1, "listen"); >> s_cl =3D socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP); >> if (s_cl < 0) >> err(1, "socket"); >> if (connect(s_cl, (struct sockaddr*)&sia, sizeof(sia)) < 0) >> err(1, "connect"); >> slen =3D sizeof(sia); >> s_ac =3D accept(s_li, (struct sockaddr*) &sia, &slen); >> if (s_ac < 0) >> err(1, "accept"); >> for (i =3D 1; i < argc; ++i) { >> if (!strcmp(argv[i], "nn")) { >> const int one =3D 1; >> if (setsockopt(s_ac, IPPROTO_SCTP, SCTP_NODELAY, &one, = sizeof(one)) < 0) >> warn("setsockopt(SCTP_NODELAY)"); >> } >> } >> if (send(s_ac, "abc", 3, 0) !=3D 3) >> err(1, "send"); >> if (send(s_ac, "def", 3, MSG_EOR) !=3D 3) > MSG_EOR is nothing you provide at a send() call. The flag is only > returned by the recvmsg() call. >> err(1, "send"); >> if (send(s_ac, "ghi", 3, 0) !=3D 3) >> err(1, "send"); >> if (send(s_ac, "jkl", 3, MSG_EOR) !=3D 3) >> err(1, "send"); >> if (send(s_ac, "mno", 3, 0) !=3D 3) >> err(1, "send"); >> if (send(s_ac, "pqr", 3, MSG_EOR) !=3D 3) >> err(1, "send"); >> for (i =3D 1; i < argc; ++i) { >> if (!strcmp(argv[i], "s")) >> shutdown(s_ac, SHUT_WR); >> } >> for(;;) { >> char buf[1024]; >> memset(&msg, 0, sizeof(msg)); >> iov[0].iov_base =3D buf; iov[0].iov_len =3D sizeof(buf) - 1; >> msg.msg_iov =3D iov; msg.msg_iovlen =3D 1; >> gettimeofday(&tv0, NULL); >> ssize_t got =3D recvmsg(s_cl, &msg, 0); >> gettimeofday(&tv1, NULL); >> tdiff =3D (int)tv1.tv_usec - (int)tv0.tv_usec; >> if (tdiff < 0) >> tdiff +=3D 1000000; >> if (got =3D=3D 0) >> break; >> if (got =3D=3D -1) { >> perror("recvmsg"); >> break; >> } >> buf[got] =3D 0; >> printf("got: %s (%s MSG_EOR) tdiff=3D%d\n", >> buf, >> (msg.msg_flags & MSG_EOR) ? "with" : "without", >> tdiff); >> if (!strncmp(buf, "pqr", 3)) >> break; >> } >> return 0; >> } > OK. Here is what I would expect on the wire: >=20 > Without SCTP_NODELAY: >=20 >> INIT > < INIT_ACK >> COOKIE_ECHO > < COOKIE_ACK > < DATA(abc) >> SACK > < DATA(def);DATA(ghi);DATA(jkl);DATA(mno);DATA(pqr) >> SACK >> SHUTDOWN > < SHUTDOWN_ACK >> SHUTDOWN_COMPLETE >=20 > There should be no substantial delay between any messages above. >=20 > With SCTP_NODELAY >> INIT > < INIT_ACK >> COOKIE_ECHO > < COOKIE_ACK > < DATA(abc) > < DATA(def) > < DATA(ghi) > < DATA(mno) > < DATA(pqr) >> SHUTDOWN > < SHUTDOWN_ACK >> SHUTDOWN_COMPLETE >=20 > There will be three SACK somewhere between the DATA chunks depending = on > the timing. >=20 > There should be no substantial delay between any messages above. >=20 > I think if you see anything else, there is a bug. So do you see a = different > behavior on FreeBSD 9.2 (i386/amd64)? If yes, can you provide a .pcap = file? >=20 >=20 > Here is what I see on a 9.2 amd64 system: >=20 > tuexen@bsd9:~ % uname -a > FreeBSD bsd9.fh-muenster.de 9.2-RELEASE FreeBSD 9.2-RELEASE #0 = r255898: Thu Sep 26 22:50:31 UTC 2013 = root@bake.isc.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 > tuexen@bsd9:~ % ./valentin=20 > got: abc (with MSG_EOR) tdiff=3D3 > got: def (with MSG_EOR) tdiff=3D1 > got: ghi (with MSG_EOR) tdiff=3D1 > got: jkl (with MSG_EOR) tdiff=3D1 > got: mno (with MSG_EOR) tdiff=3D1 > got: pqr (with MSG_EOR) tdiff=3D0 > tuexen@bsd9:~ % ./valentin nn > got: abc (with MSG_EOR) tdiff=3D4 > got: def (with MSG_EOR) tdiff=3D2 > got: ghi (with MSG_EOR) tdiff=3D1 > got: jkl (with MSG_EOR) tdiff=3D1 > got: mno (with MSG_EOR) tdiff=3D1 > got: pqr (with MSG_EOR) tdiff=3D1 >=20 > Do you have any special routing setup? >=20 > Best regards > Michael >> // vim:ts=3D2:sts=3D2:sw=3D2:et:si: >> =3D=3D=3D >>=20 >>=20 >> -netch- >> _______________________________________________ >> freebsd-net@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-net >> To unsubscribe, send any mail to = "freebsd-net-unsubscribe@freebsd.org" >>=20 >=20 > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" >=20
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?45DB7B10-68DE-41F2-A5E9-22AFFC65999E>