Date: Tue, 14 Jun 2011 12:23:03 +0300
From: Kostik Belousov <kostikbel@gmail.com>
To: Mikolaj Golub <trociny@freebsd.org>
Cc: freebsd-net@freebsd.org, Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Scenario to make recv(MSG_WAITALL) stuck
Message-ID: <20110614092303.GG48734@deviant.kiev.zoral.com.ua>
In-Reply-To: <86pqmhn1pf.fsf@kopusha.home.net>
References: <86pqmhn1pf.fsf@kopusha.home.net>
[-- Attachment #1 --]
On Mon, Jun 13, 2011 at 07:19:40PM +0300, Mikolaj Golub wrote:
> Hi,
>
> Below is a scenario that makes recv(2) with the MSG_WAITALL flag get stuck.
>
> (See http://people.freebsd.org/~trociny/test_MSG_WAITALL.4.c for the test code).
>
> Let the size of the receive buffer be SOBUF_SIZE (e.g. 10000 bytes).
>
> On the sender side do 2 send() requests:
>
> 1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10);
>
> 2) data of size equal to SOBUF_SIZE.
>
> After this, on the receiver side do 2 recv() requests with the MSG_WAITALL flag set:
>
> 1) recv() data of SOBUF_SIZE / 10 size;
>
> 2) recv() data of SOBUF_SIZE size;
>
> The second recv() will take a very long time. In tcpdump one can observe
> that the window is permanently stuck at 0 and pending data is only sent via
> TCP window probes (so one byte every few seconds).
>
> 18:09:14.784698 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [S], seq 1907676797, win 65535, options [mss 16344,nop,wscale 3,sackOK,TS val 22207 ecr 0], length 0
> 18:09:14.784729 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [S.], seq 2298857585, ack 1907676798, win 10000, options [mss 16344,nop,wscale 3,sackOK,TS val 2718467987 ecr 22207], length 0
> 18:09:14.784749 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 0
> 18:09:14.785168 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [P.], seq 1:1001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 1000
> 18:09:14.785264 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 1001:10001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 9000
> 18:09:14.785280 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718467987 ecr 22207], length 0
> 18:09:19.784440 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 22707 ecr 2718467987], length 1
> 18:09:19.784480 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718468487 ecr 22707], length 0
> 18:09:24.784439 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 23207 ecr 2718468487], length 1
> 18:09:24.784472 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10002, win 0, options [nop,nop,TS val 2718468987 ecr 23207], length 0
> 18:09:29.784437 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10002:10003, ack 1, win 8960, options [nop,nop,TS val 23707 ecr 2718468987], length 1
> 18:09:29.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10003, win 0, options [nop,nop,TS val 2718469487 ecr 23707], length 0
> 18:09:34.784444 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10003:10004, ack 1, win 8960, options [nop,nop,TS val 24207 ecr 2718469487], length 1
> 18:09:34.784486 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10004, win 0, options [nop,nop,TS val 2718469987 ecr 24207], length 0
> 18:09:39.784443 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10004:10005, ack 1, win 8960, options [nop,nop,TS val 24707 ecr 2718469987], length 1
> 18:09:39.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10005, win 0, options [nop,nop,TS val 2718470487 ecr 24707], length 0
> 18:09:44.784442 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10005:10006, ack 1, win 8960, options [nop,nop,TS val 25207 ecr 2718470487], length 1
> 18:09:44.784477 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10006, win 0, options [nop,nop,TS val 2718470987 ecr 25207], length 0
> ...
>
> I first noticed this issue with HAST and suspect other people observed it with
> HAST too.
>
> Below is an explanation of what is going on.
>
> We totally filled the receiver buffer with one SOBUF_SIZE/10 size request and
> a partial SOBUF_SIZE request. When the first request was processed we got
> SOBUF_SIZE/10 free space.
> It was just enough to receive the rest of the bytes for
> the second request, and the receiving thread went into soreceive_generic->sbwait
> here:
>
>         /*
>          * If we have less data than requested, block awaiting more (subject
>          * to any timeout) if:
>          *   1. the current count is less than the low water mark, or
>          *   2. MSG_WAITALL is set, and it is possible to do the entire
>          *      receive operation at once if we block (resid <= hiwat).
>          *   3. MSG_DONTWAIT is not set
>          * If MSG_WAITALL is set but resid is larger than the receive buffer,
>          * we have to do the receive in sections, and thus risk returning a
>          * short count if a timeout or signal occurs after we start.
>          */
>         if (m == NULL || (((flags & MSG_DONTWAIT) == 0 &&
>             so->so_rcv.sb_cc < uio->uio_resid) &&
>             (so->so_rcv.sb_cc < so->so_rcv.sb_lowat ||
>             ((flags & MSG_WAITALL) && uio->uio_resid <= so->so_rcv.sb_hiwat)) &&
>             m->m_nextpkt == NULL && (pr->pr_flags & PR_ATOMIC) == 0)) {
>                 ...
>                 error = sbwait(&so->so_rcv);
>
> recvbuf is almost full but has enough space to satisfy the MSG_WAITALL request
> without draining data to the user buffer, so soreceive waits for data. But the
> window was closed when the buffer was filled, and to avoid silly window
> syndrome it reopens only when the available space reaches sb_hiwat/4 or
> maxseg:
>
> tcp_output():
>
>         /*
>          * Calculate receive window.  Don't shrink window,
>          * but avoid silly window syndrome.
>          */
>         if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
>             recwin < (long)tp->t_maxseg)
>                 recwin = 0;
>
> so it is stuck and pending data is only sent via TCP window probes.
>
> It looks like the fix could be to remove the condition that blocks when
> MSG_WAITALL is set and it is possible to do the entire receive operation at
> once, as in the patch:
>
> http://people.freebsd.org/~trociny/uipc_socket.c.soreceive_generic.MSG_DONTWAIT.patch
>
> This works for me but I am not sure this is a correct solution.
>
> Note: the issue does not reproduce with soreceive_stream.
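[Editorial sketch, not part of the original message.] The interaction of the
two quoted conditions can be checked by plugging in the numbers from the
scenario. The following is a simplified user-space model, not kernel code:
struct rcvbuf_model, would_block() and advertised_window() are hypothetical
names, the soreceive_generic() test is reduced to the case where data is
queued (m != NULL, m->m_nextpkt == NULL) on a non-atomic protocol, and free
space is taken as sb_hiwat - sb_cc, which ignores mbuf accounting and the
"don't shrink window" logic.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the kernel fields quoted above. */
struct rcvbuf_model {
	long sb_cc;	/* bytes currently queued in the receive buffer */
	long sb_hiwat;	/* receive buffer size (SOBUF_SIZE in the scenario) */
	long sb_lowat;	/* low water mark */
};

/*
 * Simplified form of the soreceive_generic() test. Assumes data is queued
 * (m != NULL), a single record (m->m_nextpkt == NULL) and a protocol
 * without PR_ATOMIC, such as TCP.
 */
static bool
would_block(const struct rcvbuf_model *sb, long resid, bool waitall,
    bool dontwait)
{
	return (!dontwait && sb->sb_cc < resid &&
	    (sb->sb_cc < sb->sb_lowat ||
	    (waitall && resid <= sb->sb_hiwat)));
}

/*
 * Silly window avoidance as in the quoted tcp_output() fragment.
 * Free space is approximated as sb_hiwat - sb_cc (an assumption).
 */
static long
advertised_window(const struct rcvbuf_model *sb, long maxseg)
{
	long recwin = sb->sb_hiwat - sb->sb_cc;

	if (recwin < sb->sb_hiwat / 4 && recwin < maxseg)
		recwin = 0;
	return (recwin);
}
```

With the values from the trace (sb_hiwat = 10000, sb_cc = 9000 after the
first recv() drained 1000 bytes, resid = 10000, and maxseg assumed to be
about the 16344 shown as mss), would_block() is true for MSG_WAITALL while
advertised_window() is 0, since 1000 bytes of free space is below both
sb_hiwat/4 = 2500 and maxseg. In this model neither side can make progress
except through window probes.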
I do not understand what then happens for the recvfrom(2) call?
Would it get some error, or 0 as return and no data, or something else?

Also, what is the MT_CONTROL chunk about?

[-- Attachment #2 --]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (FreeBSD)

iEYEARECAAYFAk33KHcACgkQC3+MBN1Mb4iprACg1vS2OwYrzEl3p9lkyzEg0GuH
3PQAoIO+Pj62IonkyB2UzamxDS3TGX2Z
=KRFM
-----END PGP SIGNATURE-----
