Date:      Tue, 14 Jun 2011 12:23:03 +0300
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Mikolaj Golub <trociny@freebsd.org>
Cc:        freebsd-net@freebsd.org, Pawel Jakub Dawidek <pjd@freebsd.org>
Subject:   Re: Scenario to make recv(MSG_WAITALL) stuck
Message-ID:  <20110614092303.GG48734@deviant.kiev.zoral.com.ua>
In-Reply-To: <86pqmhn1pf.fsf@kopusha.home.net>
References:  <86pqmhn1pf.fsf@kopusha.home.net>

On Mon, Jun 13, 2011 at 07:19:40PM +0300, Mikolaj Golub wrote:
> Hi,
>
> Below is a scenario that makes recv(2) with the MSG_WAITALL flag get stuck.
>
> (See http://people.freebsd.org/~trociny/test_MSG_WAITALL.4.c for the test code).
>
> Let the size of the receive buffer be SOBUF_SIZE (e.g. 10000 bytes).
>
> On sender side do 2 send() requests:
>
> 1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10);
>
> 2) data of size equal to SOBUF_SIZE.
>
> After this on receiver side do 2 recv() requests with MSG_WAITALL flag set:
>
> 1) recv() data of SOBUF_SIZE / 10 size;
>
> 2) recv() data of SOBUF_SIZE size;
>
> The second recv() will take a very long time. In tcpdump one can observe
> that the window is permanently stuck at 0 and pending data is only sent via
> TCP window probes (so one byte every few seconds).
>
> 18:09:14.784698 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [S], seq 1907676797, win 65535, options [mss 16344,nop,wscale 3,sackOK,TS val 22207 ecr 0], length 0
> 18:09:14.784729 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [S.], seq 2298857585, ack 1907676798, win 10000, options [mss 16344,nop,wscale 3,sackOK,TS val 2718467987 ecr 22207], length 0
> 18:09:14.784749 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 0
> 18:09:14.785168 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [P.], seq 1:1001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 1000
> 18:09:14.785264 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 1001:10001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 9000
> 18:09:14.785280 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718467987 ecr 22207], length 0
> 18:09:19.784440 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 22707 ecr 2718467987], length 1
> 18:09:19.784480 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718468487 ecr 22707], length 0
> 18:09:24.784439 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 23207 ecr 2718468487], length 1
> 18:09:24.784472 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10002, win 0, options [nop,nop,TS val 2718468987 ecr 23207], length 0
> 18:09:29.784437 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10002:10003, ack 1, win 8960, options [nop,nop,TS val 23707 ecr 2718468987], length 1
> 18:09:29.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10003, win 0, options [nop,nop,TS val 2718469487 ecr 23707], length 0
> 18:09:34.784444 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10003:10004, ack 1, win 8960, options [nop,nop,TS val 24207 ecr 2718469487], length 1
> 18:09:34.784486 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10004, win 0, options [nop,nop,TS val 2718469987 ecr 24207], length 0
> 18:09:39.784443 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10004:10005, ack 1, win 8960, options [nop,nop,TS val 24707 ecr 2718469987], length 1
> 18:09:39.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10005, win 0, options [nop,nop,TS val 2718470487 ecr 24707], length 0
> 18:09:44.784442 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10005:10006, ack 1, win 8960, options [nop,nop,TS val 25207 ecr 2718470487], length 1
> 18:09:44.784477 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10006, win 0, options [nop,nop,TS val 2718470987 ecr 25207], length 0
> ...
>
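To put a rough number on "very long time", here is a back-of-the-envelope
sketch in C. It assumes the sizes from the scenario (1000 and 10000 bytes
sent, 10000-byte receive buffer), that every probe delivers exactly one byte,
and that the probe interval stays at the 5 seconds seen in the trace. The
function names are made up for illustration and are not part of any kernel
API.

```c
#include <assert.h>

/*
 * Bytes that still have to cross the connection once the receive buffer
 * has filled: everything sent minus what fit into the buffer.
 */
long
sk_bytes_left(long first_send, long second_send, long bufsize)
{
	long total = first_send + second_send;

	assert(first_send >= 0 && second_send >= 0 && bufsize > 0);
	return total > bufsize ? total - bufsize : 0;
}

/*
 * Seconds needed to move bytes_left bytes if each window probe carries
 * one byte and probes are probe_interval_sec apart (assumed constant).
 */
long
sk_stall_seconds(long bytes_left, long probe_interval_sec)
{
	assert(bytes_left >= 0 && probe_interval_sec > 0);
	return bytes_left * probe_interval_sec;
}
```

With these assumptions sk_bytes_left(1000, 10000, 10000) is 1000 bytes, and
sk_stall_seconds(1000, 5) is 5000 seconds, i.e. well over an hour for the
second recv() to complete. If the probe interval backs off, it takes longer.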
> I first noticed this issue with HAST and suspect other people observed it with
> HAST too.
>
> Below is an explanation of what is going on.
>
> We totally filled the receive buffer with one SOBUF_SIZE/10 size request and
> a partial SOBUF_SIZE request. When the first request was processed we got
> SOBUF_SIZE/10 of free space. It was just enough to receive the rest of the
> bytes for the second request, and the receiving thread went into
> soreceive_generic->sbwait here:
>
>         /*
>          * If we have less data than requested, block awaiting more (subject
>          * to any timeout) if:
>          *   1. the current count is less than the low water mark, or
>          *   2. MSG_WAITALL is set, and it is possible to do the entire
>          *      receive operation at once if we block (resid <= hiwat).
>          *   3. MSG_DONTWAIT is not set
>          * If MSG_WAITALL is set but resid is larger than the receive buffer,
>          * we have to do the receive in sections, and thus risk returning a
>          * short count if a timeout or signal occurs after we start.
>          */
>         if (m == NULL || (((flags & MSG_DONTWAIT) == 0 &&
>             so->so_rcv.sb_cc < uio->uio_resid) &&
>             (so->so_rcv.sb_cc < so->so_rcv.sb_lowat ||
>             ((flags & MSG_WAITALL) && uio->uio_resid <= so->so_rcv.sb_hiwat)) &&
>             m->m_nextpkt == NULL && (pr->pr_flags & PR_ATOMIC) == 0)) {
>                  ...
>                  error = sbwait(&so->so_rcv);
>
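To make the quoted test easier to reason about, here is a simplified
user-space model of it in C. This is only a sketch: struct sk_rcvbuf, the
SK_MSG_* values and sk_would_block() are hypothetical names for illustration,
and the model covers only a stream socket that already has data queued (the
m == NULL, m_nextpkt and PR_ATOMIC terms are left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; real code uses <sys/socket.h>. */
#define SK_MSG_DONTWAIT 0x1
#define SK_MSG_WAITALL  0x2

/* Simplified stand-in for the receive socket buffer fields the test reads. */
struct sk_rcvbuf {
	size_t cc;     /* bytes currently queued (sb_cc) */
	size_t lowat;  /* low water mark (sb_lowat) */
	size_t hiwat;  /* high water mark, i.e. buffer size (sb_hiwat) */
};

/*
 * Returns true when the receive path would sleep waiting for more data
 * instead of copying out what is already queued.
 */
bool
sk_would_block(const struct sk_rcvbuf *sb, size_t resid, int flags)
{
	assert(sb != NULL);
	if ((flags & SK_MSG_DONTWAIT) != 0)
		return false;   /* caller asked not to sleep */
	if (sb->cc >= resid)
		return false;   /* request can already be satisfied */
	if (sb->cc < sb->lowat)
		return true;    /* below the low water mark */
	/* MSG_WAITALL, and the whole request would fit in the buffer. */
	return (flags & SK_MSG_WAITALL) != 0 && resid <= sb->hiwat;
}
```

For the scenario above, after the first recv() the buffer holds 9000 bytes
with hiwat 10000, so sk_would_block() with resid 10000 and SK_MSG_WAITALL
returns true: the receiver sleeps without draining anything, and the free
space stays at 1000 bytes.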
> The receive buffer is almost full but has enough space to satisfy the
> MSG_WAITALL request without draining data to the user buffer, so soreceive
> waits for data. But the window was closed when the buffer was filled, and to
> avoid silly window syndrome it opens only when the available space is at
> least sb_hiwat/4 or maxseg:
>
> tcp_output():
>
>         /*
>          * Calculate receive window.  Don't shrink window,
>          * but avoid silly window syndrome.
>          */
>         if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
>             recwin < (long)tp->t_maxseg)
>                 recwin = 0;
>
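The same check can be tried with the numbers from the trace. The C sketch
below models only the quoted comparison; the real tcp_output() does more
(for example, it never shrinks an already advertised window).
sk_advertised_window() is a hypothetical name, and using the 16344 mss from
the trace as maxseg is an assumption.

```c
#include <assert.h>

/*
 * Receive window to advertise for a given amount of free buffer space:
 * space smaller than both hiwat/4 and one segment is reported as 0.
 */
long
sk_advertised_window(long space, long hiwat, long maxseg)
{
	assert(space >= 0 && hiwat > 0 && maxseg > 0);
	if (space < hiwat / 4 && space < maxseg)
		return 0;
	return space;
}
```

With 1000 bytes free, hiwat 10000 and maxseg 16344 the result is 0, which
matches the "win 0" ACKs in the trace. The window would reopen only once at
least 2500 bytes are free, and that never happens while the receiver sleeps
without draining the buffer.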
> so it is stuck and pending data is only sent via TCP window probes.
>
> It looks like the fix could be to remove this condition to block if
> MSG_WAITALL is set and it is possible to do the entire receive operation at
> once, like in the patch:
>
> http://people.freebsd.org/~trociny/uipc_socket.c.soreceive_generic.MSG_DONTWAIT.patch
>
> This works for me but I am not sure this is a correct solution.
>
> Note, the issue is not reproduced with soreceive_stream.
>
I do not understand what then happens for the recvfrom(2) call?
Would it get some error, or 0 as return and no data, or something else?

Also, what is the MT_CONTROL chunk about?



