From owner-freebsd-net@FreeBSD.ORG  Mon Jun 13 16:19:46 2011
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1B9761065670;
	Mon, 13 Jun 2011 16:19:46 +0000 (UTC)
	(envelope-from to.my.trociny@gmail.com)
Received: from mail-fx0-f54.google.com (mail-fx0-f54.google.com
	[209.85.161.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 4903D8FC08;
	Mon, 13 Jun 2011 16:19:44 +0000 (UTC)
Received: by fxm11 with SMTP id 11so4214848fxm.13
	for <multiple recipients>; Mon, 13 Jun 2011 09:19:44 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:from:to:cc:subject:sender:date:message-id
	:user-agent:mime-version:content-type;
	bh=fcvuK4pQB0uwAI1jGyBJI1L4cFh3hDkYJOvvUL6k304=;
	b=Yn49INjCAKib1Doe3iuh16WTW+R5feP8/oWuH7kd4o9BRoG4WLVQEXerRkanuVmKiQ
	EEiWRXWomypOZLwVkdALtuw3tFRa26FHYYxr0Tgf92/KbULk7UEdXGOoCSEQVSLgjy2c
	NLnDG1EtF3F4a2YfwYn0N6AcOFG6xhuXzImow=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=from:to:cc:subject:sender:date:message-id:user-agent:mime-version
	:content-type;
	b=aFnfka0I4Qt+/xxx42M2NNJjW3JUavO1+WhvRpeWKxXjBVzVoIIjUxp+qgKPL8ijFS
	18s6XTM3LTIex9M+Snx8qcynwlPg6qKu6kU4eJjS9TSDNXk4+wXsXd4wUgAUZewgcI3R
	DrcikNlayri44dbUXHFMFOX0ohQaZcqHLi/vQ=
Received: by 10.223.5.28 with SMTP id 28mr5277310fat.103.1307981983913;
	Mon, 13 Jun 2011 09:19:43 -0700 (PDT)
Received: from localhost ([95.69.172.154])
	by mx.google.com with ESMTPS id a18sm2122905fak.29.2011.06.13.09.19.41
	(version=TLSv1/SSLv3 cipher=OTHER);
	Mon, 13 Jun 2011 09:19:42 -0700 (PDT)
From: Mikolaj Golub <trociny@freebsd.org>
To: freebsd-net@freebsd.org
Sender: Mikolaj Golub <to.my.trociny@gmail.com>
Date: Mon, 13 Jun 2011 19:19:40 +0300
Message-ID: <86pqmhn1pf.fsf@kopusha.home.net>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Kostik Belousov <kib@FreeBSD.org>, Pawel Jakub Dawidek <pjd@FreeBSD.org>
Subject: Scenario to make recv(MSG_WAITALL) stuck
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 13 Jun 2011 16:19:46 -0000

Hi,

Below is a scenario how to make recv(2) with MSG_WAITALL flag get stuck.

(See http://people.freebsd.org/~trociny/test_MSG_WAITALL.4.c for the test code).

Let's the size of the receive buffer is SOBUF_SIZE (e.g. 10000 bytes).

On sender side do 2 send() requests:

1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10);

2) data of size equal to SOBUF_SIZE.

After this on receiver side do 2 recv() requests with MSG_WAITALL flag set:

1) recv() data of SOBUF_SIZE / 10 size;

2) recv() data of SOBUF_SIZE size;

The second recv() will last for very long time. In tcpdump one can observe
that the window is permanently stuck at 0 and pending data is only sent via
TCP window probes (so one byte every few seconds).

18:09:14.784698 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [S], seq 1907676797, win 65535, options [mss 16344,nop,wscale 3,sackOK,TS val 22207 ecr 0], length 0
18:09:14.784729 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [S.], seq 2298857585, ack 1907676798, win 10000, options [mss 16344,nop,wscale 3,sackOK,TS val 2718467987 ecr 22207], length 0
18:09:14.784749 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 0
18:09:14.785168 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [P.], seq 1:1001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 1000
18:09:14.785264 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 1001:10001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 9000
18:09:14.785280 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718467987 ecr 22207], length 0
18:09:19.784440 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 22707 ecr 2718467987], length 1
18:09:19.784480 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718468487 ecr 22707], length 0
18:09:24.784439 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 23207 ecr 2718468487], length 1
18:09:24.784472 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10002, win 0, options [nop,nop,TS val 2718468987 ecr 23207], length 0
18:09:29.784437 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10002:10003, ack 1, win 8960, options [nop,nop,TS val 23707 ecr 2718468987], length 1
18:09:29.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10003, win 0, options [nop,nop,TS val 2718469487 ecr 23707], length 0
18:09:34.784444 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10003:10004, ack 1, win 8960, options [nop,nop,TS val 24207 ecr 2718469487], length 1
18:09:34.784486 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10004, win 0, options [nop,nop,TS val 2718469987 ecr 24207], length 0
18:09:39.784443 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10004:10005, ack 1, win 8960, options [nop,nop,TS val 24707 ecr 2718469987], length 1
18:09:39.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10005, win 0, options [nop,nop,TS val 2718470487 ecr 24707], length 0
18:09:44.784442 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10005:10006, ack 1, win 8960, options [nop,nop,TS val 25207 ecr 2718470487], length 1
18:09:44.784477 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10006, win 0, options [nop,nop,TS val 2718470987 ecr 25207], length 0
...

I first noticed this issue with HAST and suspect other people observed it with
HAST too.

Below is explanation what is going on.

We totaly filled the receiver buffer with one SOBUF_SIZE/10 size request and
partial SOBUF_SIZE request. When the first request was processed we got
SOBUF_SIZE/10 free space. It was just enogh to recive the rest of bytes for
the second request, and the reciving thread went in soreceive_generic->sbwait
here:

        /*
         * If we have less data than requested, block awaiting more (subject
         * to any timeout) if:
         *   1. the current count is less than the low water mark, or
         *   2. MSG_WAITALL is set, and it is possible to do the entire
         *      receive operation at once if we block (resid <= hiwat).
         *   3. MSG_DONTWAIT is not set
         * If MSG_WAITALL is set but resid is larger than the receive buffer,
         * we have to do the receive in sections, and thus risk returning a
         * short count if a timeout or signal occurs after we start.
         */
        if (m == NULL || (((flags & MSG_DONTWAIT) == 0 &&
            so->so_rcv.sb_cc < uio->uio_resid) &&
            (so->so_rcv.sb_cc < so->so_rcv.sb_lowat ||
            ((flags & MSG_WAITALL) && uio->uio_resid <= so->so_rcv.sb_hiwat)) &&
            m->m_nextpkt == NULL && (pr->pr_flags & PR_ATOMIC) == 0)) {
                 ...
                 error = sbwait(&so->so_rcv);

recvbuf is almost full but has enough space to satisfy MSG_WAITALL request
without draining data to user buffer, and soreceive waits for data. But the
window was closed when the buffer was filled and to avoid silly window
syndrome it opens only when available space is larger than sb_hiwat/4 or
maxseg:

tcp_output():

        /*
         * Calculate receive window.  Don't shrink window,
         * but avoid silly window syndrome.
         */
        if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
            recwin < (long)tp->t_maxseg)
                recwin = 0;

so it is stuck and pending data is only sent via TCP window probes.

It looks like the fix could be to remove this condition to block if
MSG_WAITALL is set and it is possible to do the entire receive operation at
once, like in the patch:

http://people.freebsd.org/~trociny/uipc_socket.c.soreceive_generic.MSG_DONTWAIT.patch

This works for me but I am not sure this is a correct solution.

Note, the issue is not reproduced with soreceive_stream.

-- 
Mikolaj Golub