Date:      Fri, 10 Jun 2011 19:43:46 +0300
From:      Mikolaj Golub <trociny@freebsd.org>
To:        Maxim Sobolev <sobomax@FreeBSD.org>
Cc:        vadim_nuclight@mail.ru, Kostik Belousov <kib@FreeBSD.org>, svn-src-all@FreeBSD.org, Pawel Jakub Dawidek <pjd@FreeBSD.org>
Subject:   Re: svn commit: r222688 - head/sbin/hastd
Message-ID:  <864o3xab7h.fsf@kopusha.home.net>
In-Reply-To: <4DF11183.3060806@FreeBSD.org> (Maxim Sobolev's message of "Thu,  09 Jun 2011 11:31:31 -0700")
References:  <201106041601.p54G1Ut7016697@svn.freebsd.org> <BA66495E-AED3-459F-A5CD-69B91DB359BC@lists.zabbadoz.net> <4DEA653F.7070503@FreeBSD.org> <201106061057.p56Av3u7037614@kernblitz.nuclight.avtf.net> <4DED1CC5.1070001@FreeBSD.org> <86wrgvkv67.fsf@kopusha.home.net> <4DF11183.3060806@FreeBSD.org>



On Thu, 09 Jun 2011 11:31:31 -0700 Maxim Sobolev wrote:

 MS> On 6/9/2011 6:10 AM, Mikolaj Golub wrote:
 >>   >>>  Hmm, not sure what exactly is wrong? Sender does 3 writes to the TCP
 >>   >>>  socket - 32k, 32k and 1071 bytes, while receiver does one
 >>   >>>  recv(MSG_WAITALL) with the size of 66607. So I suspect sender's kernel
 >>   >>>  does deliver two 32k packets and fills up receiver's buffer or
 >>   >>>  something. And the remaining 1071 bytes stay somewhere in sender's
 >>   >>>  kernel indefinitely, while recv() cannot complete in receiver's. Using
 >>   >>>  the same size when doing recv() solves the issue for me.
 >>
 >> With MSG_WAITALL, if the data to receive is larger than the receive
 >> buffer, then after part of the data has been received it is drained to the
 >> user buffer and the protocol is notified (a window update is sent) that
 >> there is again space in the receive buffer. So, normally, there should not
 >> be an issue with the scenario described above. But there was a race in
 >> soreceive_generic(), which I believe I fixed in r222454, where the
 >> connection could stall in sbwait(). Do you still observe the issue with
 >> only r222454 applied?

 MS> The patch makes things slightly better, but it appears that there are
 MS> still some "magic" buffer sizes that get stuck somewhere, particularly
 MS> 66607 bytes in my case. You can probably reproduce the issue easily by
 MS> creating a large disk with data of various kinds (e.g. FreeBSD UFS with
 MS> source/object code), enabling compression and setting the block size to
 MS> 128kb. Then, at least if you run this scenario over a WAN, it should get
 MS> stuck from time to time when hitting that "magic" size. One could
 MS> probably write a simple test case in C with the server part sending 32k,
 MS> 32k and 1071 bytes and the receiver reading the whole message with
 MS> WAITALL. Unfortunately I am overloaded right now, so it's unlikely that I
 MS> will do it.
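
A rough, untested sketch of such a test case could look like this (the port
number, the loopback address and all identifiers below are placeholders of
mine, not anything from hastd): the sender writes 32k + 32k + 1071 bytes and
the receiver asks for all 66607 bytes with a single recv(..., MSG_WAITALL):

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

#define TESTPORT 7772                    /* placeholder port */
#define TOTAL    (32768 + 32768 + 1071)  /* 66607 bytes */

static void
sender(int fd)
{
        static char buf[TOTAL];
        size_t chunks[3] = { 32768, 32768, 1071 };
        size_t off, i;
        ssize_t n;

        memset(buf, 'x', sizeof(buf));
        for (off = i = 0; i < 3; i++) {
                /* Three separate writes: 32k, 32k and 1071 bytes. */
                n = send(fd, buf + off, chunks[i], 0);
                if (n == -1)
                        err(1, "send");
                off += n;
        }
}

static void
receiver(int fd)
{
        static char buf[TOTAL];
        ssize_t n;

        /* One recv() for the whole 66607-byte message. */
        n = recv(fd, buf, sizeof(buf), MSG_WAITALL);
        if (n == -1)
                err(1, "recv");
        warnx("received %zd of %d bytes", n, TOTAL);
}

int
main(void)
{
        struct sockaddr_in sin;
        int lfd, sfd, cfd, one;

        one = 1;
        if ((lfd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
                err(1, "socket");
        (void)setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(TESTPORT);
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        if (bind(lfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
                err(1, "bind");
        if (listen(lfd, 1) == -1)
                err(1, "listen");

        switch (fork()) {
        case -1:
                err(1, "fork");
        case 0:                         /* child: the sending side */
                if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) == -1)
                        err(1, "socket");
                if (connect(cfd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
                        err(1, "connect");
                sender(cfd);
                pause();                /* keep the connection open */
                _exit(0);
        default:                        /* parent: the receiving side */
                if ((sfd = accept(lfd, NULL, NULL)) == -1)
                        err(1, "accept");
                receiver(sfd);
        }
        return (0);
}

The child keeps the connection open with pause() so that the receiver's
MSG_WAITALL is not cut short by a FIN; whether this actually reproduces the
stall will probably depend on the receive buffer size and the link, as Maxim
notes.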

I am observing the HAST connection getting stuck in my tests. I don't know
whether this is the same case as yours, but investigating mine exposed two
problems:

1) Automatic receive buffer sizing does not work for half-open sockets (when
one direction has been shut down).

2) It looks like, with the combination of a small receive buffer and
MSG_WAITALL, it is possible to get into a state where the receive window,
after shrinking to 0, is never reopened. The receive window stays stuck at 0
and the pending data is only delivered via TCP window probes (i.e. one byte
every few seconds).

Concerning (1): hastd sends data only in one direction, so the other direction
is closed with shutdown(2). The receiving socket is therefore in the FIN_WAIT_2
state. As I see from the netstat output, its receive buffer (R-HIWA) is 71680
and does not grow:

Proto Recv-Q Send-Q Local Address          Foreign Address        R-MBUF S-MBUF R-CLUS S-CLUS R-HIWA S-HIWA R-LOWA S-LOWA R-BCNT S-BCNT R-BMAX S-BMAX   rexmt persist    keep    2msl  delack rcvtime
tcp4   58539      0 192.168.1.103.7772     192.168.1.103.61333        20      0     20      0  71680  43008      1   2048  82944      0 262144 262144    0.00    0.00 1799.42    0.00    0.00    0.58
tcp4       0  92160 192.168.1.103.61333    192.168.1.103.7772          0     27      0     24      0  92160      0   2048      0 103168      0 262144    0.00    4.52 1799.52    0.00    0.00    0.48
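
If I read the direction setup correctly (this is an assumption about what the
proto_send()/proto_recv() calls with NULL boil down to, not the actual hastd
code), it amounts to something like the sketch below, which is what parks the
receiving socket in FIN_WAIT_2:

#include <sys/socket.h>
#include <err.h>

/*
 * Sketch only.  The receive-only side closes its write direction and so
 * sends its FIN right away; once the peer ACKs it, the socket stays in
 * FIN_WAIT_2 for the lifetime of the connection, as in the netstat output
 * above.
 */
static void
set_recv_only(int fd)
{

        if (shutdown(fd, SHUT_WR) == -1)
                warn("shutdown(SHUT_WR)");      /* hastd only warns here */
}

static void
set_send_only(int fd)
{

        if (shutdown(fd, SHUT_RD) == -1)
                warn("shutdown(SHUT_RD)");      /* hastd only warns here */
}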

When I remove the socket shutdown from the HAST code, so that the sockets stay
in the ESTABLISHED state, the receive buffer grows, approaching R-BMAX:

Proto Recv-Q Send-Q Local Address          Foreign Address        R-MBUF S-MBUF R-CLUS S-CLUS R-HIWA S-HIWA R-LOWA S-LOWA R-BCNT S-BCNT R-BMAX S-BMAX   rexmt persist    keep    2msl  delack rcvtime
tcp4       0      0 192.168.1.103.7772     192.168.1.103.18457         0      0      0      0 219136  43008      1   2048      0      0 262144 262144    0.00    0.00 1793.15    0.00    0.00    6.85
tcp4       0      0 192.168.1.103.18457    192.168.1.103.7772          0      0      0      0  71680 198656      1   2048      0      0 262144 262144    0.00    0.00 1793.25    0.00    0.00    6.75

and I can't reproduce issue (2) in this case.

Maxim, could you please try the attached patch and see if it fixes the issue
for you? The patch removes the code that shuts down the unused direction
(effectively, it reverts r220271).

Concerning issue (2), below are the results of a preliminary investigation :-).

When the connection stalls, the tcpdump output looks like this (the sender
retries one byte at a time and every ACK from the receiver advertises win 0):

09:56:53.600274 IP 192.168.1.103.51645 > 192.168.1.103.7772: Flags [.], seq 515310314:515310315, ack 1135313145, win 8896, options [nop,nop,TS val 461359 ecr 1842953560], length 1
09:56:53.600323 IP 192.168.1.103.7772 > 192.168.1.103.51645: Flags [.], ack 1, win 0, options [nop,nop,TS val 1842954060 ecr 461359], length 0
09:56:58.600265 IP 192.168.1.103.51645 > 192.168.1.103.7772: Flags [.], seq 1:2, ack 1, win 8896, options [nop,nop,TS val 461859 ecr 1842954060], length 1
09:56:58.600322 IP 192.168.1.103.7772 > 192.168.1.103.51645: Flags [.], ack 2, win 0, options [nop,nop,TS val 1842954560 ecr 461859], length 0
...

kgdb shows that the receiving thread is sleeping in soreceive_generic() ->
sbwait():

        /*
         * If we have less data than requested, block awaiting more (subject
         * to any timeout) if:
         *   1. the current count is less than the low water mark, or
         *   2. MSG_WAITALL is set, and it is possible to do the entire
         *      receive operation at once if we block (resid <= hiwat).
         *   3. MSG_DONTWAIT is not set
         * If MSG_WAITALL is set but resid is larger than the receive buffer,
         * we have to do the receive in sections, and thus risk returning a
         * short count if a timeout or signal occurs after we start.
         */
        if (m == NULL || (((flags & MSG_DONTWAIT) == 0 &&
            so->so_rcv.sb_cc < uio->uio_resid) &&
            (so->so_rcv.sb_cc < so->so_rcv.sb_lowat ||
            ((flags & MSG_WAITALL) && uio->uio_resid <= so->so_rcv.sb_hiwat)) &&
            m->m_nextpkt == NULL && (pr->pr_flags & PR_ATOMIC) == 0)) {
                 ...
                 error = sbwait(&so->so_rcv);

And here are some parameters from gdb:

uio:
  uio_resid = 65536

so->so_rcv:
  sb_cc = 63318, 
  sb_hiwat = 71680, 
  sb_mbcnt = 89600, 
  sb_mcnt = 22, 
  sb_ccnt = 21, 
  sb_mbmax = 262144, 
  sb_lowat = 1, 

so->so_pcb->inp_ppcb:
  t_state = 9, 
  t_flags = 525300, 
  rcv_nxt = 515310391, 
  rcv_adv = 515310391, 
  rcv_wnd = 8264, 
  rcv_up = 515310390, 
  t_maxseg = 14336, 

The request (uio_resid) is 65536 bytes, 63318 bytes (sb_cc) have been
received, MSG_WAITALL is set, and uio_resid is less than the buffer size
(sb_hiwat), so the thread is in sbwait() waiting for the rest of the data
(2218 bytes).

I suppose that although there is space in the buffer (according to sbspace(),
recwin is sb_hiwat - sb_cc = 8362), this value is too small for the window to
be reopened:

tcp_output():

        /*
         * Calculate receive window.  Don't shrink window,
         * but avoid silly window syndrome.
         */
        if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
            recwin < (long)tp->t_maxseg)
                recwin = 0;
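
Plugging in the values from the stalled connection: recwin = sb_hiwat - sb_cc
= 71680 - 63318 = 8362, while sb_hiwat / 4 = 17920 and t_maxseg = 14336;
recwin is below both thresholds, so tcp_output() keeps advertising a zero
window.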

So (if I have not missed something) we have a situation where the receive
buffer is almost full but still has enough space to satisfy the MSG_WAITALL
request without draining data to the user buffer, so soreceive() keeps
waiting for more data, while the sender can't send it because the window is 0,
and the window will not be reopened until the data is drained to the user
buffer...

I am going to check this by writing a test case and will send the results to
net@.

-- 
Mikolaj Golub


Content-Type: text/x-patch
Content-Disposition: attachment; filename=hastd.no_shutdown.patch

Index: sbin/hastd/secondary.c
===================================================================
--- sbin/hastd/secondary.c	(revision 222687)
+++ sbin/hastd/secondary.c	(working copy)
@@ -183,10 +183,6 @@ init_remote(struct hast_resource *res, struct nv *
 	unsigned char *map;
 	size_t mapsize;
 
-	/* Setup direction. */
-	if (proto_send(res->hr_remoteout, NULL, 0) == -1)
-		pjdlog_errno(LOG_WARNING, "Unable to set connection direction");
-
 	map = NULL;
 	mapsize = 0;
 	nvout = nv_alloc();
@@ -351,9 +347,6 @@ init_remote(struct hast_resource *res, struct nv *
 	if (map != NULL)
 		free(map);
 	nv_free(nvout);
-	/* Setup direction. */
-	if (proto_recv(res->hr_remotein, NULL, 0) == -1)
-		pjdlog_errno(LOG_WARNING, "Unable to set connection direction");
 	if (res->hr_secondary_localcnt > res->hr_primary_remotecnt &&
 	     res->hr_primary_localcnt > res->hr_secondary_remotecnt) {
 		/* Exit on split-brain. */
Index: sbin/hastd/primary.c
===================================================================
--- sbin/hastd/primary.c	(revision 222687)
+++ sbin/hastd/primary.c	(working copy)
@@ -726,11 +726,6 @@ init_remote(struct hast_resource *res, struct prot
 		(void)hast_activemap_flush(res);
 	}
 	nv_free(nvin);
-	/* Setup directions. */
-	if (proto_send(out, NULL, 0) == -1)
-		pjdlog_errno(LOG_WARNING, "Unable to set connection direction");
-	if (proto_recv(in, NULL, 0) == -1)
-		pjdlog_errno(LOG_WARNING, "Unable to set connection direction");
 	pjdlog_info("Connected to %s.", res->hr_remoteaddr);
 	if (inp != NULL && outp != NULL) {
 		*inp = in;



