Date: Fri, 10 Jun 2011 19:43:46 +0300
From: Mikolaj Golub <trociny@freebsd.org>
To: Maxim Sobolev <sobomax@FreeBSD.org>
Cc: vadim_nuclight@mail.ru, Kostik Belousov <kib@FreeBSD.org>,
        svn-src-all@FreeBSD.org, Pawel Jakub Dawidek <pjd@FreeBSD.org>
Subject: Re: svn commit: r222688 - head/sbin/hastd
Message-ID: <864o3xab7h.fsf@kopusha.home.net>
In-Reply-To: <4DF11183.3060806@FreeBSD.org> (Maxim Sobolev's message of
        "Thu, 09 Jun 2011 11:31:31 -0700")
References: <201106041601.p54G1Ut7016697@svn.freebsd.org>
        <BA66495E-AED3-459F-A5CD-69B91DB359BC@lists.zabbadoz.net>
        <4DEA653F.7070503@FreeBSD.org>
        <201106061057.p56Av3u7037614@kernblitz.nuclight.avtf.net>
        <4DED1CC5.1070001@FreeBSD.org> <86wrgvkv67.fsf@kopusha.home.net>
        <4DF11183.3060806@FreeBSD.org>
On Thu, 09 Jun 2011 11:31:31 -0700 Maxim Sobolev wrote:

MS> On 6/9/2011 6:10 AM, Mikolaj Golub wrote:
>>
>> >>> Hmm, not sure what exactly is wrong? The sender does 3 writes to the
>> >>> TCP socket (32k, 32k and 1071 bytes), while the receiver does one
>> >>> recv(MSG_WAITALL) with a size of 66607. So I suspect the sender's
>> >>> kernel delivers the two 32k packets and fills up the receiver's
>> >>> buffer or something, and the remaining 1071 bytes stay somewhere in
>> >>> the sender's kernel indefinitely, while recv() cannot complete on
>> >>> the receiver's side. Using the same size when doing recv() solves
>> >>> the issue for me.
>>
>> With MSG_WAITALL, if the data to receive is larger than the receive
>> buffer, then after part of the data has been received it is drained to
>> the user buffer and the protocol is notified (a window update is sent)
>> that there is space in the receive buffer again. So, normally, there
>> should be no issue with the scenario described above. But there was a
>> race in soreceive_generic(), which I believe I fixed in r222454, where
>> the connection could stall in sbwait. Do you still observe the issue
>> with only r222454 applied?

MS> The patch makes things slightly better, but it appears that there are
MS> still some "magic" buffer sizes that get stuck somewhere, particularly
MS> 66607 bytes in my case. You can probably reproduce the issue easily by
MS> creating a large disk with data of various kinds (a FreeBSD UFS with
MS> source/object code, for example), enabling compression and setting the
MS> block size to 128kb. Then, at least if you run this scenario over a
MS> WAN, it should get stuck from time to time when hitting that "magic"
MS> size. One could probably easily write a simple test case in C with the
MS> server part sending 32k, 32k and 1071 bytes and the receiver reading
MS> the whole message with WAITALL. Unfortunately I am overloaded right
MS> now, so it's unlikely that I will get to it.
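For what it's worth, such a test case might look like the sketch below.
This is mine, not something from the thread, and untested: the loopback
setup and the SO_RCVBUF value (71680, the R-HIWA figure from the stuck
connection shown later) are assumptions, and whether the stall actually
triggers will depend on buffer sizes, t_maxseg and timing. If it does
trigger, the program simply hangs in recv()/write().

/*
 * Sketch of Maxim's suggested reproducer: the sender does three writes
 * (32k, 32k and 1071 bytes); the receiver does a single recv() of the
 * total size with MSG_WAITALL.
 */
#include <sys/types.h>
#include <sys/socket.h>

#include <netinet/in.h>
#include <arpa/inet.h>

#include <err.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TOTAL   (32768 + 32768 + 1071)  /* 66607, the "magic" size */

static char buf[TOTAL];

int
main(void)
{
        struct sockaddr_in sin;
        socklen_t len;
        int lsn, snd, rcv, rcvbuf;
        ssize_t n;

        if ((lsn = socket(PF_INET, SOCK_STREAM, 0)) == -1)
                err(1, "socket");
        /*
         * Fix the receive buffer slightly above the request size (71680
         * is a guess, taken from R-HIWA on the stuck connection).
         * Setting it on the listening socket is also intended to disable
         * automatic buffer sizing on the accepted socket.
         */
        rcvbuf = 71680;
        if (setsockopt(lsn, SOL_SOCKET, SO_RCVBUF, &rcvbuf,
            sizeof(rcvbuf)) == -1)
                err(1, "setsockopt");
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sin.sin_port = 0;               /* any free port */
        if (bind(lsn, (struct sockaddr *)&sin, sizeof(sin)) == -1)
                err(1, "bind");
        if (listen(lsn, 1) == -1)
                err(1, "listen");
        len = sizeof(sin);
        if (getsockname(lsn, (struct sockaddr *)&sin, &len) == -1)
                err(1, "getsockname");

        if ((snd = socket(PF_INET, SOCK_STREAM, 0)) == -1)
                err(1, "socket");
        if (connect(snd, (struct sockaddr *)&sin, sizeof(sin)) == -1)
                err(1, "connect");
        if ((rcv = accept(lsn, NULL, NULL)) == -1)
                err(1, "accept");

        switch (fork()) {
        case -1:
                err(1, "fork");
        case 0:
                /* Child, the "server" side: three writes. */
                if (write(snd, buf, 32768) != 32768 ||
                    write(snd, buf, 32768) != 32768 ||
                    write(snd, buf, 1071) != 1071)
                        err(1, "write");
                _exit(0);
        default:
                /* Parent: one recv() of the whole message. */
                n = recv(rcv, buf, TOTAL, MSG_WAITALL);
                if (n == -1)
                        err(1, "recv");
                printf("received %zd of %d bytes\n", n, TOTAL);
        }
        return (0);
}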
I am observing the HAST connection getting stuck in my tests. I don't
know if it is the same case as yours, but investigating mine exposed two
problems:

1) Automatic receive buffer sizing does not work for half-open sockets
(i.e. when one direction has been shut down).

2) It looks like the combination of a small receive buffer and
MSG_WAITALL can produce a state where the receive window, having shrunk
to 0, is never reopened. The window stays stuck at 0 and the pending
data is delivered only via TCP window probes (so one byte every few
seconds).

Concerning (1): hastd sends data in only one direction on each socket,
so the other direction is closed with shutdown(2), leaving the receiving
socket in the FIN_WAIT_2 state. As I see from the netstat output, the
receive buffer (R-HIWA) is 71680 and does not grow:

Proto Recv-Q Send-Q Local Address       Foreign Address     R-MBUF S-MBUF R-CLUS S-CLUS R-HIWA S-HIWA R-LOWA S-LOWA R-BCNT S-BCNT R-BMAX S-BMAX  rexmt persist    keep  2msl delack rcvtime
tcp4   58539      0 192.168.1.103.7772  192.168.1.103.61333     20      0     20      0  71680  43008      1   2048  82944      0 262144 262144   0.00    0.00 1799.42  0.00   0.00    0.58
tcp4       0  92160 192.168.1.103.61333 192.168.1.103.7772       0     27      0     24      0  92160      0   2048      0 103168      0 262144   0.00    4.52 1799.52  0.00   0.00    0.48

When I remove the socket shutdown from the HAST code, so that the
sockets stay in the ESTABLISHED state, the receive buffer grows almost
to R-BMAX:

Proto Recv-Q Send-Q Local Address       Foreign Address     R-MBUF S-MBUF R-CLUS S-CLUS R-HIWA S-HIWA R-LOWA S-LOWA R-BCNT S-BCNT R-BMAX S-BMAX  rexmt persist    keep  2msl delack rcvtime
tcp4       0      0 192.168.1.103.7772  192.168.1.103.18457      0      0      0      0 219136  43008      1   2048      0      0 262144 262144   0.00    0.00 1793.15  0.00   0.00    6.85
tcp4       0      0 192.168.1.103.18457 192.168.1.103.7772       0      0      0      0  71680 198656      1   2048      0      0 262144 262144   0.00    0.00 1793.25  0.00   0.00    6.75

and I cannot reproduce issue (2) in this case.

Maxim, could you please try the attached patch and check whether it
fixes the issue for you? The patch removes the code that shuts down the
unused direction (in fact, it reverts r220271).

Concerning issue (2), below are the results of a preliminary
investigation :-). When the connection stalls, the tcpdump output looks
like this:

09:56:53.600274 IP 192.168.1.103.51645 > 192.168.1.103.7772: Flags [.], seq 515310314:515310315, ack 1135313145, win 8896, options [nop,nop,TS val 461359 ecr 1842953560], length 1
09:56:53.600323 IP 192.168.1.103.7772 > 192.168.1.103.51645: Flags [.], ack 1, win 0, options [nop,nop,TS val 1842954060 ecr 461359], length 0
09:56:58.600265 IP 192.168.1.103.51645 > 192.168.1.103.7772: Flags [.], seq 1:2, ack 1, win 8896, options [nop,nop,TS val 461859 ecr 1842954060], length 1
09:56:58.600322 IP 192.168.1.103.7772 > 192.168.1.103.51645: Flags [.], ack 2, win 0, options [nop,nop,TS val 1842954560 ecr 461859], length 0
...

kgdb shows that the receiving thread is in soreceive_generic->sbwait:

        /*
         * If we have less data than requested, block awaiting more (subject
         * to any timeout) if:
         *   1. the current count is less than the low water mark, or
         *   2. MSG_WAITALL is set, and it is possible to do the entire
         *      receive operation at once if we block (resid <= hiwat).
         *   3. MSG_DONTWAIT is not set
         * If MSG_WAITALL is set but resid is larger than the receive buffer,
         * we have to do the receive in sections, and thus risk returning a
         * short count if a timeout or signal occurs after we start.
         */
        if (m == NULL || (((flags & MSG_DONTWAIT) == 0 &&
            so->so_rcv.sb_cc < uio->uio_resid) &&
            (so->so_rcv.sb_cc < so->so_rcv.sb_lowat ||
            ((flags & MSG_WAITALL) && uio->uio_resid <= so->so_rcv.sb_hiwat)) &&
            m->m_nextpkt == NULL && (pr->pr_flags & PR_ATOMIC) == 0)) {
                ...
                error = sbwait(&so->so_rcv);

And here are some parameters from gdb:

uio:
    uio_resid = 65536

so->so_rcv:
    sb_cc = 63318, sb_hiwat = 71680, sb_mbcnt = 89600, sb_mcnt = 22,
    sb_ccnt = 21, sb_mbmax = 262144, sb_lowat = 1,

so->so_pcb->inp_ppcb:
    t_state = 9, t_flags = 525300, rcv_nxt = 515310391,
    rcv_adv = 515310391, rcv_wnd = 8264, rcv_up = 515310390,
    t_maxseg = 14336,

The request (uio_resid) is 65536 bytes, of which 63318 (sb_cc) have been
received; MSG_WAITALL is set and uio_resid is less than the buffer size
(sb_hiwat), so the thread is in sbwait() waiting for the remaining 2218
bytes.

I suppose that although there is space in the buffer (according to
sbspace(), recwin is sb_hiwat - sb_cc = 8362), this value is too small
to reopen the window, because of the silly window avoidance in
tcp_output():

        /*
         * Calculate receive window.  Don't shrink window,
         * but avoid silly window syndrome.
         */
        if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
            recwin < (long)tp->t_maxseg)
                recwin = 0;
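To make the failure mode explicit, here is the gdb data above plugged
into that check (arithmetic mine):

    recwin       = sb_hiwat - sb_cc = 71680 - 63318 =  8362
    sb_hiwat / 4 = 71680 / 4        =                 17920
    t_maxseg     =                                    14336

    8362 < 17920 && 8362 < 14336  =>  recwin = 0

Both conditions hold, so tcp_output() advertises a zero window even
though 8362 bytes of buffer space are free.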
So (if I have not missed something) we have a situation where the
receive buffer is almost full, but still has enough space to satisfy the
MSG_WAITALL request without draining data to the user buffer; soreceive
therefore keeps waiting for data, but the sender cannot send it because
the window is 0, and the window will not be reopened until the data is
drained to the user buffer...

I am going to verify this by writing a test case and will send the
results to net@.

-- 
Mikolaj Golub

Content-Type: text/x-patch
Content-Disposition: attachment; filename=hastd.no_shutdown.patch

Index: sbin/hastd/secondary.c
===================================================================
--- sbin/hastd/secondary.c	(revision 222687)
+++ sbin/hastd/secondary.c	(working copy)
@@ -183,10 +183,6 @@ init_remote(struct hast_resource *res, struct nv *
 	unsigned char *map;
 	size_t mapsize;
 
-	/* Setup direction. */
-	if (proto_send(res->hr_remoteout, NULL, 0) == -1)
-		pjdlog_errno(LOG_WARNING, "Unable to set connection direction");
-
 	map = NULL;
 	mapsize = 0;
 	nvout = nv_alloc();
@@ -351,9 +347,6 @@ init_remote(struct hast_resource *res, struct nv *
 	if (map != NULL)
 		free(map);
 	nv_free(nvout);
-	/* Setup direction. */
-	if (proto_recv(res->hr_remotein, NULL, 0) == -1)
-		pjdlog_errno(LOG_WARNING, "Unable to set connection direction");
 	if (res->hr_secondary_localcnt > res->hr_primary_remotecnt &&
 	    res->hr_primary_localcnt > res->hr_secondary_remotecnt) {
 		/* Exit on split-brain. */
Index: sbin/hastd/primary.c
===================================================================
--- sbin/hastd/primary.c	(revision 222687)
+++ sbin/hastd/primary.c	(working copy)
@@ -726,11 +726,6 @@ init_remote(struct hast_resource *res, struct prot
 		(void)hast_activemap_flush(res);
 	}
 	nv_free(nvin);
-	/* Setup directions. */
-	if (proto_send(out, NULL, 0) == -1)
-		pjdlog_errno(LOG_WARNING, "Unable to set connection direction");
-	if (proto_recv(in, NULL, 0) == -1)
-		pjdlog_errno(LOG_WARNING, "Unable to set connection direction");
 	pjdlog_info("Connected to %s.", res->hr_remoteaddr);
 	if (inp != NULL && outp != NULL) {
 		*inp = in;
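(In case it saves anyone a step: the Index: paths in the patch are
relative to the top of the source tree, so on a standard checkout it
should apply from /usr/src with "patch -p0 < hastd.no_shutdown.patch",
after which hastd can be rebuilt and reinstalled from sbin/hastd in the
usual way with make and make install.)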