Date: Sun, 25 Apr 2010 14:17:15 +0300
From: Mikolaj Golub <to.my.trociny@gmail.com>
To: Mikolaj Golub <to.my.trociny@gmail.com>
Cc: freebsd-fs <freebsd-fs@freebsd.org>, Pawel Jakub Dawidek <pjd@FreeBSD.org>
Subject: Re: HAST: primary might get stuck when there are connectivity problems with secondary
Message-ID: <86tyqzeq84.fsf@kopusha.onet>
In-Reply-To: <868w8dgk4e.fsf@kopusha.onet> (Mikolaj Golub's message of "Sat, 24 Apr 2010 14:33:53 +0300")
References: <86r5m9dvqf.fsf@zhuzha.ua1> <20100423062950.GD1670@garage.freebsd.pl> <86k4rye33e.fsf@zhuzha.ua1> <20100424073031.GD3067@garage.freebsd.pl> <868w8dgk4e.fsf@kopusha.onet>
[-- Attachment #1 --]
> From the code I don't see how hast_proto_recv_hdr() may timeout if the
> connection is alive, have I missed something?
I experimented with adding code that sets the SO_RCVTIMEO socket option
(see the attached patch), and it fixes this issue. After the timeout, the
worker on the secondary is restarted with the error:
Apr 25 13:06:45 hastb hastd: [storage] (secondary) Unable to receive request header: Resource temporarily unavailable.
Apr 25 13:06:45 hastb hastd: [storage] (secondary) Worker process (pid=1243) exited ungracefully: status=19200.
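"Resource temporarily unavailable" is the strerror() text for EAGAIN, which
is what recv() reports when SO_RCVTIMEO expires. A minimal standalone sketch
of that behaviour follows. It is not the hastd code: set_recv_timeout() and
idle_recv_errno() are hypothetical helpers, and a 100 ms timeout on a
socketpair stands in for the 300 second timeout on the TCP connection.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of hastd): set a receive timeout on fd.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
set_recv_timeout(int fd, long sec, long usec)
{
	struct timeval tv;

	tv.tv_sec = sec;
	tv.tv_usec = usec;
	return (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)));
}

/*
 * Hypothetical demo: recv() on an idle socket with a 100 ms timeout.
 * Returns the errno left by the timed-out recv(), or -1 if the setup
 * failed or recv() unexpectedly returned data.
 */
static int
idle_recv_errno(void)
{
	int sv[2], error;
	char buf[16];
	ssize_t done;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);
	if (set_recv_timeout(sv[0], 0, 100000) == -1) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}
	/* Nothing is ever written to sv[1], so this recv() has to time out. */
	done = recv(sv[0], buf, sizeof(buf), MSG_WAITALL);
	error = (done == -1) ? errno : -1;
	close(sv[0]);
	close(sv[1]);
	return (error);
}
```

Calling idle_recv_errno() should give EAGAIN, the same error the secondary
logs above.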
On the other hand, when the FS is idle (there is no I/O at all), the worker
is restarted too, and the primary is not connected to the secondary
until some I/O appears. So it might not look very nice :-)
Also note that I had to modify proto_common_recv() to get the timeout
working. After a timeout, recv() sets errno to EWOULDBLOCK, which has the
same value as EAGAIN on FreeBSD. The current proto_common_recv() restarts
recv() if EAGAIN is returned, so the timeout never reaches the caller.
--
Mikolaj Golub
[-- Attachment #2 --]
Index: sbin/hastd/proto_common.c
===================================================================
--- sbin/hastd/proto_common.c (revision 207185)
+++ sbin/hastd/proto_common.c (working copy)
@@ -76,7 +76,7 @@ proto_common_recv(int fd, unsigned char *data, siz
do {
done = recv(fd, data, size, MSG_WAITALL);
- } while (done == -1 && errno == EAGAIN);
+ } while (done == -1 && errno == EINTR);
if (done == 0)
return (ENOTCONN);
else if (done < 0)
Index: sbin/hastd/proto_tcp4.c
===================================================================
--- sbin/hastd/proto_tcp4.c (revision 207185)
+++ sbin/hastd/proto_tcp4.c (working copy)
@@ -31,6 +31,7 @@
__FBSDID("$FreeBSD$");
#include <sys/param.h> /* MAXHOSTNAMELEN */
+#include <sys/time.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
@@ -203,7 +204,7 @@ tcp4_common_setup(const char *addr, void **ctxp, i
sizeof(val)) == -1) {
pjdlog_warning("Unable to set receive buffer size on %s", addr);
}
-
+
tctx->tc_side = side;
tctx->tc_magic = TCP4_CTX_MAGIC;
*ctxp = tctx;
@@ -214,8 +215,23 @@ tcp4_common_setup(const char *addr, void **ctxp, i
static int
tcp4_client(const char *addr, void **ctxp)
{
+ struct tcp4_ctx *tctx;
+ struct timeval tv;
+ int ret;
- return (tcp4_common_setup(addr, ctxp, TCP4_SIDE_CLIENT));
+ if ((ret = tcp4_common_setup(addr, ctxp, TCP4_SIDE_CLIENT)) != 0)
+ return (ret);
+
+ tctx = *ctxp;
+
+ tv.tv_sec = 300;
+ tv.tv_usec = 0;
+ if (setsockopt(tctx->tc_fd, SOL_SOCKET, SO_RCVTIMEO, &tv,
+ sizeof(tv)) == -1) {
+ pjdlog_warning("Unable to set receive timeout on %s", addr);
+ }
+
+ return (0);
}
static int
@@ -273,6 +289,7 @@ tcp4_accept(void *ctx, void **newctxp)
{
struct tcp4_ctx *tctx = ctx;
struct tcp4_ctx *newtctx;
+ struct timeval tv;
socklen_t fromlen;
int ret;
@@ -294,6 +311,13 @@ tcp4_accept(void *ctx, void **newctxp)
return (ret);
}
+ tv.tv_sec = 300;
+ tv.tv_usec = 0;
+ if (setsockopt(newtctx->tc_fd, SOL_SOCKET, SO_RCVTIMEO, &tv,
+ sizeof(tv)) == -1) {
+ pjdlog_debug(2, "Unable to set receive timeout");
+ }
+
newtctx->tc_side = TCP4_SIDE_SERVER_WORK;
newtctx->tc_magic = TCP4_CTX_MAGIC;
*newctxp = newtctx;
