Date: Wed, 21 Apr 2010 12:02:16 +0300 From: Mikolaj Golub <to.my.trociny@gmail.com> To: freebsd-fs <freebsd-fs@freebsd.org> Subject: HAST: primary might get stuck when there are connectivity problems with secondary Message-ID: <86r5m9dvqf.fsf@zhuzha.ua1>
next in thread | raw e-mail | index | archive | help
[-- Attachment #1 --]
I can make HAST primary get stuck making the secondary not accessible (network
packets are lost) for some period of time. I run HAST in VirtualBox hosts, so
to emulate network outage I just change bridge interface in VirtualBox
configuration.
Below are details for one example.
On the primary before the outage we have:
sockstat:
root hastd 1571 10 tcp4 172.20.66.201:41841 172.20.66.202:8457
root hastd 1571 11 tcp4 172.20.66.201:57596 172.20.66.202:8457
During the outage and after it sockstat shows the same, while netstat shows:
tcp4 0 0 172.20.66.201.57596 172.20.66.202.8457 ESTABLISHED
tcp4 0 8307 172.20.66.201.41841 172.20.66.202.8457 ESTABLISHED
(note non zero value for send buffer) and then later
tcp4 0 0 172.20.66.201.57596 172.20.66.202.8457 ESTABLISHED
tcp4 0 0 172.20.66.201.41841 172.20.66.202.8457 CLOSED
Restoring network after this changes nothing. Primary gets stuck. No messages
in the log and "dirty" in status output does not change:
[root@hasta ~]# hastctl status
storage:
role: primary
provname: storage
localpath: /dev/ad4
extentsize: 2097152
keepdirty: 64
remoteaddr: 172.20.66.202
replication: memsync
status: complete
dirty: 2097152 bytes
On the secondary we have all this time:
tcp4 0 0 172.20.66.202.8457 172.20.66.201.57596 ESTABLISHED
tcp4 0 0 172.20.66.202.8457 172.20.66.201.41841 ESTABLISHED
The last messages in log:
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Request received from the kernel: READ(13565952, 65536).
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Moving request to the send queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Got free request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Waiting for request from the kernel.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: (0x28411bc0) Got request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: (0x28411bc0) Moving request to the done queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: Taking request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: (0x28411bc0) Got request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: (0x28411bc0) Moving request to the free queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: Taking request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Request received from the kernel: READ(1812529152, 65536).
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Moving request to the send queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b00) Got free request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b00) Waiting for request from the kernel.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: (0x28411b80) Got request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: (0x28411b80) Moving request to the done queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: Taking request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: (0x28411b80) Got request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: (0x28411b80) Moving request to the free queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: Taking request.
The backtrace of gotten stuck hastd is in the attach.
I interpret this in the following way. Although the network is down
hast_proto_send() in remote_send_thread() returns success (sent data are
stored in the kernel buffer). Then kernel tries to send data and eventually
fails after timeout and close the socket. hastd is not aware about this,
remote_send_thread() is blocked in "Taking request" at this time, sync thread
is waiting for status from the secondary about sent data but secondary does
not send it because it did not receive any data.
Restarting hastd on the secondary usually helps. A workaround is to set
net.inet.tcp.keepidle to some small value (e.g. 300 sec) on the
secondary. Then the secondary will notice much earlier that the peer has
closed the connection and will restart the worker itself:
Apr 21 11:52:21 hastb hastd: [storage] (secondary) Unable to receive request header: Connection reset by peer.
Apr 21 11:52:21 hastb hastd: [storage] (secondary) Worker process (pid=1398) exited ungracefully: status=19200.
--
Mikolaj Golub
[-- Attachment #2 --]
Thread 8 (Thread 28404140 (LWP 100078)):
#0 0x2823edd7 in __error () from /lib/libthr.so.3
#1 0x2823e9b8 in __error () from /lib/libthr.so.3
#2 0x284c3520 in ?? ()
#3 0x00000008 in ?? ()
#4 0x00000001 in ?? ()
#5 0x284c3500 in ?? ()
#6 0x00000000 in ?? ()
#7 0x280a5a00 in ?? ()
#8 0xbfbfe980 in ?? ()
#9 0x2823d31f in pthread_setcancelstate () from /lib/libthr.so.3
#10 0x2823cbbe in pthread_cond_signal () from /lib/libthr.so.3
#11 0x08058e78 in cv_wait (cv=0x8067e2c, lock=0x8067e28) at synch.h:125
#12 0x0805b75e in cv_timedwait (cv=0x8067e2c, lock=0x8067e28, timeout=0) at synch.h:135
#13 0x0805b72c in guard_thread (arg=0x284cab00) at /usr/src/sbin/hastd/primary.c:1787
#14 0x08058206 in hastd_primary (res=0x284cab00) at /usr/src/sbin/hastd/primary.c:773
#15 0x0804c4e8 in control_set_role (cfg=0x8066500, nvout=0x284eb0b0, role=2 '\002', res=0x284cab00,
name=0x28481442 "storage", no=0) at /usr/src/sbin/hastd/control.c:114
#16 0x0804cd01 in control_handle (cfg=0x8066500) at /usr/src/sbin/hastd/control.c:332
#17 0x0804f07c in main_loop () at /usr/src/sbin/hastd/hastd.c:425
#18 0x0804f3e8 in main (argc=0, argv=0xbfbfeda4) at /usr/src/sbin/hastd/hastd.c:521
Thread 7 (Thread 28404280 (LWP 100111)):
#0 0x28344773 in ioctl () from /lib/libc.so.7
#1 0x080588c4 in ggate_recv_thread (arg=0x284cab00) at /usr/src/sbin/hastd/primary.c:894
#2 0x2823428f in pthread_getprio () from /lib/libthr.so.3
#3 0x00000000 in ?? ()
Thread 6 (Thread 284043c0 (LWP 100112)):
#0 0x2823edd7 in __error () from /lib/libthr.so.3
#1 0x2823e9b8 in __error () from /lib/libthr.so.3
#2 0x284c32a0 in ?? ()
#3 0x00000008 in ?? ()
#4 0x00000001 in ?? ()
#5 0x284c3280 in ?? ()
#6 0x00000000 in ?? ()
#7 0xbf8fde94 in ?? ()
#8 0x28238db5 in pthread_rwlock_unlock () from /lib/libthr.so.3
#9 0x2823cbbe in pthread_cond_signal () from /lib/libthr.so.3
#10 0x08058e78 in cv_wait (cv=0x284c9080, lock=0x284c9078) at synch.h:125
#11 0x08058f37 in local_send_thread (arg=0x284cab00) at /usr/src/sbin/hastd/primary.c:1032
#12 0x2823428f in pthread_getprio () from /lib/libthr.so.3
#13 0x00000000 in ?? ()
Thread 5 (Thread 28404500 (LWP 100113)):
#0 0x2823edd7 in __error () from /lib/libthr.so.3
#1 0x2823e9b8 in __error () from /lib/libthr.so.3
#2 0x284c33a0 in ?? ()
#3 0x00000008 in ?? ()
#4 0x00000001 in ?? ()
#5 0x284c3380 in ?? ()
#6 0x00000000 in ?? ()
#7 0x00000000 in ?? ()
#8 0xd2ede389 in ?? ()
#9 0x2823d31f in pthread_setcancelstate () from /lib/libthr.so.3
#10 0x2823cbbe in pthread_cond_signal () from /lib/libthr.so.3
#11 0x08058e78 in cv_wait (cv=0x284c9084, lock=0x284c907c) at synch.h:125
#12 0x0805950f in remote_send_thread (arg=0x284cab00) at /usr/src/sbin/hastd/primary.c:1117
#13 0x2823428f in pthread_getprio () from /lib/libthr.so.3
#14 0x00000000 in ?? ()
Thread 4 (Thread 28404640 (LWP 100114)):
#0 0x282faa57 in recvfrom () from /lib/libc.so.7
#1 0x28280be2 in recv () from /lib/libc.so.7
#2 0x0805c287 in proto_common_recv (fd=11, data=0xbf6fbf27 "", size=5)
at /usr/src/sbin/hastd/proto_common.c:78
#3 0x0805d4f0 in tcp4_recv (ctx=0x2847f220, data=0xbf6fbf27 "", size=5)
at /usr/src/sbin/hastd/proto_tcp4.c:325
#4 0x0805bdf1 in proto_recv (conn=0x284eb150, data=0xbf6fbf27, size=5) at /usr/src/sbin/hastd/proto.c:198
#5 0x0804ddae in hast_proto_recv_hdr (conn=0x284eb150, nvp=0xbf6fbf7c) at /usr/src/sbin/hastd/hast_proto.c:298
#6 0x08059ef9 in remote_recv_thread (arg=0x284cab00) at /usr/src/sbin/hastd/primary.c:1282
#7 0x2823428f in pthread_getprio () from /lib/libthr.so.3
#8 0x00000000 in ?? ()
Thread 3 (Thread 28404780 (LWP 100115)):
#0 0x2823edd7 in __error () from /lib/libthr.so.3
#1 0x2823e9b8 in __error () from /lib/libthr.so.3
#2 0x284c34a0 in ?? ()
#3 0x00000008 in ?? ()
#4 0x00000001 in ?? ()
#4 0x00000001 in ?? ()
#5 0x284c3480 in ?? ()
#6 0x00000000 in ?? ()
#7 0x00000000 in ?? ()
#8 0x00000000 in ?? ()
#9 0x2823d31f in pthread_setcancelstate () from /lib/libthr.so.3
#10 0x2823cbbe in pthread_cond_signal () from /lib/libthr.so.3
#11 0x08058e78 in cv_wait (cv=0x8067e14, lock=0x8067e10) at synch.h:125
#12 0x0805a40b in ggate_send_thread (arg=0x284cab00) at /usr/src/sbin/hastd/primary.c:1383
#13 0x2823428f in pthread_getprio () from /lib/libthr.so.3
#14 0x00000000 in ?? ()
Thread 2 (Thread 284048c0 (LWP 100116)):
#0 0x2823edd7 in __error () from /lib/libthr.so.3
#1 0x2823e9b8 in __error () from /lib/libthr.so.3
#2 0x284c31a0 in ?? ()
#3 0x00000008 in ?? ()
#4 0x00000001 in ?? ()
#5 0x284c3180 in ?? ()
#6 0x00000000 in ?? ()
#7 0x00000000 in ?? ()
#8 0xbf4f9ea8 in ?? ()
#9 0x2823d31f in pthread_setcancelstate () from /lib/libthr.so.3
#10 0x2823cbbe in pthread_cond_signal () from /lib/libthr.so.3
#11 0x08058e78 in cv_wait (cv=0x8067e20, lock=0x8067e1c) at synch.h:125
#12 0x0805a7cc in sync_thread (arg=0x284cab00) at /usr/src/sbin/hastd/primary.c:1472
#13 0x2823428f in pthread_getprio () from /lib/libthr.so.3
#14 0x00000000 in ?? ()
Thread 1 (Thread 28404a00 (LWP 100117)):
#0 0x282faa55 in recvfrom () from /lib/libc.so.7
#1 0x28280be2 in recv () from /lib/libc.so.7
#2 0x0805c287 in proto_common_recv (fd=9, data=0xbf3f8f47 "*", size=5)
at /usr/src/sbin/hastd/proto_common.c:78
#3 0x0805c6ae in sp_recv (ctx=0x284eb100, data=0xbf3f8f47 "*", size=5)
at /usr/src/sbin/hastd/proto_socketpair.c:177
#4 0x0805bdf1 in proto_recv (conn=0x284eb0f0, data=0xbf3f8f47, size=5) at /usr/src/sbin/hastd/proto.c:198
#5 0x0804ddae in hast_proto_recv_hdr (conn=0x284eb0f0, nvp=0xbf3f8f80) at /usr/src/sbin/hastd/hast_proto.c:298
#6 0x0804ce27 in ctrl_thread (arg=0x284cab00) at /usr/src/sbin/hastd/control.c:373
#7 0x2823428f in pthread_getprio () from /lib/libthr.so.3
#8 0x00000000 in ?? ()
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?86r5m9dvqf.fsf>
