Date: Sat, 02 Oct 2010 15:20:58 +0300 From: Mikolaj Golub <to.my.trociny@gmail.com> To: freebsd-fs@freebsd.org Subject: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain Message-ID: <86hbh44wgl.fsf@kopusha.home.net>
next in thread | raw e-mail | index | archive | help
[-- Attachment #1 --]
After recent changes in hastd (I think r213006: Fix descriptor leaks) if
split-brain occurs hastd will abort in child_cleanup() on assertion
(res->hr_event != NULL).
Oct 2 17:24:17 lolek hastd[39334]: [storage] (init) Role changed to secondary.
Oct 2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.0:8457.
Oct 2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:17367 to tcp4://172.20.68.11:8457.
Oct 2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:17367: resource=storage
Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Initial connection from tcp4://172.20.68.12:17367.
Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Incoming connection from tcp4://172.20.68.12:17367 configured.
Oct 2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.0:8457.
Oct 2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:13769 to tcp4://172.20.68.11:8457.
Oct 2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:13769: resource=storage
Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Outgoing connection to tcp4://172.20.68.12:13769 configured.
Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Obtained info about /dev/ad4.
Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Locked /dev/ad4.
Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Split-brain detected, exiting.
Oct 2 17:24:17 lolek hastd[39334]: Unable to receive event header: Socket is not connected.
Oct 2 17:24:28 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.0:8457.
Oct 2 17:24:28 lolek hastd[39334]: Connection from tcp4://172.20.68.12:59760 to tcp4://172.20.68.11:8457.
Oct 2 17:24:28 lolek hastd[39334]: tcp4://172.20.68.12:59760: resource=storage
Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Initial connection from tcp4://172.20.68.12:59760.
Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process exists (pid=39339), stopping it.
Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process exited ungracefully (pid=39339, exitcode=78).
Oct 2 17:24:28 lolek kernel: pid 39334 (hastd), uid 0: exited on signal 6 (core dumped)
(gdb) bt
#0 0x28348d87 in kill () from /lib/libc.so.7
#1 0x280e1017 in raise () from /lib/libthr.so.3
#2 0x2834787a in abort () from /lib/libc.so.7
#3 0x2832fc86 in __assert () from /lib/libc.so.7
#4 0x0805f300 in proto_close (conn=0x0) at /usr/src/sbin/hastd/proto.c:287
#5 0x0804c445 in child_cleanup (res=0x284eb500) at /usr/src/sbin/hastd/control.c:61
#6 0x0804fc6d in listen_accept () at /usr/src/sbin/hastd/hastd.c:526
#7 0x0805059a in main_loop () at /usr/src/sbin/hastd/hastd.c:673
#8 0x08050a7f in main (argc=0, argv=0xbfbfed80) at /usr/src/sbin/hastd/hastd.c:784
(gdb) fr 5
#5 0x0804c445 in child_cleanup (res=0x284eb500) at /usr/src/sbin/hastd/control.c:61
61 proto_close(res->hr_event);
(gdb) list
56 child_cleanup(struct hast_resource *res)
57 {
58
59 proto_close(res->hr_ctrl);
60 res->hr_ctrl = NULL;
61 proto_close(res->hr_event);
62 res->hr_event = NULL;
63 res->hr_workerpid = 0;
64 }
65
So we have double close of res->hr_event. The first time it is closed when
parent detects that worker exited in main_loop(), and the second time when a
new connection from primary comes and the parent does cleanup after previously
terminated child before starting new one.
The straightforward fix is to check res->hr_event before closing, like in the
patch below.
--
Mikolaj Golub
[-- Attachment #2 --]
Index: sbin/hastd/control.c
===================================================================
--- sbin/hastd/control.c (revision 213357)
+++ sbin/hastd/control.c (working copy)
@@ -58,8 +58,10 @@ child_cleanup(struct hast_resource *res)
proto_close(res->hr_ctrl);
res->hr_ctrl = NULL;
- proto_close(res->hr_event);
- res->hr_event = NULL;
+ if (res->hr_event != NULL) {
+ proto_close(res->hr_event);
+ res->hr_event = NULL;
+ }
res->hr_workerpid = 0;
}
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?86hbh44wgl.fsf>
