Date: Mon, 4 Oct 2010 22:58:54 +0200 From: Pawel Jakub Dawidek <pjd@FreeBSD.org> To: Mikolaj Golub <to.my.trociny@gmail.com> Cc: freebsd-fs@freebsd.org Subject: Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain Message-ID: <20101004205854.GJ7322@garage.freebsd.pl> In-Reply-To: <86hbh44wgl.fsf@kopusha.home.net> References: <86hbh44wgl.fsf@kopusha.home.net>
next in thread | previous in thread | raw e-mail | index | archive | help
--fUvfsPTz/SzOZDdw Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Oct 02, 2010 at 03:20:58PM +0300, Mikolaj Golub wrote: > Hi, >=20 > After recent changes in hastd (I think r213006: Fix descriptor leaks) if > split-brain occurs hastd will abort in child_cleanup() on assertion > (res->hr_event !=3D NULL). >=20 > Oct 2 17:24:17 lolek hastd[39334]: [storage] (init) Role changed to seco= ndary. > Oct 2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.= 0:8457. > Oct 2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:1= 7367 to tcp4://172.20.68.11:8457. > Oct 2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:17367: resource= =3Dstorage > Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Initial connect= ion from tcp4://172.20.68.12:17367. > Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Incoming connec= tion from tcp4://172.20.68.12:17367 configured. > Oct 2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.= 0:8457. > Oct 2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:1= 3769 to tcp4://172.20.68.11:8457. > Oct 2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:13769: resource= =3Dstorage > Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Outgoing connec= tion to tcp4://172.20.68.12:13769 configured. > Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Obtained info a= bout /dev/ad4. > Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Locked /dev/ad4. > Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Split-brain det= ected, exiting. > Oct 2 17:24:17 lolek hastd[39334]: Unable to receive event header: Socke= t is not connected. > Oct 2 17:24:28 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.= 0:8457. > Oct 2 17:24:28 lolek hastd[39334]: Connection from tcp4://172.20.68.12:5= 9760 to tcp4://172.20.68.11:8457. > Oct 2 17:24:28 lolek hastd[39334]: tcp4://172.20.68.12:59760: resource= =3Dstorage > Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Initial connect= ion from tcp4://172.20.68.12:59760. > Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process = exists (pid=3D39339), stopping it. > Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process = exited ungracefully (pid=3D39339, exitcode=3D78). > Oct 2 17:24:28 lolek kernel: pid 39334 (hastd), uid 0: exited on signal = 6 (core dumped) >=20 > (gdb) bt > #0 0x28348d87 in kill () from /lib/libc.so.7 > #1 0x280e1017 in raise () from /lib/libthr.so.3 > #2 0x2834787a in abort () from /lib/libc.so.7 > #3 0x2832fc86 in __assert () from /lib/libc.so.7 > #4 0x0805f300 in proto_close (conn=3D0x0) at /usr/src/sbin/hastd/proto.c= :287 > #5 0x0804c445 in child_cleanup (res=3D0x284eb500) at /usr/src/sbin/hastd= /control.c:61 > #6 0x0804fc6d in listen_accept () at /usr/src/sbin/hastd/hastd.c:526 > #7 0x0805059a in main_loop () at /usr/src/sbin/hastd/hastd.c:673 > #8 0x08050a7f in main (argc=3D0, argv=3D0xbfbfed80) at /usr/src/sbin/has= td/hastd.c:784 > (gdb) fr 5 > #5 0x0804c445 in child_cleanup (res=3D0x284eb500) at /usr/src/sbin/hastd= /control.c:61 > 61 proto_close(res->hr_event); > (gdb) list > 56 child_cleanup(struct hast_resource *res) > 57 { > 58 > 59 proto_close(res->hr_ctrl); > 60 res->hr_ctrl =3D NULL; > 61 proto_close(res->hr_event); > 62 res->hr_event =3D NULL; > 63 res->hr_workerpid =3D 0; > 64 } > 65 >=20 > So we have double close of res->hr_event. The first time it is closed when > parent detects that worker exited in main_loop(), and the second time whe= n a > new connection from primary comes and the parent does cleanup after previ= ously > terminated child before starting new one. >=20 > The straightforward fix is to check res->hr_event before closing, like in= the > patch below. This is correct fix. We close event socketpair early to prevent spaming with debug logs as SIGCHLD is delivered a bit later. --=20 Pawel Jakub Dawidek http://www.wheelsystems.com pjd@FreeBSD.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! --fUvfsPTz/SzOZDdw Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iEYEARECAAYFAkyqQA0ACgkQForvXbEpPzSW8gCffC16IuebCBwq6wRvBRaoEejg jysAoJVHo8SIqIkEmPKrp/y2mu/CexkC =FmBS -----END PGP SIGNATURE----- --fUvfsPTz/SzOZDdw--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20101004205854.GJ7322>