Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 4 Oct 2010 22:58:54 +0200
From:      Pawel Jakub Dawidek <pjd@FreeBSD.org>
To:        Mikolaj Golub <to.my.trociny@gmail.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
Message-ID:  <20101004205854.GJ7322@garage.freebsd.pl>
In-Reply-To: <86hbh44wgl.fsf@kopusha.home.net>
References:  <86hbh44wgl.fsf@kopusha.home.net>

next in thread | previous in thread | raw e-mail | index | archive | help

--fUvfsPTz/SzOZDdw
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Oct 02, 2010 at 03:20:58PM +0300, Mikolaj Golub wrote:
> Hi,
>=20
> After recent changes in hastd (I think r213006: Fix descriptor leaks) if
> split-brain occurs hastd will abort in child_cleanup() on assertion
> (res->hr_event !=3D NULL).
>=20
> Oct  2 17:24:17 lolek hastd[39334]: [storage] (init) Role changed to seco=
ndary.
> Oct  2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.=
0:8457.
> Oct  2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:1=
7367 to tcp4://172.20.68.11:8457.
> Oct  2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:17367: resource=
=3Dstorage
> Oct  2 17:24:17 lolek hastd[39334]: [storage] (secondary) Initial connect=
ion from tcp4://172.20.68.12:17367.
> Oct  2 17:24:17 lolek hastd[39334]: [storage] (secondary) Incoming connec=
tion from tcp4://172.20.68.12:17367 configured.
> Oct  2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.=
0:8457.
> Oct  2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:1=
3769 to tcp4://172.20.68.11:8457.
> Oct  2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:13769: resource=
=3Dstorage
> Oct  2 17:24:17 lolek hastd[39334]: [storage] (secondary) Outgoing connec=
tion to tcp4://172.20.68.12:13769 configured.
> Oct  2 17:24:17 lolek hastd[39339]: [storage] (secondary) Obtained info a=
bout /dev/ad4.
> Oct  2 17:24:17 lolek hastd[39339]: [storage] (secondary) Locked /dev/ad4.
> Oct  2 17:24:17 lolek hastd[39339]: [storage] (secondary) Split-brain det=
ected, exiting.
> Oct  2 17:24:17 lolek hastd[39334]: Unable to receive event header: Socke=
t is not connected.
> Oct  2 17:24:28 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.=
0:8457.
> Oct  2 17:24:28 lolek hastd[39334]: Connection from tcp4://172.20.68.12:5=
9760 to tcp4://172.20.68.11:8457.
> Oct  2 17:24:28 lolek hastd[39334]: tcp4://172.20.68.12:59760: resource=
=3Dstorage
> Oct  2 17:24:28 lolek hastd[39334]: [storage] (secondary) Initial connect=
ion from tcp4://172.20.68.12:59760.
> Oct  2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process =
exists (pid=3D39339), stopping it.
> Oct  2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process =
exited ungracefully (pid=3D39339, exitcode=3D78).
> Oct  2 17:24:28 lolek kernel: pid 39334 (hastd), uid 0: exited on signal =
6 (core dumped)
>=20
> (gdb) bt
> #0  0x28348d87 in kill () from /lib/libc.so.7
> #1  0x280e1017 in raise () from /lib/libthr.so.3
> #2  0x2834787a in abort () from /lib/libc.so.7
> #3  0x2832fc86 in __assert () from /lib/libc.so.7
> #4  0x0805f300 in proto_close (conn=3D0x0) at /usr/src/sbin/hastd/proto.c=
:287
> #5  0x0804c445 in child_cleanup (res=3D0x284eb500) at /usr/src/sbin/hastd=
/control.c:61
> #6  0x0804fc6d in listen_accept () at /usr/src/sbin/hastd/hastd.c:526
> #7  0x0805059a in main_loop () at /usr/src/sbin/hastd/hastd.c:673
> #8  0x08050a7f in main (argc=3D0, argv=3D0xbfbfed80) at /usr/src/sbin/has=
td/hastd.c:784
> (gdb) fr 5
> #5  0x0804c445 in child_cleanup (res=3D0x284eb500) at /usr/src/sbin/hastd=
/control.c:61
> 61              proto_close(res->hr_event);
> (gdb) list
> 56      child_cleanup(struct hast_resource *res)
> 57      {
> 58
> 59              proto_close(res->hr_ctrl);
> 60              res->hr_ctrl =3D NULL;
> 61              proto_close(res->hr_event);
> 62              res->hr_event =3D NULL;
> 63              res->hr_workerpid =3D 0;
> 64      }
> 65
>=20
> So we have double close of res->hr_event. The first time it is closed when
> parent detects that worker exited in main_loop(), and the second time whe=
n a
> new connection from primary comes and the parent does cleanup after previ=
ously
> terminated child before starting new one.
>=20
> The straightforward fix is to check res->hr_event before closing, like in=
 the
> patch below.

This is correct fix. We close event socketpair early to prevent spaming
with debug logs as SIGCHLD is delivered a bit later.

--=20
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

--fUvfsPTz/SzOZDdw
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iEYEARECAAYFAkyqQA0ACgkQForvXbEpPzSW8gCffC16IuebCBwq6wRvBRaoEejg
jysAoJVHo8SIqIkEmPKrp/y2mu/CexkC
=FmBS
-----END PGP SIGNATURE-----

--fUvfsPTz/SzOZDdw--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20101004205854.GJ7322>