From owner-freebsd-fs@FreeBSD.ORG Mon Oct 4 20:59:27 2010 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B66B81065674 for ; Mon, 4 Oct 2010 20:59:27 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (60.wheelsystems.com [83.12.187.60]) by mx1.freebsd.org (Postfix) with ESMTP id 409C88FC1C for ; Mon, 4 Oct 2010 20:59:26 +0000 (UTC) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id DABD745DD8; Mon, 4 Oct 2010 22:59:25 +0200 (CEST) Received: from localhost (chello089077043238.chello.pl [89.77.43.238]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id 7B88545CDC; Mon, 4 Oct 2010 22:59:20 +0200 (CEST) Date: Mon, 4 Oct 2010 22:58:54 +0200 From: Pawel Jakub Dawidek To: Mikolaj Golub Message-ID: <20101004205854.GJ7322@garage.freebsd.pl> References: <86hbh44wgl.fsf@kopusha.home.net> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="fUvfsPTz/SzOZDdw" Content-Disposition: inline In-Reply-To: <86hbh44wgl.fsf@kopusha.home.net> User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 9.0-CURRENT amd64 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-0.6 required=4.5 tests=BAYES_00,RCVD_IN_SORBS_DUL autolearn=no version=3.0.4 Cc: freebsd-fs@freebsd.org Subject: Re: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 04 Oct 2010 20:59:27 -0000 --fUvfsPTz/SzOZDdw Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Oct 02, 2010 at 03:20:58PM +0300, Mikolaj Golub wrote: > Hi, >=20 > After recent changes in hastd (I think r213006: Fix descriptor leaks) if > split-brain occurs hastd will abort in child_cleanup() on assertion > (res->hr_event !=3D NULL). >=20 > Oct 2 17:24:17 lolek hastd[39334]: [storage] (init) Role changed to seco= ndary. > Oct 2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.= 0:8457. > Oct 2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:1= 7367 to tcp4://172.20.68.11:8457. > Oct 2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:17367: resource= =3Dstorage > Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Initial connect= ion from tcp4://172.20.68.12:17367. > Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Incoming connec= tion from tcp4://172.20.68.12:17367 configured. > Oct 2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.= 0:8457. > Oct 2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:1= 3769 to tcp4://172.20.68.11:8457. > Oct 2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:13769: resource= =3Dstorage > Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Outgoing connec= tion to tcp4://172.20.68.12:13769 configured. > Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Obtained info a= bout /dev/ad4. > Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Locked /dev/ad4. > Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Split-brain det= ected, exiting. > Oct 2 17:24:17 lolek hastd[39334]: Unable to receive event header: Socke= t is not connected. > Oct 2 17:24:28 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.= 0:8457. > Oct 2 17:24:28 lolek hastd[39334]: Connection from tcp4://172.20.68.12:5= 9760 to tcp4://172.20.68.11:8457. > Oct 2 17:24:28 lolek hastd[39334]: tcp4://172.20.68.12:59760: resource= =3Dstorage > Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Initial connect= ion from tcp4://172.20.68.12:59760. > Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process = exists (pid=3D39339), stopping it. > Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process = exited ungracefully (pid=3D39339, exitcode=3D78). > Oct 2 17:24:28 lolek kernel: pid 39334 (hastd), uid 0: exited on signal = 6 (core dumped) >=20 > (gdb) bt > #0 0x28348d87 in kill () from /lib/libc.so.7 > #1 0x280e1017 in raise () from /lib/libthr.so.3 > #2 0x2834787a in abort () from /lib/libc.so.7 > #3 0x2832fc86 in __assert () from /lib/libc.so.7 > #4 0x0805f300 in proto_close (conn=3D0x0) at /usr/src/sbin/hastd/proto.c= :287 > #5 0x0804c445 in child_cleanup (res=3D0x284eb500) at /usr/src/sbin/hastd= /control.c:61 > #6 0x0804fc6d in listen_accept () at /usr/src/sbin/hastd/hastd.c:526 > #7 0x0805059a in main_loop () at /usr/src/sbin/hastd/hastd.c:673 > #8 0x08050a7f in main (argc=3D0, argv=3D0xbfbfed80) at /usr/src/sbin/has= td/hastd.c:784 > (gdb) fr 5 > #5 0x0804c445 in child_cleanup (res=3D0x284eb500) at /usr/src/sbin/hastd= /control.c:61 > 61 proto_close(res->hr_event); > (gdb) list > 56 child_cleanup(struct hast_resource *res) > 57 { > 58 > 59 proto_close(res->hr_ctrl); > 60 res->hr_ctrl =3D NULL; > 61 proto_close(res->hr_event); > 62 res->hr_event =3D NULL; > 63 res->hr_workerpid =3D 0; > 64 } > 65 >=20 > So we have double close of res->hr_event. The first time it is closed when > parent detects that worker exited in main_loop(), and the second time whe= n a > new connection from primary comes and the parent does cleanup after previ= ously > terminated child before starting new one. >=20 > The straightforward fix is to check res->hr_event before closing, like in= the > patch below. This is correct fix. We close event socketpair early to prevent spaming with debug logs as SIGCHLD is delivered a bit later. --=20 Pawel Jakub Dawidek http://www.wheelsystems.com pjd@FreeBSD.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! --fUvfsPTz/SzOZDdw Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iEYEARECAAYFAkyqQA0ACgkQForvXbEpPzSW8gCffC16IuebCBwq6wRvBRaoEejg jysAoJVHo8SIqIkEmPKrp/y2mu/CexkC =FmBS -----END PGP SIGNATURE----- --fUvfsPTz/SzOZDdw--