From owner-freebsd-fs@FreeBSD.ORG Sat Oct 2 12:20:58 2010
From: Mikolaj Golub
To: freebsd-fs@freebsd.org
Date: Sat, 02 Oct 2010 15:20:58 +0300
Message-ID: <86hbh44wgl.fsf@kopusha.home.net>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Subject: hastd: assertion (res->hr_event != NULL) fails in secondary on split-brain
List-Id: Filesystems

--=-=-=

Hi,

After recent changes in hastd (I think r213006: Fix descriptor leaks), if
split-brain occurs hastd will abort in child_cleanup() on the assertion
(res->hr_event != NULL).

Oct 2 17:24:17 lolek hastd[39334]: [storage] (init) Role changed to secondary.
Oct 2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.0:8457.
Oct 2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:17367 to tcp4://172.20.68.11:8457.
Oct 2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:17367: resource=storage
Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Initial connection from tcp4://172.20.68.12:17367.
Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Incoming connection from tcp4://172.20.68.12:17367 configured.
Oct 2 17:24:17 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.0:8457.
Oct 2 17:24:17 lolek hastd[39334]: Connection from tcp4://172.20.68.12:13769 to tcp4://172.20.68.11:8457.
Oct 2 17:24:17 lolek hastd[39334]: tcp4://172.20.68.12:13769: resource=storage
Oct 2 17:24:17 lolek hastd[39334]: [storage] (secondary) Outgoing connection to tcp4://172.20.68.12:13769 configured.
Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Obtained info about /dev/ad4.
Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Locked /dev/ad4.
Oct 2 17:24:17 lolek hastd[39339]: [storage] (secondary) Split-brain detected, exiting.
Oct 2 17:24:17 lolek hastd[39334]: Unable to receive event header: Socket is not connected.
Oct 2 17:24:28 lolek hastd[39334]: Accepting connection to tcp4://0.0.0.0:8457.
Oct 2 17:24:28 lolek hastd[39334]: Connection from tcp4://172.20.68.12:59760 to tcp4://172.20.68.11:8457.
Oct 2 17:24:28 lolek hastd[39334]: tcp4://172.20.68.12:59760: resource=storage
Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Initial connection from tcp4://172.20.68.12:59760.
Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process exists (pid=39339), stopping it.
Oct 2 17:24:28 lolek hastd[39334]: [storage] (secondary) Worker process exited ungracefully (pid=39339, exitcode=78).
Oct 2 17:24:28 lolek kernel: pid 39334 (hastd), uid 0: exited on signal 6 (core dumped)

(gdb) bt
#0  0x28348d87 in kill () from /lib/libc.so.7
#1  0x280e1017 in raise () from /lib/libthr.so.3
#2  0x2834787a in abort () from /lib/libc.so.7
#3  0x2832fc86 in __assert () from /lib/libc.so.7
#4  0x0805f300 in proto_close (conn=0x0) at /usr/src/sbin/hastd/proto.c:287
#5  0x0804c445 in child_cleanup (res=0x284eb500) at /usr/src/sbin/hastd/control.c:61
#6  0x0804fc6d in listen_accept () at /usr/src/sbin/hastd/hastd.c:526
#7  0x0805059a in main_loop () at /usr/src/sbin/hastd/hastd.c:673
#8  0x08050a7f in main (argc=0, argv=0xbfbfed80) at /usr/src/sbin/hastd/hastd.c:784
(gdb) fr 5
#5  0x0804c445 in child_cleanup (res=0x284eb500) at /usr/src/sbin/hastd/control.c:61
61              proto_close(res->hr_event);
(gdb) list
56      child_cleanup(struct hast_resource *res)
57      {
58
59              proto_close(res->hr_ctrl);
60              res->hr_ctrl = NULL;
61              proto_close(res->hr_event);
62              res->hr_event = NULL;
63              res->hr_workerpid = 0;
64      }
65

So we have a double close of res->hr_event. It is closed the first time when
the parent detects in main_loop() that the worker has exited, and the second
time when a new connection from the primary arrives and the parent cleans up
after the previously terminated child before starting a new one.

The straightforward fix is to check res->hr_event before closing it, as in
the patch below.
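
To make the sequence easier to see outside of hastd, here is a minimal
standalone sketch of the same guard. It is not the hastd source: struct conn,
struct resource, conn_close(), resource_cleanup() and the nclosed counter are
hypothetical stand-ins for the proto connection type, struct hast_resource,
proto_close() and child_cleanup(). The only assumption carried over from the
backtrace is that the close routine asserts on a NULL argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for hastd's connection object. */
struct conn {
	int fd;
};

/* Hypothetical stand-in for struct hast_resource. */
struct resource {
	struct conn *ctrl;	/* control channel to the worker */
	struct conn *event;	/* event channel from the worker */
	int workerpid;
};

/* Counts successful closes; exists only so the sketch can be checked. */
static int nclosed;

/*
 * Like proto_close() in the backtrace above, this asserts that it is
 * never handed a NULL connection.
 */
static void
conn_close(struct conn *c)
{

	assert(c != NULL);
	free(c);
	nclosed++;
}

/*
 * Cleanup that tolerates an event channel which was already closed
 * (and set to NULL) elsewhere, e.g. when the worker's exit was noticed.
 */
static void
resource_cleanup(struct resource *res)
{

	conn_close(res->ctrl);
	res->ctrl = NULL;
	if (res->event != NULL) {
		conn_close(res->event);
		res->event = NULL;
	}
	res->workerpid = 0;
}
```

To reproduce the scenario, allocate both channels, close res->event and set
it to NULL (as happens when the worker exit is detected), then call
resource_cleanup(). With the NULL check it returns normally; without it the
assertion in conn_close() fires, as in the core dump above.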

-- 
Mikolaj Golub

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=control.c.patch

Index: sbin/hastd/control.c
===================================================================
--- sbin/hastd/control.c	(revision 213357)
+++ sbin/hastd/control.c	(working copy)
@@ -58,8 +58,10 @@ child_cleanup(struct hast_resource *res)
 
 	proto_close(res->hr_ctrl);
 	res->hr_ctrl = NULL;
-	proto_close(res->hr_event);
-	res->hr_event = NULL;
+	if (res->hr_event != NULL) {
+		proto_close(res->hr_event);
+		res->hr_event = NULL;
+	}
 	res->hr_workerpid = 0;
 }
 
--=-=-=--