From owner-freebsd-fs@FreeBSD.ORG Sun Sep 19 09:52:02 2010 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0E9C91065673 for ; Sun, 19 Sep 2010 09:52:02 +0000 (UTC) (envelope-from to.my.trociny@gmail.com) Received: from mail-ey0-f182.google.com (mail-ey0-f182.google.com [209.85.215.182]) by mx1.freebsd.org (Postfix) with ESMTP id 94AF78FC1A for ; Sun, 19 Sep 2010 09:52:01 +0000 (UTC) Received: by eyx24 with SMTP id 24so1695564eyx.13 for ; Sun, 19 Sep 2010 02:52:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:from:to:subject:date :message-id:user-agent:mime-version:content-type; bh=Xpnvsz1Q2MWnjP5P14fpF50mWBxgLp2+CPc9I+HpfkM=; b=OLRzFtVi0YswC3HGmD0Lj4//Ilg9Xxx+8P5KtwGBiVVnF5B50C+TP9TbnaAId8QsDQ zCE2rtczL/+a/WAeNPiA3KIcRUiVVvBnhnV0zqgy+3WYGJZI2zWjQuiegTdbxyP/6STD d6jy5jL/sx1SCrbVBa7QqqLj0JdgS/ANteTKo= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=from:to:subject:date:message-id:user-agent:mime-version :content-type; b=HvGIPMVGRQXx4ZrhCNVuMv4unRyGcKQ+Zpn7dzbo5KJWr0SxAq4BwH46DcPDP8TUE+ w840wiferwVYHPVU2z/V6/7lAt4v5etg+D83qc0sMO0OjW/6Ac+QJBHReHTaw5X6UHxx JdJVEnLat8RUNbzyqf2W/uHiKrAk/3uZQ02XU= Received: by 10.213.46.208 with SMTP id k16mr5751638ebf.20.1284889920252; Sun, 19 Sep 2010 02:52:00 -0700 (PDT) Received: from localhost ([95.69.173.68]) by mx.google.com with ESMTPS id u9sm8972675eeh.23.2010.09.19.02.51.58 (version=TLSv1/SSLv3 cipher=RC4-MD5); Sun, 19 Sep 2010 02:51:59 -0700 (PDT) From: Mikolaj Golub To: freebsd-fs@freebsd.org Date: Sun, 19 Sep 2010 12:51:54 +0300 Message-ID: <86d3saawn9.fsf@kopusha.home.net> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (berkeley-unix) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: hastd: possible race when a worker is starting X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 19 Sep 2010 09:52:02 -0000 Hi, After recent changes in hastd we have the following scenario when a worker process is started: 1) worker process is created; 2) remote connections are established and CONNECT event is sent to parent; 3) threads (among them control thread) are started. Consider a sitution when hastctl request is recevied by parent between (1) and (2). The parent sees that worker is running and sends it a request and waits for a response. The request should be processed by control thread, but it is not started yet. Instead the worker sends CONNECT event and waits for a response from the parent. We have a deadlock. I was able to reproduce this scenario running a script that runs 'hastctl status' requests in background and periodically kills the worker to start a new one: ---------------------------------------------------------------------- #!/bin/sh hastctl role primary storage while true; do hastctl status storage > /dev/null done & while sleep 1; do pid=`ps auxww | awk '/hastd: storage \(primary\)/ {print $2}'` if [ -n "${pid}" ]; then kill "${pid}" sleep 1; npid=`ps auxww | awk '/hastd: storage \(primary\)/ {print $2}'` if [ "${pid}" = "${npid}" ]; then echo "Process ${pid} was not killed" >&2 exit 1 fi fi done ---------------------------------------------------------------------- [root@lolek ~]# ./test_event2.sh Process 17559 was not killed [root@lolek ~]# ps auxww |grep hast root 17563 2.0 0.3 10924 1764 1 S 1:03PM 0:00.01 hastctl status storage root 3334 0.0 0.5 11244 2488 ?? Ss 12:13PM 0:02.06 /sbin/hastd -ddd root 17559 0.0 0.5 11244 2520 ?? S 1:03PM 0:00.01 hastd: storage (primary) (hastd) root 17561 0.0 0.3 10924 1764 1 S 1:03PM 0:00.01 hastctl status storage [root@lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 3334 (gdb) bt #0 0x28300ed5 in recvfrom () from /lib/libc.so.7 #1 0x28286f52 in recv () from /lib/libc.so.7 #2 0x0805f237 in proto_common_recv (fd=41, data=0xbfbfeaa7 "(", size=5) at /usr/src/sbin/hastd/proto_common.c:77 #3 0x0805f68d in sp_recv (ctx=0x2850e550, data=0xbfbfeaa7 "(", size=5) at /usr/src/sbin/hastd/proto_socketpair.c:185 #4 0x0805ec61 in proto_recv (conn=0x2850e540, data=0xbfbfeaa7, size=5) at /usr/src/sbin/hastd/proto.c:207 #5 0x0804e42e in hast_proto_recv_hdr (conn=0x2850e540, nvp=0xbfbfeae4) at /usr/src/sbin/hastd/hast_proto.c:308 #6 0x0804c83f in control_status_worker (res=0x284eb500, nvout=0x2850e4d0, no=0) at /usr/src/sbin/hastd/control.c:160 #7 0x0804cc9f in control_status (cfg=0x28419600, nvout=0x2850e4d0, res=0x284eb500, name=0x2851042c "storage", no=0) at /usr/src/sbin/hastd/control.c:248 #8 0x0804cfbb in control_handle (cfg=0x28419600) at /usr/src/sbin/hastd/control.c:348 #9 0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682 #10 0x0805076d in main (argc=0, argv=0xbfbfecd0) at /usr/src/sbin/hastd/hastd.c:792 [root@lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 17559 (gdb) bt #0 0x28300ed5 in recvfrom () from /lib/libc.so.7 #1 0x28286f52 in recv () from /lib/libc.so.7 #2 0x0805f237 in proto_common_recv (fd=43, data=0xbfbfe9e7 "(", size=5) at /usr/src/sbin/hastd/proto_common.c:77 #3 0x0805f68d in sp_recv (ctx=0x2850e570, data=0xbfbfe9e7 "(", size=5) at /usr/src/sbin/hastd/proto_socketpair.c:185 #4 0x0805ec61 in proto_recv (conn=0x2850e560, data=0xbfbfe9e7, size=5) at /usr/src/sbin/hastd/proto.c:207 #5 0x0804e42e in hast_proto_recv_hdr (conn=0x2850e560, nvp=0xbfbfea24) at /usr/src/sbin/hastd/hast_proto.c:308 #6 0x0804db3c in event_send (res=0x284eb500, event=1) at /usr/src/sbin/hastd/event.c:69 #7 0x0805a1d1 in init_remote (res=0x284eb500, inp=0x0, outp=0x0) at /usr/src/sbin/hastd/primary.c:671 #8 0x0805a7f4 in hastd_primary (res=0x284eb500) at /usr/src/sbin/hastd/primary.c:810 #9 0x0804c6d5 in control_set_role_common (cfg=0x28419600, nvout=0x2850e4d0, role=2 '\002', res=0x284eb500, name=0x28510442 "storage", no=0) at /usr/src/sbin/hastd/control.c:117 #10 0x0804cf91 in control_handle (cfg=0x28419600) at /usr/src/sbin/hastd/control.c:344 #11 0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682 #12 0x0805076d in main (argc=0, argv=0xbfbfecd0) at /usr/src/sbin/hastd/hastd.c:792 This is for primary worker but I believe the same issue is with secondary. The fix would be e.g. to start control thread before sending any event. -- Mikolaj Golub