From owner-freebsd-fs@FreeBSD.ORG Mon Sep 20 13:25:14 2010 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2B117106566C for ; Mon, 20 Sep 2010 13:25:14 +0000 (UTC) (envelope-from pjd@garage.freebsd.pl) Received: from mail.garage.freebsd.pl (60.wheelsystems.com [83.12.187.60]) by mx1.freebsd.org (Postfix) with ESMTP id 793098FC08 for ; Mon, 20 Sep 2010 13:25:13 +0000 (UTC) Received: by mail.garage.freebsd.pl (Postfix, from userid 65534) id 18C3745D8D; Mon, 20 Sep 2010 15:25:12 +0200 (CEST) Received: from localhost (pdawidek.whl [10.0.1.1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.garage.freebsd.pl (Postfix) with ESMTP id 7CE2845C9C; Mon, 20 Sep 2010 15:25:06 +0200 (CEST) Date: Mon, 20 Sep 2010 15:24:50 +0200 From: Pawel Jakub Dawidek To: Mikolaj Golub Message-ID: <20100920132450.GE2410@garage.freebsd.pl> References: <86d3saawn9.fsf@kopusha.home.net> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="EXKGNeO8l0xGFBjy" Content-Disposition: inline In-Reply-To: <86d3saawn9.fsf@kopusha.home.net> User-Agent: Mutt/1.4.2.3i X-PGP-Key-URL: http://people.freebsd.org/~pjd/pjd.asc X-OS: FreeBSD 9.0-CURRENT amd64 X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail.garage.freebsd.pl X-Spam-Level: X-Spam-Status: No, score=-5.9 required=4.5 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.0.4 Cc: freebsd-fs@freebsd.org Subject: Re: hastd: possible race when a worker is starting X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 20 Sep 2010 13:25:14 -0000 --EXKGNeO8l0xGFBjy Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Sep 19, 2010 at 12:51:54PM +0300, Mikolaj Golub wrote: > Hi, >=20 > After recent changes in hastd we have the following scenario when a worker > process is started: >=20 > 1) worker process is created; > 2) remote connections are established and CONNECT event is sent to parent; > 3) threads (among them control thread) are started. >=20 > Consider a sitution when hastctl request is recevied by parent between (1= ) and > (2). The parent sees that worker is running and sends it a request and wa= its > for a response. The request should be processed by control thread, but it= is > not started yet. Instead the worker sends CONNECT event and waits for a > response from the parent. We have a deadlock. >=20 > I was able to reproduce this scenario running a script that runs 'hastctl > status' requests in background and periodically kills the worker to start= a > new one: >=20 > ---------------------------------------------------------------------- >=20 > #!/bin/sh >=20 > hastctl role primary storage >=20 > while true; do > hastctl status storage > /dev/null > done & >=20 > while sleep 1; do > pid=3D`ps auxww | awk '/hastd: storage \(primary\)/ {print $2}'` > if [ -n "${pid}" ]; then > kill "${pid}" > sleep 1; > npid=3D`ps auxww | awk '/hastd: storage \(primary\)/ {pri= nt $2}'` > if [ "${pid}" =3D "${npid}" ]; then > echo "Process ${pid} was not killed" >&2 > exit 1 > fi > fi > done >=20 > ---------------------------------------------------------------------- >=20 > [root@lolek ~]# ./test_event2.sh=20 > Process 17559 was not killed >=20 > [root@lolek ~]# ps auxww |grep hast > root 17563 2.0 0.3 10924 1764 1 S 1:03PM 0:00.01 hastctl = status storage > root 3334 0.0 0.5 11244 2488 ?? Ss 12:13PM 0:02.06 /sbin/ha= std -ddd > root 17559 0.0 0.5 11244 2520 ?? S 1:03PM 0:00.01 hastd: s= torage (primary) (hastd) > root 17561 0.0 0.3 10924 1764 1 S 1:03PM 0:00.01 hastctl = status storage >=20 > [root@lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 3334 >=20 > (gdb) bt > #0 0x28300ed5 in recvfrom () from /lib/libc.so.7 > #1 0x28286f52 in recv () from /lib/libc.so.7 > #2 0x0805f237 in proto_common_recv (fd=3D41, data=3D0xbfbfeaa7 "(", size= =3D5) > at /usr/src/sbin/hastd/proto_common.c:77 > #3 0x0805f68d in sp_recv (ctx=3D0x2850e550, data=3D0xbfbfeaa7 "(", size= =3D5) > at /usr/src/sbin/hastd/proto_socketpair.c:185 > #4 0x0805ec61 in proto_recv (conn=3D0x2850e540, data=3D0xbfbfeaa7, size= =3D5) > at /usr/src/sbin/hastd/proto.c:207 > #5 0x0804e42e in hast_proto_recv_hdr (conn=3D0x2850e540, nvp=3D0xbfbfeae= 4) > at /usr/src/sbin/hastd/hast_proto.c:308 > #6 0x0804c83f in control_status_worker (res=3D0x284eb500, nvout=3D0x2850= e4d0, no=3D0) > at /usr/src/sbin/hastd/control.c:160 > #7 0x0804cc9f in control_status (cfg=3D0x28419600, nvout=3D0x2850e4d0, r= es=3D0x284eb500,=20 > name=3D0x2851042c "storage", no=3D0) at /usr/src/sbin/hastd/control.c= :248 > #8 0x0804cfbb in control_handle (cfg=3D0x28419600) at /usr/src/sbin/hast= d/control.c:348 > #9 0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682 > #10 0x0805076d in main (argc=3D0, argv=3D0xbfbfecd0) at /usr/src/sbin/has= td/hastd.c:792 >=20 > [root@lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 17559 >=20 > (gdb) bt > #0 0x28300ed5 in recvfrom () from /lib/libc.so.7 > #1 0x28286f52 in recv () from /lib/libc.so.7 > #2 0x0805f237 in proto_common_recv (fd=3D43, data=3D0xbfbfe9e7 "(", size= =3D5) > at /usr/src/sbin/hastd/proto_common.c:77 > #3 0x0805f68d in sp_recv (ctx=3D0x2850e570, data=3D0xbfbfe9e7 "(", size= =3D5) > at /usr/src/sbin/hastd/proto_socketpair.c:185 > #4 0x0805ec61 in proto_recv (conn=3D0x2850e560, data=3D0xbfbfe9e7, size= =3D5) > at /usr/src/sbin/hastd/proto.c:207 > #5 0x0804e42e in hast_proto_recv_hdr (conn=3D0x2850e560, nvp=3D0xbfbfea2= 4) > at /usr/src/sbin/hastd/hast_proto.c:308 > #6 0x0804db3c in event_send (res=3D0x284eb500, event=3D1) at /usr/src/sb= in/hastd/event.c:69 > #7 0x0805a1d1 in init_remote (res=3D0x284eb500, inp=3D0x0, outp=3D0x0) > at /usr/src/sbin/hastd/primary.c:671 > #8 0x0805a7f4 in hastd_primary (res=3D0x284eb500) at /usr/src/sbin/hastd= /primary.c:810 > #9 0x0804c6d5 in control_set_role_common (cfg=3D0x28419600, nvout=3D0x28= 50e4d0, role=3D2 '\002',=20 > res=3D0x284eb500, name=3D0x28510442 "storage", no=3D0) at /usr/src/sb= in/hastd/control.c:117 > #10 0x0804cf91 in control_handle (cfg=3D0x28419600) at /usr/src/sbin/hast= d/control.c:344 > #11 0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682 > #12 0x0805076d in main (argc=3D0, argv=3D0xbfbfecd0) at /usr/src/sbin/has= td/hastd.c:792 >=20 > This is for primary worker but I believe the same issue is with secondary. >=20 > The fix would be e.g. to start control thread before sending any event. Thanks for the report and analysis. Could you try this patch: http://people.freebsd.org/~pjd/patches/hastd.patch --=20 Pawel Jakub Dawidek http://www.wheelsystems.com pjd@FreeBSD.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! --EXKGNeO8l0xGFBjy Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.14 (FreeBSD) iEYEARECAAYFAkyXYKIACgkQForvXbEpPzQYHwCgpmfaPYBLx6+OKpuLh3PZFZsU nn8Ani81FaDawZ/BP53CMWs8e6A7fiG4 =wG3l -----END PGP SIGNATURE----- --EXKGNeO8l0xGFBjy--