Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 20 Sep 2010 15:24:50 +0200
From:      Pawel Jakub Dawidek <pjd@FreeBSD.org>
To:        Mikolaj Golub <to.my.trociny@gmail.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: hastd: possible race when a worker is starting
Message-ID:  <20100920132450.GE2410@garage.freebsd.pl>
In-Reply-To: <86d3saawn9.fsf@kopusha.home.net>
References:  <86d3saawn9.fsf@kopusha.home.net>

next in thread | previous in thread | raw e-mail | index | archive | help

--EXKGNeO8l0xGFBjy
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, Sep 19, 2010 at 12:51:54PM +0300, Mikolaj Golub wrote:
> Hi,
>=20
> After recent changes in hastd we have the following scenario when a worker
> process is started:
>=20
> 1) worker process is created;
> 2) remote connections are established and CONNECT event is sent to parent;
> 3) threads (among them control thread) are started.
>=20
> Consider a sitution when hastctl request is recevied by parent between (1=
) and
> (2). The parent sees that worker is running and sends it a request and wa=
its
> for a response. The request should be processed by control thread, but it=
 is
> not started yet. Instead the worker sends CONNECT event and waits for a
> response from the parent. We have a deadlock.
>=20
> I was able to reproduce this scenario running a script that runs 'hastctl
> status' requests in background and periodically kills the worker to start=
 a
> new one:
>=20
> ----------------------------------------------------------------------
>=20
> #!/bin/sh
>=20
> hastctl role primary storage
>=20
> while true; do
>         hastctl status storage > /dev/null
> done &
>=20
> while sleep 1; do
>         pid=3D`ps auxww | awk '/hastd: storage \(primary\)/ {print $2}'`
>         if [ -n "${pid}" ]; then
>                 kill "${pid}"
>                 sleep 1;
>                 npid=3D`ps auxww | awk '/hastd: storage \(primary\)/ {pri=
nt $2}'`
>                 if [ "${pid}" =3D "${npid}" ]; then
>                         echo "Process ${pid} was not killed" >&2
>                         exit 1
>                 fi
>         fi
> done
>=20
> ----------------------------------------------------------------------
>=20
> [root@lolek ~]# ./test_event2.sh=20
> Process 17559 was not killed
>=20
> [root@lolek ~]# ps auxww |grep hast
> root  17563   2.0  0.3  10924   1764   1  S     1:03PM   0:00.01 hastctl =
status storage
> root   3334   0.0  0.5  11244   2488  ??  Ss   12:13PM   0:02.06 /sbin/ha=
std -ddd
> root  17559   0.0  0.5  11244   2520  ??  S     1:03PM   0:00.01 hastd: s=
torage (primary) (hastd)
> root  17561   0.0  0.3  10924   1764   1  S     1:03PM   0:00.01 hastctl =
status storage
>=20
> [root@lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 3334
>=20
> (gdb) bt
> #0  0x28300ed5 in recvfrom () from /lib/libc.so.7
> #1  0x28286f52 in recv () from /lib/libc.so.7
> #2  0x0805f237 in proto_common_recv (fd=3D41, data=3D0xbfbfeaa7 "(", size=
=3D5)
>     at /usr/src/sbin/hastd/proto_common.c:77
> #3  0x0805f68d in sp_recv (ctx=3D0x2850e550, data=3D0xbfbfeaa7 "(", size=
=3D5)
>     at /usr/src/sbin/hastd/proto_socketpair.c:185
> #4  0x0805ec61 in proto_recv (conn=3D0x2850e540, data=3D0xbfbfeaa7, size=
=3D5)
>     at /usr/src/sbin/hastd/proto.c:207
> #5  0x0804e42e in hast_proto_recv_hdr (conn=3D0x2850e540, nvp=3D0xbfbfeae=
4)
>     at /usr/src/sbin/hastd/hast_proto.c:308
> #6  0x0804c83f in control_status_worker (res=3D0x284eb500, nvout=3D0x2850=
e4d0, no=3D0)
>     at /usr/src/sbin/hastd/control.c:160
> #7  0x0804cc9f in control_status (cfg=3D0x28419600, nvout=3D0x2850e4d0, r=
es=3D0x284eb500,=20
>     name=3D0x2851042c "storage", no=3D0) at /usr/src/sbin/hastd/control.c=
:248
> #8  0x0804cfbb in control_handle (cfg=3D0x28419600) at /usr/src/sbin/hast=
d/control.c:348
> #9  0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682
> #10 0x0805076d in main (argc=3D0, argv=3D0xbfbfecd0) at /usr/src/sbin/has=
td/hastd.c:792
>=20
> [root@lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 17559
>=20
> (gdb) bt
> #0  0x28300ed5 in recvfrom () from /lib/libc.so.7
> #1  0x28286f52 in recv () from /lib/libc.so.7
> #2  0x0805f237 in proto_common_recv (fd=3D43, data=3D0xbfbfe9e7 "(", size=
=3D5)
>     at /usr/src/sbin/hastd/proto_common.c:77
> #3  0x0805f68d in sp_recv (ctx=3D0x2850e570, data=3D0xbfbfe9e7 "(", size=
=3D5)
>     at /usr/src/sbin/hastd/proto_socketpair.c:185
> #4  0x0805ec61 in proto_recv (conn=3D0x2850e560, data=3D0xbfbfe9e7, size=
=3D5)
>     at /usr/src/sbin/hastd/proto.c:207
> #5  0x0804e42e in hast_proto_recv_hdr (conn=3D0x2850e560, nvp=3D0xbfbfea2=
4)
>     at /usr/src/sbin/hastd/hast_proto.c:308
> #6  0x0804db3c in event_send (res=3D0x284eb500, event=3D1) at /usr/src/sb=
in/hastd/event.c:69
> #7  0x0805a1d1 in init_remote (res=3D0x284eb500, inp=3D0x0, outp=3D0x0)
>     at /usr/src/sbin/hastd/primary.c:671
> #8  0x0805a7f4 in hastd_primary (res=3D0x284eb500) at /usr/src/sbin/hastd=
/primary.c:810
> #9  0x0804c6d5 in control_set_role_common (cfg=3D0x28419600, nvout=3D0x28=
50e4d0, role=3D2 '\002',=20
>     res=3D0x284eb500, name=3D0x28510442 "storage", no=3D0) at /usr/src/sb=
in/hastd/control.c:117
> #10 0x0804cf91 in control_handle (cfg=3D0x28419600) at /usr/src/sbin/hast=
d/control.c:344
> #11 0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682
> #12 0x0805076d in main (argc=3D0, argv=3D0xbfbfecd0) at /usr/src/sbin/has=
td/hastd.c:792
>=20
> This is for primary worker but I believe the same issue is with secondary.
>=20
> The fix would be e.g. to start control thread before sending any event.

Thanks for the report and analysis. Could you try this patch:

	http://people.freebsd.org/~pjd/patches/hastd.patch

--=20
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

--EXKGNeO8l0xGFBjy
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (FreeBSD)

iEYEARECAAYFAkyXYKIACgkQForvXbEpPzQYHwCgpmfaPYBLx6+OKpuLh3PZFZsU
nn8Ani81FaDawZ/BP53CMWs8e6A7fiG4
=wG3l
-----END PGP SIGNATURE-----

--EXKGNeO8l0xGFBjy--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20100920132450.GE2410>