Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 19 Sep 2010 12:51:54 +0300
From:      Mikolaj Golub <to.my.trociny@gmail.com>
To:        freebsd-fs@freebsd.org
Subject:   hastd: possible race when a worker is starting
Message-ID:  <86d3saawn9.fsf@kopusha.home.net>

next in thread | raw e-mail | index | archive | help
Hi,

After recent changes in hastd we have the following scenario when a worker
process is started:

1) worker process is created;
2) remote connections are established and CONNECT event is sent to parent;
3) threads (among them control thread) are started.

Consider a sitution when hastctl request is recevied by parent between (1) and
(2). The parent sees that worker is running and sends it a request and waits
for a response. The request should be processed by control thread, but it is
not started yet. Instead the worker sends CONNECT event and waits for a
response from the parent. We have a deadlock.

I was able to reproduce this scenario running a script that runs 'hastctl
status' requests in background and periodically kills the worker to start a
new one:

----------------------------------------------------------------------

#!/bin/sh

hastctl role primary storage

while true; do
        hastctl status storage > /dev/null
done &

while sleep 1; do
        pid=`ps auxww | awk '/hastd: storage \(primary\)/ {print $2}'`
        if [ -n "${pid}" ]; then
                kill "${pid}"
                sleep 1;
                npid=`ps auxww | awk '/hastd: storage \(primary\)/ {print $2}'`
                if [ "${pid}" = "${npid}" ]; then
                        echo "Process ${pid} was not killed" >&2
                        exit 1
                fi
        fi
done

----------------------------------------------------------------------

[root@lolek ~]# ./test_event2.sh 
Process 17559 was not killed

[root@lolek ~]# ps auxww |grep hast
root  17563   2.0  0.3  10924   1764   1  S     1:03PM   0:00.01 hastctl status storage
root   3334   0.0  0.5  11244   2488  ??  Ss   12:13PM   0:02.06 /sbin/hastd -ddd
root  17559   0.0  0.5  11244   2520  ??  S     1:03PM   0:00.01 hastd: storage (primary) (hastd)
root  17561   0.0  0.3  10924   1764   1  S     1:03PM   0:00.01 hastctl status storage

[root@lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 3334

(gdb) bt
#0  0x28300ed5 in recvfrom () from /lib/libc.so.7
#1  0x28286f52 in recv () from /lib/libc.so.7
#2  0x0805f237 in proto_common_recv (fd=41, data=0xbfbfeaa7 "(", size=5)
    at /usr/src/sbin/hastd/proto_common.c:77
#3  0x0805f68d in sp_recv (ctx=0x2850e550, data=0xbfbfeaa7 "(", size=5)
    at /usr/src/sbin/hastd/proto_socketpair.c:185
#4  0x0805ec61 in proto_recv (conn=0x2850e540, data=0xbfbfeaa7, size=5)
    at /usr/src/sbin/hastd/proto.c:207
#5  0x0804e42e in hast_proto_recv_hdr (conn=0x2850e540, nvp=0xbfbfeae4)
    at /usr/src/sbin/hastd/hast_proto.c:308
#6  0x0804c83f in control_status_worker (res=0x284eb500, nvout=0x2850e4d0, no=0)
    at /usr/src/sbin/hastd/control.c:160
#7  0x0804cc9f in control_status (cfg=0x28419600, nvout=0x2850e4d0, res=0x284eb500, 
    name=0x2851042c "storage", no=0) at /usr/src/sbin/hastd/control.c:248
#8  0x0804cfbb in control_handle (cfg=0x28419600) at /usr/src/sbin/hastd/control.c:348
#9  0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682
#10 0x0805076d in main (argc=0, argv=0xbfbfecd0) at /usr/src/sbin/hastd/hastd.c:792

[root@lolek ~]# gdb /usr/obj/usr/src/sbin/hastd/hastd 17559

(gdb) bt
#0  0x28300ed5 in recvfrom () from /lib/libc.so.7
#1  0x28286f52 in recv () from /lib/libc.so.7
#2  0x0805f237 in proto_common_recv (fd=43, data=0xbfbfe9e7 "(", size=5)
    at /usr/src/sbin/hastd/proto_common.c:77
#3  0x0805f68d in sp_recv (ctx=0x2850e570, data=0xbfbfe9e7 "(", size=5)
    at /usr/src/sbin/hastd/proto_socketpair.c:185
#4  0x0805ec61 in proto_recv (conn=0x2850e560, data=0xbfbfe9e7, size=5)
    at /usr/src/sbin/hastd/proto.c:207
#5  0x0804e42e in hast_proto_recv_hdr (conn=0x2850e560, nvp=0xbfbfea24)
    at /usr/src/sbin/hastd/hast_proto.c:308
#6  0x0804db3c in event_send (res=0x284eb500, event=1) at /usr/src/sbin/hastd/event.c:69
#7  0x0805a1d1 in init_remote (res=0x284eb500, inp=0x0, outp=0x0)
    at /usr/src/sbin/hastd/primary.c:671
#8  0x0805a7f4 in hastd_primary (res=0x284eb500) at /usr/src/sbin/hastd/primary.c:810
#9  0x0804c6d5 in control_set_role_common (cfg=0x28419600, nvout=0x2850e4d0, role=2 '\002', 
    res=0x284eb500, name=0x28510442 "storage", no=0) at /usr/src/sbin/hastd/control.c:117
#10 0x0804cf91 in control_handle (cfg=0x28419600) at /usr/src/sbin/hastd/control.c:344
#11 0x08050357 in main_loop () at /usr/src/sbin/hastd/hastd.c:682
#12 0x0805076d in main (argc=0, argv=0xbfbfecd0) at /usr/src/sbin/hastd/hastd.c:792

This is for primary worker but I believe the same issue is with secondary.

The fix would be e.g. to start control thread before sending any event.

-- 
Mikolaj Golub



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?86d3saawn9.fsf>