Date:      Thu, 2 Nov 2000 14:10:53 -0500
From:      Andreas Schweitzer <andy@physast.uga.edu>
To:        freebsd-net@freebsd.org
Subject:   recv/recvfrom and select are inconsistent on sockets - it hangs
Message-ID:  <20001102141053.A27160@bender.physast.uga.edu>

Hi,

I guess this should eventually become a PR, but it
is very hard to reproduce unless one has exactly our
configuration - so far, at least.

Also, I guess I'm asking where I should look next.

Our problem :
We have 10 machines running FreeBSD 4.2BETA#0 (it also
happens in 4.1-RELEASE, where it started) and MPICH 1.2.1
(again, the version is not that important, we tried different ones).

Occasionally, the programs just hang. It is always at the same
spot in our code. So it is reproducible, but we could not
boil it down to a small and easy demo program. Also, it
does not happen with every number of nodes. E.g. it works with 2 and 6
nodes, but not with 4 and 10.

Here is what gdb says when attached to the running process :
(gdb) bt
#0  0x285102d8 in recvfrom () from /usr/lib/libc.so.4
#1  0x284fd2e7 in recv () from /usr/lib/libc.so.4
#2  0x843e012 in sock_msg_avail_on_fd ()
#3  0x843dc49 in socket_recv ()
#4  0x844862a in recv_message ()
#5  0x84484e5 in p4_recv ()
#6  0x844cb31 in MPID_CH_Check_incoming ()
#7  0x843f8e7 in MPID_RecvComplete ()
#8  0x8440888 in MPID_RecvDatatype ()
#9  0x842de8d in PMPI_Recv ()
#10 0x8437286 in intra_Bcast ()
#11 0x843432f in PMPI_Bcast ()
#12 0x842c1ae in pmpi_bcast_ ()
#13 0x8362d93 in s3r2t_ ()
#14 0x8078eda in main ()
#15 0x804a67d in _start ()

It will sit forever in recv/recvfrom, although a previous select
indicated the presence of data ! Here is the source code from the
MPICH library code (the sock_msg_avail_on_fd routine):

    SYSCALL_P4(nfds, select(p4_global->max_connections, &read_fds, 0, 0, &tv));
     
    if (nfds == -1)
    {        
        p4_dprintfl(20,"sock_msg_avail_on_fd selected on %d\n", fd);
        p4_error("sock_msg_avail_on_fd select", nfds);
    }
    if (nfds)                   /* true even for eof */
    {
        /* see if data is on the socket or merely an eof condition */
        /* this should not loop long because the select succeeded */
        while ((rc = recv(fd, tempbuf, 1, MSG_PEEK)) == -1)
            ;

        if (rc == 0)    /* if eof */
        {
.
.
.

This code may or may not be bad style, and we hacked a fix into it to make
this particular part work. But we keep encountering similar problems.

Bottom line for this one : select indicates data, but recv does not get it.
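
For illustration only, here is a sketch of how such a check could be written
so that it can never block in recv. It is not the MPICH code and not the fix
we applied. The function name msg_avail_nonblocking and its return codes are
made up for this example. It assumes a stream socket and a platform that
provides MSG_DONTWAIT (it is not part of POSIX). It selects on the one
descriptor only, passes fd + 1 as the first argument, tests FD_ISSET before
peeking, and treats EAGAIN as "nothing there" instead of retrying forever.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of MPICH.
 * Returns  1 if at least one byte is waiting on fd,
 *          0 if nothing is ready right now,
 *         -2 if the peer has closed the connection (eof),
 *         -1 on any other error (errno is set).
 * Neither select() nor recv() is allowed to block here.
 */
int msg_avail_nonblocking(int fd)
{
    fd_set read_fds;
    struct timeval tv;
    char tempbuf;
    ssize_t rc;
    int nfds;

    /* FD_SET is undefined for descriptors outside this range */
    assert(fd >= 0 && fd < FD_SETSIZE);

    do {
        /* rebuild the set and the timeout on every try, select may modify both */
        FD_ZERO(&read_fds);
        FD_SET(fd, &read_fds);
        tv.tv_sec = 0;
        tv.tv_usec = 0;             /* zero timeout: poll only */
        nfds = select(fd + 1, &read_fds, NULL, NULL, &tv);
    } while (nfds == -1 && errno == EINTR);

    if (nfds == -1)
        return -1;
    if (nfds == 0 || !FD_ISSET(fd, &read_fds))
        return 0;                   /* this descriptor is not readable */

    do {
        /* peek at one byte without consuming it and without blocking */
        rc = recv(fd, &tempbuf, 1, MSG_PEEK | MSG_DONTWAIT);
    } while (rc == -1 && errno == EINTR);

    if (rc == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;               /* select said ready, recv disagrees */
        return -1;
    }
    return rc == 0 ? -2 : 1;        /* 0 bytes on a stream socket means eof */
}
```

With something like this, a disagreement between select and recv shows up as
a return value of 0 rather than as a hung process, which would at least make
the situation easier to log.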

I'm sure this is not enough to figure out the problem already, so
what else would you need ?

The kernel does not have a lot of modifications. We disabled drivers we don't
have (not all, just a quick run through), set the CPU accordingly and
made the following changes :
maxusers        64
options         MAXDSIZ="(512*1024*1024)"
options         DFLDSIZ="(512*1024*1024)"
options         NMBCLUSTERS=8192

Thanks
Andreas

-- 
Department of Physics & Astronomy  and  Center for Simulational Physics
University of Georgia                          Phone ++1 (706) 542 5043
Athens, GA 30602-2451                            Fax ++1 (706) 542 2492
USA                               http://dilbert.physast.uga.edu/~andy/

NEW ! WWW page for phoenix :

                   http://phoenix.physast.uga.edu





