Date:      Thu, 2 Nov 2000 11:21:22 -0800
From:      Alfred Perlstein <bright@wintelcom.net>
To:        Andreas Schweitzer <andy@physast.uga.edu>
Cc:        freebsd-net@FreeBSD.ORG
Subject:   Re: recv/recvfrom and select are inconsistent on sockets - it hangs
Message-ID:  <20001102112121.U20567@fw.wintelcom.net>
In-Reply-To: <20001102141053.A27160@bender.physast.uga.edu>; from andy@physast.uga.edu on Thu, Nov 02, 2000 at 02:10:53PM -0500
References:  <20001102141053.A27160@bender.physast.uga.edu>

* Andreas Schweitzer <andy@physast.uga.edu> [001102 11:11] wrote:
> Hi,
> 
> I guess this should eventually become a PR, but it
> is very hard to reproduce unless one has exactly our
> configuration - so far, at least.
> 
> Also, I guess I'm asking where I should look next.
> 
> Our problem :
> We have 10 machines running FreeBSD 4.2BETA#0 (it also
> happens in 4.1-RELEASE, where it started) and MPICH 1.2.1
> (again, the version is not that important, we tried different ones).
> 
> Occasionally, the programs just hang. It is always at the same
> spot in our code. So it is reproducible, but we could not
> boil it down to a small and easy demo program. Also, it
> does not happen with every number of nodes. E.g. it works with
> 2 and 6 nodes, but not with 4 and 10.
> 
> Here is what gdb says when attached to the running process :
> (gdb) bt
> #0  0x285102d8 in recvfrom () from /usr/lib/libc.so.4
> #1  0x284fd2e7 in recv () from /usr/lib/libc.so.4
> #2  0x843e012 in sock_msg_avail_on_fd ()
> #3  0x843dc49 in socket_recv ()
> #4  0x844862a in recv_message ()
> #5  0x84484e5 in p4_recv ()
> #6  0x844cb31 in MPID_CH_Check_incoming ()
> #7  0x843f8e7 in MPID_RecvComplete ()
> #8  0x8440888 in MPID_RecvDatatype ()
> #9  0x842de8d in PMPI_Recv ()
> #10 0x8437286 in intra_Bcast ()
> #11 0x843432f in PMPI_Bcast ()
> #12 0x842c1ae in pmpi_bcast_ ()
> #13 0x8362d93 in s3r2t_ ()
> #14 0x8078eda in main ()
> #15 0x804a67d in _start ()
> 
> It will sit forever in recv/recvfrom, although a previous select
> indicated the presence of data ! Here is the source code from the
> MPICH library code (the sock_msg_avail_on_fd routine):
> 
>     SYSCALL_P4(nfds, select(p4_global->max_connections, &read_fds, 0, 0, &tv));
>      
>     if (nfds == -1)
>     {        
>         p4_dprintfl(20,"sock_msg_avail_on_fd selected on %d\n", fd);
>         p4_error("sock_msg_avail_on_fd select", nfds);
>     }
>     if (nfds)                   /* true even for eof */
>     {
>         /* see if data is on the socket or merely an eof condition */
>         /* this should not loop long because the select succeeded */
>         while ((rc = recv(fd, tempbuf, 1, MSG_PEEK)) == -1)
>             ;

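This is terrible! The loop retries on every error from recv, not
just transient ones, so any persistent failure spins forever.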

> 
>         if (rc == 0)    /* if eof */
>         {
> .
> .
> .
> 
> It may or may not be bad style and we hacked a fix into it to
> make this particular part work. But we keep encountering similar
> problems.

First off, what was the hack you used to fix this?

What other problems?

What is the actual errno you see come back from recv?

It's possible that you've corrupted some internal pointers such that
recv is returning EBADF/ENOTSOCK/EFAULT, which would cause an infinite
loop.
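
To make the errno question concrete, here is a minimal sketch of the
peek that retries only on EINTR and reports anything else. The helper
name peek_one_byte is hypothetical and not part of MPICH; it only
illustrates the idea and is not a drop-in patch for
sock_msg_avail_on_fd.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not part of MPICH.
 * Returns 1 if at least one byte is pending, 0 on EOF,
 * and -1 on a real error (errno is left set for the caller).
 */
static int
peek_one_byte(int fd)
{
    char tempbuf;
    ssize_t rc;
    int saved_errno;

    for (;;) {
        rc = recv(fd, &tempbuf, 1, MSG_PEEK);
        if (rc >= 0)
            return (rc == 0 ? 0 : 1);
        if (errno == EINTR)
            continue;   /* interrupted by a signal: retrying is safe */

        /* EBADF, ENOTSOCK, EFAULT, ...: retrying can never succeed. */
        saved_errno = errno;
        fprintf(stderr, "recv(fd=%d) failed: %s (errno %d)\n",
            fd, strerror(saved_errno), saved_errno);
        errno = saved_errno;    /* fprintf may have changed errno */
        return (-1);
    }
}
```

If the process really is spinning on a bad descriptor, the message
printed here answers the question directly instead of hiding it
inside an empty loop body.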

Are you blocking in recv, or looping on that call?
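
One way to tell the two cases apart is to make the peek non-blocking,
so that "select said readable but nothing is there" comes back as
EAGAIN instead of a hang. The sketch below assumes MSG_DONTWAIT is
available; it is not in POSIX, though FreeBSD and Linux both define
it. The enum and the function name peek_nonblocking are hypothetical,
chosen only for this illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical result codes for this sketch. */
enum peek_status { PEEK_DATA, PEEK_EOF, PEEK_WOULD_BLOCK, PEEK_ERROR };

/*
 * Peek at one byte without ever sleeping in the kernel.
 * Assumes MSG_DONTWAIT is supported by the platform.
 */
static enum peek_status
peek_nonblocking(int fd)
{
    char tempbuf;
    ssize_t rc;

    do {
        rc = recv(fd, &tempbuf, 1, MSG_PEEK | MSG_DONTWAIT);
    } while (rc == -1 && errno == EINTR);   /* retry only on signals */

    if (rc > 0)
        return (PEEK_DATA);
    if (rc == 0)
        return (PEEK_EOF);
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return (PEEK_WOULD_BLOCK);  /* a blocking recv would hang here */
    return (PEEK_ERROR);            /* errno says why */
}
```

A PEEK_WOULD_BLOCK result right after select reported the descriptor
readable would point at blocking in recv; PEEK_ERROR would point at
the looping case.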

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

