Date: Thu, 2 Nov 2000 14:10:53 -0500
From: Andreas Schweitzer <andy@physast.uga.edu>
To: freebsd-net@freebsd.org
Subject: recv/recvfrom and select are inconsistent on sockets - it hangs
Message-ID: <20001102141053.A27160@bender.physast.uga.edu>
Hi,

I guess this should eventually become a PR, but so far it is very hard
to reproduce unless one has exactly our configuration. Also, I guess
I'm asking where I should look next.

Our problem:
We have 10 machines running FreeBSD 4.2BETA#0 (it also happens in
4.1-RELEASE, where it started) and MPICH 1.2.1 (again, the version is
not that important, we tried different ones).

Occasionally, the programs just hang. It is always at the same spot in
our code. So it is reproducible, but we could not boil it down to a
small and easy demo program. Also, it does not happen with every number
of nodes. E.g. it works with 2 and 6 nodes, but not with 4 and 10.

Here is what gdb says when attached to the running process:

(gdb) bt
#0  0x285102d8 in recvfrom () from /usr/lib/libc.so.4
#1  0x284fd2e7 in recv () from /usr/lib/libc.so.4
#2  0x843e012 in sock_msg_avail_on_fd ()
#3  0x843dc49 in socket_recv ()
#4  0x844862a in recv_message ()
#5  0x84484e5 in p4_recv ()
#6  0x844cb31 in MPID_CH_Check_incoming ()
#7  0x843f8e7 in MPID_RecvComplete ()
#8  0x8440888 in MPID_RecvDatatype ()
#9  0x842de8d in PMPI_Recv ()
#10 0x8437286 in intra_Bcast ()
#11 0x843432f in PMPI_Bcast ()
#12 0x842c1ae in pmpi_bcast_ ()
#13 0x8362d93 in s3r2t_ ()
#14 0x8078eda in main ()
#15 0x804a67d in _start ()

It will sit forever in recv/recvfrom, although a previous select
indicated the presence of data!

Here is the source code from the MPICH library (the
sock_msg_avail_on_fd routine):

    SYSCALL_P4(nfds, select(p4_global->max_connections, &read_fds, 0, 0, &tv));

    if (nfds == -1)
    {
        p4_dprintfl(20,"sock_msg_avail_on_fd selected on %d\n", fd);
        p4_error("sock_msg_avail_on_fd select", nfds);
    }
    if (nfds)    /* true even for eof */
    {
        /* see if data is on the socket or merely an eof condition */
        /* this should not loop long because the select succeeded */
        while ((rc = recv(fd, tempbuf, 1, MSG_PEEK)) == -1)
            ;
        if (rc == 0)    /* if eof */
        {
            . . .
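For reference, a minimal sketch of how such an availability check could be written so that it can neither block nor spin indefinitely. This is not the MPICH code: peek_fd, its return codes and the timeout parameter are hypothetical names chosen for illustration, and the sketch assumes the platform's recv() supports MSG_DONTWAIT. It passes fd + 1 to select, re-arms the fd set and timeout after EINTR, and treats EAGAIN from the peek as "no data yet" instead of retrying on every error as the excerpt does.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical return codes for this sketch; not part of MPICH/p4. */
enum { PEEK_ERROR = -2, PEEK_NONE = -1, PEEK_EOF = 0, PEEK_DATA = 1 };

/*
 * Report whether fd has data (PEEK_DATA), has reached end of file
 * (PEEK_EOF), has nothing to read within timeout_usec (PEEK_NONE),
 * or failed with a real error (PEEK_ERROR, errno is left set).
 */
int peek_fd(int fd, long timeout_usec)
{
    fd_set read_fds;
    struct timeval tv;
    char tempbuf[1];
    ssize_t rc;
    int nfds;

    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EBADF;
        return PEEK_ERROR;
    }

    do {
        /* select may modify both arguments, so rebuild them on each try */
        FD_ZERO(&read_fds);
        FD_SET(fd, &read_fds);
        tv.tv_sec = timeout_usec / 1000000;
        tv.tv_usec = timeout_usec % 1000000;
        /* first argument is the highest descriptor plus one */
        nfds = select(fd + 1, &read_fds, NULL, NULL, &tv);
    } while (nfds == -1 && errno == EINTR);

    if (nfds == -1)
        return PEEK_ERROR;
    if (nfds == 0)
        return PEEK_NONE;

    /* Peek without blocking, in case the select result is already stale. */
    do {
        rc = recv(fd, tempbuf, 1, MSG_PEEK | MSG_DONTWAIT);
    } while (rc == -1 && errno == EINTR);

    if (rc == 1)
        return PEEK_DATA;
    if (rc == 0)
        return PEEK_EOF;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return PEEK_NONE;
    return PEEK_ERROR;
}
```

A caller would treat PEEK_NONE as "try again later" and PEEK_ERROR as a failure to report, rather than waiting inside recv.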
The MPICH code may or may not be bad style, and we hacked a fix into it
to make this particular part work. But we keep encountering similar
problems. Bottom line for this one: select indicates data, but recv
does not get it.

I'm sure this is not yet enough to figure out the problem, so what else
would you need?

The kernel does not have many modifications. We disabled drivers we
don't have (not all, just a quick run through), set the CPU accordingly
and did the following:

maxusers        64
options         MAXDSIZ="(512*1024*1024)"
options         DFLDSIZ="(512*1024*1024)"
options         NMBCLUSTERS=8192

Thanks
Andreas

-- 
Department of Physics & Astronomy and Center for Simulational Physics
University of Georgia          Phone ++1 (706) 542 5043
Athens, GA 30602-2451          Fax   ++1 (706) 542 2492
USA                            http://dilbert.physast.uga.edu/~andy/
NEW ! WWW page for phoenix : http://phoenix.physast.uga.edu