From owner-freebsd-net Thu Nov 2 12:40:59 2000
Delivered-To: freebsd-net@freebsd.org
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
    by hub.freebsd.org (Postfix) with ESMTP id 6D93C37B479
    for ; Thu, 2 Nov 2000 12:40:56 -0800 (PST)
Received: (from bright@localhost)
    by fw.wintelcom.net (8.10.0/8.10.0) id eA2KetD09146;
    Thu, 2 Nov 2000 12:40:55 -0800 (PST)
Date: Thu, 2 Nov 2000 12:40:55 -0800
From: Alfred Perlstein
To: Andreas Schweitzer
Cc: freebsd-net@FreeBSD.ORG
Subject: Re: recv/recvfrom and select are inconsistent on sockets - it hangs
Message-ID: <20001102124054.V20567@fw.wintelcom.net>
References: <20001102141053.A27160@bender.physast.uga.edu> <20001102112121.U20567@fw.wintelcom.net> <20001102144039.B27160@bender.physast.uga.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.4i
In-Reply-To: <20001102144039.B27160@bender.physast.uga.edu>; from andy@physast.uga.edu on Thu, Nov 02, 2000 at 02:40:39PM -0500
Sender: owner-freebsd-net@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

* Andreas Schweitzer [001102 11:40] wrote:
> > > It will sit forever in recv/recvfrom, although a previous select
> > > indicated the presence of data! Here is the source code from the
> > > MPICH library code (the sock_msg_avail_on_fd routine):
> > >
> > >     while ((rc = recv(fd, tempbuf, 1, MSG_PEEK)) == -1)
> > >         ;
> >
> > This is terrible!
>
> Agreed - it's from the MPICH library, not our code
> (/usr/ports/net/mpich/work/mpich-1.2.1/mpid/ch_p4/p4/lib/p4_sock_sr.c)

Someone needs to "lay some smack" on these guys.

> > First off, what was the hack you used to fix this?
>
> /* begin Halloween Hack */
>
>     if((rc = recv(fd, tempbuf, 1, MSG_PEEK)) == -1) return(0);
>
> /* end Halloween Hack */

OK, it's quite possible that what I said is true: something may have
become corrupt, so recv keeps returning the same error and the loop
never exits.
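
A minimal sketch of a check that cannot spin forever, assuming the only
error worth retrying is an interrupted call. The name peek_once is made
up for illustration and is not anything in MPICH:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical replacement for the busy loop: peek at one byte and
 * report what happened instead of retrying on every -1.
 * Returns 1 if data is waiting, 0 on EOF, and -1 on an error, with
 * errno left set so the caller can log it.
 */
int
peek_once(int fd)
{
    char tempbuf;
    ssize_t rc;

    do {
        rc = recv(fd, &tempbuf, 1, MSG_PEEK);
    } while (rc == -1 && errno == EINTR);   /* retry only if interrupted */

    if (rc == -1)
        return (-1);    /* e.g. EBADF/ENOTSOCK/EFAULT will not clear */
    return (rc > 0 ? 1 : 0);
}
```

On a non-blocking socket this also returns -1 for EAGAIN, so the caller
still has to look at errno before deciding what to do next.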
>
>     if (nfds)               /* true even for eof */
>     {
>         /* see if data is on the socket or merely an eof condition */
>         /* this should not loop long because the select succeeded */
>         while ((rc = recv(fd, tempbuf, 1, MSG_PEEK)) == -1) ;
>
> It may be just as bad. And it works, because this routine is looped over
> as well.
>
> > What other problems?
>
> Problems that a program hangs when reading from sockets.

OK, I thought you meant there were unrelated problems...

> > What is the actual errno you see come back from recv?
>
> We did not check this yet, I'll try that next.

Please do, it would help a lot.

> > It's possible that you've corrupted some internal pointers such that
> > recv is returning EBADF/ENOTSOCK/EFAULT which would cause an infinite
> > loop.
>
> Possible. But it's all rather deep in the guts of MPICH and how it talks
> to the OS.

It's possibly an MPICH bug (with the code I've seen so far I don't doubt
it); however, corrupting a library's state is pretty easy, so that is
also very possible.

> > Are you blocking in recv? or looping on that call?
>
> As far as I understand the MPI routine, it does not do much more than
> that snippet, and another routine loops around it.
>
> A general comment: it may very well be that the MPI code is
> "not very clean". But we were thinking that some internals in the
> OS may not be the way they should be. And that it may even be
> fixable by simply tuning some parameters - I just have no idea where to
> look.

I really can't say without errno, but it seems like it's a bug in your
code or MPICH, not FreeBSD. Get me the errno that's set and we'll have
a definite answer.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
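
For reporting the errno, something along these lines would do. It is
only a sketch: both function names are hypothetical, and the split into
"transient" and "permanent" errors is my assumption about which values
can clear on a retry, not anything taken from the MPICH source:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: can this errno from recv() clear on a retry?
 * Anything not listed as transient is treated as permanent.
 */
int
recv_error_is_transient(int err)
{
    switch (err) {
    case EINTR:
    case EAGAIN:
#if defined(EWOULDBLOCK) && EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
        return (1);
    default:
        return (0);     /* EBADF, ENOTSOCK, EFAULT, ... */
    }
}

/* Format a one-line report into buf; returns buf for convenience. */
char *
describe_recv_error(int fd, int err, char *buf, size_t len)
{
    snprintf(buf, len, "recv(fd=%d, MSG_PEEK) failed: errno=%d (%s), %s",
        fd, err, strerror(err),
        recv_error_is_transient(err) ?
        "retry is reasonable" : "retrying will not help");
    return (buf);
}
```

Call describe_recv_error with the errno saved right after recv returns
-1, before any other library call can overwrite it, and print the
result to stderr.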