Date: Fri, 29 Jun 2007 15:47:25 -0700
From: Steve Kargl
To: David Malone
Cc: freebsd-current@freebsd.org
Subject: Re: SYNCOOKIE authentication problems
Message-ID: <20070629224725.GA72396@troutmask.apl.washington.edu>
In-Reply-To: <200706292227.aa62881@salmon.maths.tcd.ie>

On Fri, Jun 29, 2007 at 10:27:06PM +0100, David Malone wrote:
> > Jun 29 09:21:58 node11 kernel: TCP: [192.168.0.12]:54528 to [192.168.0.11]:526
>
> OK - I can see the packets corresponding to this error by
> doing something
> like:
>
> % tcpdump -S -r synfinrstdata -n port 62391 and port 60621

(output elided).

> The start of this looks like a perfectly normal TCP connection -
> it opens normally, transfers about 12 bytes in one direction and
> then closes. Strangely, 192.168.0.11 then sends two FIN packets,
> followed by a reset. The error message produced by the kernel should
> have produced a reset in response, but I'm not sure I can see quite
> enough to see what happened.
>
> We could try to get all of the packets in the connection by doing:
>
> tcpdump -i whatever_interface -w /tmp/fulldump -s 80

I'm doing this now. It seems that putting bge0 in promiscuous mode
has provided some stability. fulldump is currently at 2.4 GB.

> > poll({4/POLLIN 5/POLLIN 6/POLLIN 7/POLLIN 9/POLLIN 10/POLLIN 11/POLLIN 13/POLL
>
> It looks like MPI is looking only for file descriptors to become
> ready for reading. I'd guess one of the file descriptors is in an
> error state, but MPI isn't checking for that, so it is spinning.
>

I've tried both the OpenMPI and MPICH2 implementations. Neither handles
a disappearing process in an elegant manner. They simply assume that
the network is robust and 100% reliable.

--
Steve
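For reference, the poll() guess quoted above can be illustrated with a
minimal sketch. This is not code from OpenMPI or MPICH2; the names
classify_events and poll_once are hypothetical, and Python is used only
for brevity. It assumes a loop that registers descriptors for POLLIN
alone: poll() may still report POLLERR, POLLHUP, or POLLNVAL in the
returned events, and a loop that ignores them can wake up again and
again on the same dead descriptor.

```python
import select

# Conditions poll() can report even when only POLLIN was requested.
ERROR_EVENTS = select.POLLERR | select.POLLHUP | select.POLLNVAL


def classify_events(fd_events):
    """Split (fd, revents) pairs from poll() into readable and failed fds."""
    readable, failed = [], []
    for fd, revents in fd_events:
        if revents & ERROR_EVENTS:
            # Peer went away or the descriptor is invalid. A real program
            # might first drain any buffered data before giving up on it.
            failed.append(fd)
        elif revents & select.POLLIN:
            readable.append(fd)
    return readable, failed


def poll_once(poller, timeout_ms=1000):
    """Poll once and unregister failed fds so the loop cannot spin on them."""
    readable, failed = classify_events(poller.poll(timeout_ms))
    for fd in failed:
        poller.unregister(fd)
    return readable, failed
```

A caller would create the poller with select.poll(), register each
socket for select.POLLIN, and call poll_once() in its main loop,
treating the returned failed list as peers that have disappeared.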