From owner-freebsd-current@FreeBSD.ORG Thu Jun 28 16:50:36 2007
Date: Thu, 28 Jun 2007 09:49:57 -0700
From: Steve Kargl
To: Julian Elischer
Cc: freebsd-current@freebsd.org
Subject: Re: SYNCOOKIE authentication problems
Message-ID: <20070628164957.GA59038@troutmask.apl.washington.edu>
In-Reply-To: <4683D063.5040105@elischer.org>
References: <20070628014311.GA50012@troutmask.apl.washington.edu>
 <20070628105039.GE11335@void.codelabs.ru>
 <20070628151215.GA58165@troutmask.apl.washington.edu>
 <4683D063.5040105@elischer.org>
List-Id: Discussions about the use of FreeBSD-current

On Thu, Jun 28, 2007 at 08:14:43AM -0700, Julian Elischer wrote:
> Steve Kargl wrote:
> >On Thu, Jun 28, 2007 at 02:50:40PM +0400,
> >Eygene Ryabinkin wrote:
> >>Steve, good day.
> >>
> >>Wed, Jun 27, 2007 at 06:43:11PM -0700, Steve Kargl wrote:
> >>>Any advice on how to isolate or avoid?
> >>>
> >>>Jun 27 18:31:19 node11 kernel: TCP: [192.168.0.11]:59661 to
> >>>[192.168.0.11]:63266 tcpflags 0x10; syncache_expand: Segment failed
> >>>SYNCOOKIE authentication, segment rejected (probably spoofed)
> >>According to Andre Oppermann, these are harmless:
> >> http://lists.freebsd.org/pipermail/freebsd-net/2007-June/014401.html
> >>
> >>But I am experiencing some problems related to the other messages
> >>like 'tcp_input: Listen socket: Spurious RST, segment rejected'.
> >>Though it seems not to be your case, but my problems are documented
> >>in the aforementioned thread. Just in case you're curious...
> >>--
> >
> >Andre certainly knows more about TCP/IP than I, but no, these
> >are not harmless. Every time one of these messages appears
> >on the console, my MPI application hangs and must be restarted.
> >My large numerical simulations randomly die anywhere from
> >15 minutes to 25 hours after launching the job.
>
> is the app on that machine or another machine?
>

It's a message passing interface (MPI) application. I have 6 nodes
in a cluster. Each node has 4 CPUs. Each node gets 4 processes.
There are a total of 24 processes, and communication between nodes
is over a GigE network.
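
To help answer the "same machine or another machine" question from the
logs themselves, here is a minimal sketch that pulls the endpoints out of
a syncache_expand message. The regular expression is an assumption built
only from the single kernel line quoted in this thread, and the function
name parse_syncookie_line is hypothetical; other kernel versions may
format the message differently.

```python
import re

# Pattern derived only from the one log line quoted in this thread
# (an assumption, not a documented format).
SYNCOOKIE_RE = re.compile(
    r"TCP: \[(?P<src>[^\]]+)\]:(?P<sport>\d+) to "
    r"\[(?P<dst>[^\]]+)\]:(?P<dport>\d+) "
    r"tcpflags (?P<flags>0x[0-9a-fA-F]+); "
    r"syncache_expand: Segment failed SYNCOOKIE authentication"
)


def parse_syncookie_line(line):
    """Return the endpoints of a SYNCOOKIE failure line, or None if no match."""
    m = SYNCOOKIE_RE.search(line)
    if m is None:
        return None
    return {
        "src": m.group("src"),
        "sport": int(m.group("sport")),
        "dst": m.group("dst"),
        "dport": int(m.group("dport")),
        "flags": int(m.group("flags"), 16),  # 0x10 is the TCP ACK bit
        # True when both ends of the rejected segment are the same address
        "same_host": m.group("src") == m.group("dst"),
    }


# The message from the thread, joined back onto one line.
sample = (
    "Jun 27 18:31:19 node11 kernel: TCP: [192.168.0.11]:59661 to "
    "[192.168.0.11]:63266 tcpflags 0x10; syncache_expand: Segment failed "
    "SYNCOOKIE authentication, segment rejected (probably spoofed)"
)
info = parse_syncookie_line(sample)
print(info)
```

For the quoted message, both addresses are 192.168.0.11, so same_host
comes out true: the rejected segment was between two ports on node11
itself rather than between nodes.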

This is top(1) output on node11:

last pid:  2919;  load averages:  4.76,  4.56,  4.55    up 0+12:48:23  09:45:04
34 processes:  5 running, 29 sleeping
CPU states: 23.6% user,  0.0% nice, 66.4% system, 10.0% interrupt,  0.0% idle
Mem: 4587M Active, 588M Inact, 263M Wired, 596K Cache, 214M Buf, 10G Free
Swap: 17G Total, 17G Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
  896 kargl       1 130    0  1567M  1428M CPU3   2 659:31 86.77% AVL_PS_mpi
  897 kargl       1 130    0  1201M  1061M CPU2   3 655:01 86.33% AVL_PS_mpi
  898 kargl       1 130    0  1201M  1061M RUN    1 653:25 86.18% AVL_PS_mpi
  899 kargl       1 139    0  1201M  1061M RUN    2 655:00 85.89% AVL_PS_mpi

When I get the SYNCOOKIE authentication error message, CPU state
shows 0% user and 99.9% system. All 4 processes show WCPU 99.99%.
This occurs on all the nodes. AFAICT, the processes are spinning
waiting for info from other processes. This info never comes.

--
Steve