Date: Thu, 28 Jun 2007 09:49:57 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Julian Elischer <julian@elischer.org>
Cc: freebsd-current@freebsd.org
Subject: Re: SYNCOOKIE authentication problems
Message-ID: <20070628164957.GA59038@troutmask.apl.washington.edu>
In-Reply-To: <4683D063.5040105@elischer.org>
References: <20070628014311.GA50012@troutmask.apl.washington.edu> <20070628105039.GE11335@void.codelabs.ru> <20070628151215.GA58165@troutmask.apl.washington.edu> <4683D063.5040105@elischer.org>

On Thu, Jun 28, 2007 at 08:14:43AM -0700, Julian Elischer wrote:
> Steve Kargl wrote:
> >On Thu, Jun 28, 2007 at 02:50:40PM +0400, Eygene Ryabinkin wrote:
> >>Steve, good day.
> >>
> >>Wed, Jun 27, 2007 at 06:43:11PM -0700, Steve Kargl wrote:
> >>>Any advice on how to isolate or avoid?
> >>>
> >>>Jun 27 18:31:19 node11 kernel: TCP: [192.168.0.11]:59661 to
> >>>[192.168.0.11]:63266 tcpflags 0x10<ACK>; syncache_expand: Segment failed
> >>>SYNCOOKIE authentication, segment rejected (probably spoofed)
> >>According to Andre Oppermann, these are harmless:
> >>  http://lists.freebsd.org/pipermail/freebsd-net/2007-June/014401.html
> >>
> >>But I am experiencing some problems related to the other messages
> >>like 'tcp_input: Listen socket: Spurious RST, segment rejected'.
> >>Though it seems not to be your case, but my problems are documented
> >>in the aforementioned thread. Just in case you're curious...
> >>--
> >
> >Andre certainly knows more about TCP/IP than I, but no, these
> >are not harmless. Every time one of these messages appears
> >on the console, my MPI application hangs and must be restarted.
> >My large numerical simulations randomly die anywhere from
> >15 minutes to 25 hours after launching the job.
>
> is the app on that machine or another machine?
>

It's a message passing interface (MPI) application. I have 6 nodes in
a cluster. Each node has 4 CPUs. Each node gets 4 processes. There
are a total of 24 processes, and communication between nodes is over
a GigE network.

This is top(1) output on node11

last pid:  2919;  load averages:  4.76,  4.56,  4.55   up 0+12:48:23  09:45:04
34 processes:  5 running, 29 sleeping
CPU states: 23.6% user,  0.0% nice, 66.4% system, 10.0% interrupt,  0.0% idle
Mem: 4587M Active, 588M Inact, 263M Wired, 596K Cache, 214M Buf, 10G Free
Swap: 17G Total, 17G Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
  896 kargl       1 130    0  1567M  1428M CPU3   2 659:31 86.77% AVL_PS_mpi
  897 kargl       1 130    0  1201M  1061M CPU2   3 655:01 86.33% AVL_PS_mpi
  898 kargl       1 130    0  1201M  1061M RUN    1 653:25 86.18% AVL_PS_mpi
  899 kargl       1 139    0  1201M  1061M RUN    2 655:00 85.89% AVL_PS_mpi

When I get the SYNCOOKIE authentication error message, CPU state
shows 0% user and 99.9% system. All 4 processes show WCPU 99.99%.
This occurs on all the nodes. AFAICT, the processes are spinning
waiting for info from other processes. This info never comes.

-- 
Steve