From owner-freebsd-current@FreeBSD.ORG  Thu Dec 14 18:08:37 2006
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
X-Original-To: freebsd-current@freebsd.org
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 2050516A494
	for <freebsd-current@freebsd.org>; Thu, 14 Dec 2006 18:08:37 +0000 (UTC)
	(envelope-from rrs@cisco.com)
Received: from sj-iport-4.cisco.com (sj-iport-4.cisco.com [171.68.10.86])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 5892143F8B
	for <freebsd-current@freebsd.org>; Thu, 14 Dec 2006 17:59:02 +0000 (GMT)
	(envelope-from rrs@cisco.com)
Received: from sj-dkim-6.cisco.com ([171.68.10.81])
	by sj-iport-4.cisco.com with ESMTP; 14 Dec 2006 10:00:31 -0800
Received: from sj-core-4.cisco.com (sj-core-4.cisco.com [171.68.223.138])
	by sj-dkim-6.cisco.com (8.12.11/8.12.11) with ESMTP id kBEI0VrL008082
	for <freebsd-current@freebsd.org>; Thu, 14 Dec 2006 10:00:31 -0800
Received: from xbh-sjc-221.amer.cisco.com (xbh-sjc-221.cisco.com
	[128.107.191.63])
	by sj-core-4.cisco.com (8.12.10/8.12.6) with ESMTP id kBEI0B7D007580
	for <freebsd-current@freebsd.org>; Thu, 14 Dec 2006 10:00:30 -0800 (PST)
Received: from xfe-sjc-212.amer.cisco.com ([171.70.151.187]) by
	xbh-sjc-221.amer.cisco.com with Microsoft SMTPSVC(6.0.3790.1830); 
	Thu, 14 Dec 2006 10:00:27 -0800
Received: from [127.0.0.1] ([171.68.225.134]) by xfe-sjc-212.amer.cisco.com
	with Microsoft SMTPSVC(6.0.3790.1830); 
	Thu, 14 Dec 2006 10:00:27 -0800
Message-ID: <45819115.9060702@cisco.com>
Date: Thu, 14 Dec 2006 12:59:49 -0500
From: Randall Stewart <rrs@cisco.com>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050920
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: freebsd-current@freebsd.org
References: <457EA7E3.2010302@cisco.com>
In-Reply-To: <457EA7E3.2010302@cisco.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-OriginalArrivalTime: 14 Dec 2006 18:00:27.0254 (UTC)
	FILETIME=[BC4E7160:01C71FA9]
DKIM-Signature: v=0.5; a=rsa-sha256; q=dns/txt; l=6244; t=1166119231;
	x=1166983231; c=relaxed/simple; s=sjdkim6002;
	h=Content-Type:From:Subject:Content-Transfer-Encoding:MIME-Version;
	d=cisco.com; i=rrs@cisco.com;
	z=From:=20Randall=20Stewart=20<rrs@cisco.com>
	|Subject:=20Re=3A=20curious=20results |Sender:=20;
	bh=eduQ0N025dEKzrkMH7FZjqqWBmyaiJPvCMrQoCYz/5w=;
	b=igQpx9RAUvk+DusbJjSd8oQhDNFW4rPiqtNxwtB/sPaSH2dk4tCLce2Mu/BG0WFQRlicRTky
	4VSVMbH2OQvPWGOKQnUl3kqEHBpcXAnu163iQ/vpA4YtWUKe0p64ruGN;
Authentication-Results: sj-dkim-6; header.From=rrs@cisco.com; dkim=pass (sig
	from cisco.com/sjdkim6002 verified; ); 
Subject: Re: curious results
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Dec 2006 18:08:37 -0000

All:

Ok, the second problem I noted.. where the system freezes... I can
reproduce pretty regularly... the problem is I can't seem
to gain any information from it.

If I hit ANY key it starts back up again... for example if I
leave it on the screen running top.. if I strike
Ctl- to do a Ctl-alt-esc to drop to ddb... bam the
time updates (jumping an hour in the last instance)...

So when I look at the core.. everything looks normal..

At the same time I am seeing
"Limiting closed port RST response from 257 to 200 packets/sec"
and some
"calcru: runtime went backwards... " messages as well..

Any suggestions on how I can gather any data from this event...

Oh one other note.. the machine cannot be pinged when it
reaches this state... so a network interrupt will not revive
it ..

Very weird..

R

Randall Stewart wrote:
> All:
> 
> I have two machines I am testing with... an Intel Xeon
> 2.8 Gig w/hyperthreading... and an Intel P4D dual core.
> 
> Now I am testing SCTP and how it interacts with SMP.. or
> that is my intention. I have a snapshot of the MPI code
> that one of my friends at UBC has been working on
> with Argonne Labs... This uses SCTP :-)
> 
> He has written a series of tests which all now pass
> (after a LOT of bugs and LOR's.. all kinds of fun :D)
> 
> Now a separate test he has written is something called
> mywaitall. Basically you set up a number of processes,
> all of them get up and settled in. Then they coordinate
> (near as I can tell) sharing SCTP port and address info
> with each other via TCP. Then they switch over and
> use the SCTP one-2-many model.. sending data to each other
> setting up implicit connections.
> 
> This means that running -np 10.. I have 10 endpoints with
> 90 associations ... I am doing this only on the local
> host side.
> 
> I run this in a
> while true
> do
> mpiexec -np 10 ./mywaitall
> echo "-------"
> done
> 
> 
> Now on the xeon machine I see a very curious failure. After
> about a day of running this. I get two endpoints stuck
> one has data to be transmitted the other is waiting for
> it.. (the way the program works is they all send/rcv some
> info and then say goodbye to each other).
> 
> Now I am seeing loss because the app version I have is
> buggy... the author did not handle the sending in non-blocking
> mode correctly. He thought he got a -1/EAGAIN.. instead of
> a 0/0 back.. so he ends up in a tight loop doing
> while (sent > -1)
>    ret = dothesend()
>    if(ret < 0)
>       break // error
>    sent += ret;
> 
> 
> Which means we peg the CPU sending with full send windows.. He
> has fixed this.. but I keep testing with the buggy version since
> it finds so many unique problems :-)
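
A minimal sketch of how a send loop like the one quoted above could avoid pegging the CPU: treat "no progress" (a 0 return, or -1 with EAGAIN) as a cue to wait for the socket to become writable instead of retrying at once. This assumes a plain send() on a non-blocking socket rather than whatever SCTP call mywaitall really uses, and send_all_nonblocking() and set_nonblocking() are hypothetical names, not taken from the app.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: switch a descriptor to non-blocking mode. */
static int
set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return (-1);
	return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0);
}

/*
 * Hypothetical helper: queue len bytes on a non-blocking socket.
 * Returns 0 once everything has been handed to the kernel,
 * -1 on a real error (errno is left set by the failing call).
 */
static int
send_all_nonblocking(int fd, const char *buf, size_t len)
{
	size_t sent = 0;

	while (sent < len) {
		ssize_t ret = send(fd, buf + sent, len - sent, 0);

		if (ret > 0) {
			/* Progress: account for it and try the rest. */
			sent += (size_t)ret;
			continue;
		}
		if (ret < 0 && errno == EINTR)
			continue;
		if (ret < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
			return (-1);	/* real error */

		/*
		 * No progress (0 back, or -1/EAGAIN): the send window is
		 * full, so sleep until the socket is writable again.
		 */
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };

		if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
			return (-1);
	}
	return (0);
}
```

The point is only that both "would block" shapes are handled the same way, by waiting in poll(); an app built on kqueue or select would wait there instead.
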
> 
> But back to my scenario. Now I have, in the past, fixed many
> bugs like this that were an SCTP problem :-) but this one
> I don't think so anymore..
> 
> When I find and look at the assoc's in question the sender
> has some outstanding chunks (4 in the last instance) and
> its timer is running as far as it is concerned. Here is
> the actual callout structure:
> 
> $6 = {c_links = {sle = {sle_next = 0x0},
>                  tqe = {tqe_next = 0x0,
>                  tqe_prev = 0xc6dd02a8}},
>        c_time = 264796819,
>        c_arg = 0xc27201ec,
>        c_func = 0xc0748458 <sctp_timeout_handler>,
>        c_mtx = 0x0,
>        c_flags = 22}
> 
> Now there is another part to the structure (the c_arg) and if
> I look in there I see things like it being started 1 second
> before (which one would expect... I save the ticks of
> when it was started). I also have a stopped_from field
> that gets set any time someone does a stop of the timer
> and when the callout is called it sets it to various
> values. The time structure is opaque to the SCTP code so
> it does not play with these values.. and when you look at
> the ticks, its long past expiration..
> 
> Note that the 22 indicates NO_MTX | PENDING and ACTIVE.
> And yet the linked lists in c_links are NOT set to anything
> like what I normally see for these dudes..
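
For anyone checking the arithmetic on c_flags, a small sketch that splits a flags word into its individual bits. It only does the decomposition; which bit carries which CALLOUT_* name has to come from sys/callout.h in the tree being debugged and is not reproduced here. set_bits() and print_flags() are hypothetical names for illustration.

```c
#include <stdio.h>

/*
 * Collect the individual bits set in a flags word, lowest first.
 * Returns how many bits were stored in out[] (at most max).
 */
static int
set_bits(unsigned int flags, unsigned int *out, int max)
{
	int n = 0;

	/* bit wraps to 0 after the top bit, which ends the loop. */
	for (unsigned int bit = 1; bit != 0 && n < max; bit <<= 1) {
		if (flags & bit)
			out[n++] = bit;
	}
	return (n);
}

/* Print the flags word followed by each set bit in hex. */
static void
print_flags(unsigned int flags)
{
	unsigned int bits[32];
	int n = set_bits(flags, bits, 32);

	printf("c_flags = %u:", flags);
	for (int i = 0; i < n; i++)
		printf(" 0x%02x", bits[i]);
	printf("\n");
}
```

Calling print_flags(22) shows three bits set, 0x02, 0x04 and 0x10, which matches the three flags read off above.
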
> 
> Now I did put an extra global SCTP lock in before starting/stopping
> the timer. This did make it take 2-3 days to hit this case.. but
> it still happens....
> 
> Has anyone seen this ?? I have looked at the timer code and I
> do see a mutex spin lock.. but I can't see how SCTP would be
> causing this... I am stumped .. any suggestions would be welcome ;-)
> 
> --------------------
> 
> My second problem is even more bizarre.. if that's possible...:-D
> 
> The other p4d runs along fine for a day or so .. and then it will
> just stop.. and I mean stop.. if you have a top window up (I have
> x off.. to panic it when I want :-D) you see the time frozen. No
> updates... it just freezes...
> 
> If you type in anything.. the machine picks up again and starts
> running as if nothing had happened. The last time I created
> this the time had been frozen for at least 12 hours before
> I got to it :-D
> 
> I dropped directly into KDB and pulled a crash dump...
> 
> Looking at all of the SCTP assoc's there was NOTHING
> happening.. no data in flight..  nothing...
> 
> Now, as in the past, type any key, or change to another set
> of windows ... and ta-da.. off it runs..
> 
> I do see a few TCP timeout alarms in the app (remember
> the app talks TCP to set up the SCTP stuff)...
> 
> Very weird...
> 
> Any ideas or suggestions would be welcome.
> 
> I just did an update in prep of doing a patch (currently
> passed to gnn for approval).. so my cores are invalid.. but
> I can recreate them pretty easily .. it just takes a
> day or so :-)
> 
> I can also let anyone who is interested onto the machine
> when problem one occurs... and let them
> peruse the timers or whatever of the running kernel.. or
> take a crash dump and let you look at that..
> 
> If anyone has heard of anything like this I would appreciate
> some pointers.. it could be something SCTP is doing... at
> least the timer one..
> 
> Thanks for any suggestions..
> 
> R
> 
> 


-- 
Randall Stewart
NSSTG - Cisco Systems Inc.
803-345-0369 <or> 803-317-4952 (cell)