From: Randall Stewart <rrs@cisco.com>
Date: Thu, 14 Dec 2006 12:59:49 -0500
To: freebsd-current@freebsd.org
Subject: Re: curious results
Message-ID: <45819115.9060702@cisco.com>
In-Reply-To: <457EA7E3.2010302@cisco.com>
References: <457EA7E3.2010302@cisco.com>
List-Id: Discussions about the use of FreeBSD-current

All:

Ok, the second problem I noted, where the system freezes: I can
reproduce it pretty regularly. The problem is I can't seem to gain
any information from it. If I hit ANY key it starts back up again.
For example, if I leave it on the screen running top and strike
Ctl- (to do a Ctl-alt-esc to drop to ddb)... bam, the time updates
(jumping an hour in the last instance).

So when I look at the core, everything looks normal. At the same
instant I am seeing

"Limiting closed port RST response from 257 to 200 packets/sec"

and some "calcru: runtime went backwards... " messages as well.

Any suggestions on how I can gather any data from this event?

Oh, one other note: the machine cannot be pinged when it reaches
this state, so a network interrupt will not revive it.

Very weird..

R

Randall Stewart wrote:
> All:
>
> I have two machines I am testing with... an Intel Xeon
> 2.8 Gig w/hyperthreading... and an Intel P4D dual core.
>
> Now I am testing SCTP and how it interacts with SMP.. or
> that is my intention. I have a snapshot of the MPI code
> that one of my friends at UBC has been working on
> with Argonne Labs...
> This uses SCTP :-)
>
> He has written a series of tests which all now pass
> (after a LOT of bugs and LOR's.. all kinds of fun :D)
>
> Now a separate test he has written is something called
> mywaitall. Basically you set up a number of processes,
> all of them get up and settled in. Then they coordinate
> (near as I can tell) sharing SCTP port and address info
> with each other via TCP. Then they switch over and
> use the SCTP one-2-many model.. sending data to each other,
> setting up implicit connections.
>
> This means that running -np 10.. I have 10 endpoints with
> 90 associations ... I am doing this only on the local
> host side.
>
> I run this in a
>
>     while true
>     do
>         mpiexec -np 10 ./mywaitall
>         echo "-------"
>     done
>
> Now on the xeon machine I see a very curious failure after
> about a day of running this. I get two endpoints stuck:
> one has data to be transmitted, the other is waiting for
> it.. (the way the program works is they all send/rcv some
> info and then say goodbye to each other).
>
> Now I am seeing loss because the app version I have is
> buggy... the author did not handle the sending in non-blocking
> mode correctly. He thought he got a -1/EAGAIN.. instead of
> a 0/0 back.. so he ends up in a tight loop doing
>
>     while (sent > -1) {
>         ret = dothesend();
>         if (ret < 0)
>             break;      // error
>         sent += ret;    // ret is 0 when nothing could be sent, so this spins
>     }
>
> Which means we peg the CPU sending with full send windows.. He
> has fixed this.. but I keep testing with the buggy version since
> it finds so many unique problems :-)
>
> But back to my scenario. Now I have, in the past, fixed many
> bugs like this that were an SCTP problem :-) but this one
> I don't think so anymore..
>
> When I find and look at the assoc's in question the sender
> has some outstanding chunks (4 in the last instance) and
> its timer is running as far as it is concerned.
> Here is the actual callout structure:
>
> $6 = {c_links = {sle = {sle_next = 0x0},
>                  tqe = {tqe_next = 0x0,
>                         tqe_prev = 0xc6dd02a8}},
>       c_time = 264796819,
>       c_arg = 0xc27201ec,
>       c_func = 0xc0748458,
>       c_mtx = 0x0,
>       c_flags = 22}
>
> Now there is another part to the structure (the c_arg) and if
> I look in there I see things like it being started 1 second
> before (which one would expect... I save the ticks of
> when it was started). I also have a stopped_from field
> that gets set any time someone does a stop of the timer,
> and when the callout is called it sets it to various
> values. The time structure is opaque to the SCTP code so
> it does not play with these values.. and when you look at
> the ticks, it's long past expiration..
>
> Note that the 22 indicates NO_MTX | PENDING and ACTIVE.
> And yet the list pointers in c_links are NOT set to anything
> like what I normally see in these dudes..
>
> Now I did put an extra global SCTP lock in before starting/stopping
> the timer. This did make it take 2-3 days to hit this case.. but
> it still happens....
>
> Has anyone seen this ?? I have looked at the timer code and I
> do see a mutex spin lock.. but I can't see how SCTP would be
> causing this... I am stumped .. any suggestions would be welcome ;-)
>
> --------------------
>
> My second problem is even more bizarre.. if that's possible...:-D
>
> The other machine (the p4d) runs along fine for a day or so .. and
> then it will just stop.. and I mean stop.. if you have a top window
> up (I have x off.. to panic it when I want :-D) you see the time
> frozen. No updates... it just freezes...
>
> If you type in anything.. the machine picks up again and starts
> running as if nothing had happened. The last time I created
> this the time had been frozen for at least 12 hours before
> I got to it :-D
>
> I dropped directly into KDB and pulled a crash dump...
>
> Looking at all of the SCTP assoc's there was NOTHING
> happening.. no data in flight.. nothing...
>
> Now in the past, typing any key or changing to another set
> of windows ... and ta-da.. off it runs..
>
> I do see a few TCP timeout alarms in the app (remember
> the app talks TCP to set up the SCTP stuff)...
>
> Very weird...
>
> Any ideas or suggestions would be welcome.
>
> I just did an update in prep for doing a patch (currently
> passed to gnn for approval).. so my cores are invalid.. but
> I can recreate them pretty easily .. it just takes a
> day or so :-)
>
> I can also let anyone who is interested onto the machine
> when problem one occurs... and let them peruse the timers
> or whatever of the running kernel.. or take a crash dump
> and let you look at that..
>
> If anyone has heard of anything like this I would appreciate
> some pointers.. it could be something SCTP is doing... at
> least the timer one..
>
> Thanks for any suggestions..
>
> R
>

-- 
Randall Stewart
NSSTG - Cisco Systems Inc.
803-345-0369
803-317-4952 (cell)
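
Note on the c_flags value in the quoted callout dump: the message reads 22 as NO_MTX | PENDING and ACTIVE. A minimal sketch of that decoding in C is below. The bit values (0x02, 0x04, 0x10) and the FLAG_* names are assumptions chosen to match the wording in the message; they are not copied from sys/callout.h, so check the header of the kernel in question before relying on them. describe_flags and append_name are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASSUMED bit values; names follow the wording in the message,
 * not necessarily the identifiers used in sys/callout.h. */
#define FLAG_ACTIVE  0x02u
#define FLAG_PENDING 0x04u
#define FLAG_NO_MTX  0x10u

/* Append name to buf, separated by '|', never overflowing len bytes. */
static void append_name(char *buf, size_t len, const char *name)
{
    if (buf[0] != '\0')
        strncat(buf, "|", len - strlen(buf) - 1);
    strncat(buf, name, len - strlen(buf) - 1);
}

/* Hypothetical helper: write a "A|B|C" description of flags into buf
 * and return buf. Bits not in the table are reported as UNKNOWN. */
static char *describe_flags(unsigned int flags, char *buf, size_t len)
{
    static const struct {
        unsigned int bit;
        const char *name;
    } table[] = {
        { FLAG_ACTIVE,  "ACTIVE"  },
        { FLAG_PENDING, "PENDING" },
        { FLAG_NO_MTX,  "NO_MTX"  },
    };
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (flags & table[i].bit) {
            append_name(buf, len, table[i].name);
            flags &= ~table[i].bit;   /* clear the bit once reported */
        }
    }
    if (flags != 0)                   /* leftover bits we have no name for */
        append_name(buf, len, "UNKNOWN");

    return buf;
}
```

With these assumed values, 22 (0x16 = 0x10 + 0x04 + 0x02) decodes to ACTIVE|PENDING|NO_MTX, which matches the reading given in the message.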