From: Weldon S Godfrey 3
Date: Mon, 2 Nov 2009 16:48:31 -0500 (EST)
To: freebsd-current@freebsd.org
Subject: Re: FreeBSD 8.0 - network stack crashes?
List-Id: Discussions about the use of FreeBSD-current

If memory serves me right, sometime around 4:11pm, Weldon S Godfrey 3 told me:

>
>
> If memory serves me right, sometime around 10:52am, Weldon S Godfrey 3 told
> me:
>
>>
>> Up until yesterday, we had been running FreeBSD-CURRENT of 12/08. A couple
>> of months ago we started to see some very odd network behavior. Something
>> happens to the stack that causes processes accessing the network to just
>> hang. After the problem happens, usually (but not always), you can't ssh
>> in. Always, you can't ssh or telnet out, and nothing can access the NFS
>> shares on the server. You can ping everything from the server.
>> You can't even do a route add, and you can't ssh even if you use just the
>> IP address (although pinging hostnames that it doesn't have cached or in
>> the hosts table still resolves). When you try to ssh out or do a route add
>> from the box, the process just hangs. You can't Ctrl-C it at all; it hangs
>> forever. There is nothing in dmesg or messages to indicate an issue. I have
>> tried to up/down the interfaces. In CURRENT-12/08, that may allow things to
>> work for about 30s.
>>
>> We upgraded to 8.0-RC2 yesterday and, at first, the problem appeared to
>> happen a lot more often. We suspected that was related to the increase in
>> network performance. At least in 8.0-RC2, I did see a large number of input
>> errors with netstat -in on the heavily loaded interface before it started
>> the locking-up behavior. I have replaced the Ethernet cable and moved ports.
>> The Catalyst 3650 never records any errors. The problem would recur within
>> about 5 minutes once our load kicked in this morning.
>>
>>
>> One change in this upgrade: we switched from NFS v2 to v3. When we
>> downgraded to the previous OS, we stayed at v3. The problem was just about
>> as bad with v3 on the 12/08 OS.
>>
>> We went back to RC2 with NFS v2 and it appeared to stabilize to a degree.
>> It ran for about an hour and a half and then the issue came up again.
>>
>> We are currently back on the 12/08 version using NFS v2 and watching things.
>>
>> We are using a Dell PowerEdge 2950-iii. The problem happens both when using
>> the onboard NICs with the bce driver and with an Intel card using the em
>> driver.
>>
>> I am hunting down any MTU/duplex/speed problems that could cause it (haven't
>> found any so far). Of course, any problems on the network wouldn't
>> (ideally) freak out the network stack on the server. I don't know how to
>> troubleshoot this further on the server since I am not getting any problems
>> indicated in logging, panics, cores, etc.
>>
>> Any help is appreciated.
>>
>
>
> I have swapped out the computer, switch, Ethernet card, and 3ware card. We
> are running 8.0-CURRENT 12/08, which is what we were using with far fewer
> issues. No help.
>
> If it happens again, I am going to try a netif restart and a routing
> restart, although I believe I tried that at the beginning and it did not
> help.
>

BTW, doing a netif / routing restart doesn't help.