From: Pyun YongHyeon
Date: Fri, 15 Oct 2010 17:56:24 -0700
To: Melissa Jenkins
Cc: freebsd-net@freebsd.org
Subject: Re: NFE adapter 'hangs'
Message-ID: <20101016005624.GH26174@michelle.cdnetworks.com>
In-Reply-To: <9BBD5E0C-06D3-4FA5-B85C-5256DA3AD483@littlebluecar.co.uk>
References: <5C261F16-6530-47EE-B1C1-BA38CD6D8B01@littlebluecar.co.uk>
 <20100902194940.GH21940@michelle.cdnetworks.com>
 <20100904005349.GP21940@michelle.cdnetworks.com>
 <9BBD5E0C-06D3-4FA5-B85C-5256DA3AD483@littlebluecar.co.uk>
Reply-To: pyunyh@gmail.com

On Fri, Oct 15, 2010 at 01:25:08PM +0100, Melissa Jenkins wrote:
>
> On 4 Sep 2010, at 01:53, Pyun YongHyeon wrote:
>
> > On Fri, Sep 03, 2010 at 07:59:26AM +0100, Melissa Jenkins wrote:
> >>
> >> Thank you for your very quick response :)
> >>
> >
> > [...]
> >
> >>> Also I'd like to know whether both RX and TX are dead or only one
> >>> RX/TX path is hung. Can you see incoming traffic with tcpdump when
> >>> you think the controller is in stuck?
> >>
> >> Yes, though not very much. The traffic to 4800 is every second so you can see in the following trace when it stops
> >>
> >> 07:10:42.287163 IP 192.168.1.203 > 224.0.0.240: pfsync 108
> >> 07:10:42.911995
> >> 07:10:43.112073 STP 802.1d, Config, Flags [Topology change], bridge-id 8000.c4:7d:4f:a9:ac:30.8008, length 43
> >> 07:10:43.148659 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length 60
> >> 07:10:43.148684 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60
> >> 07:10:43.148689 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60
> >> 07:10:43.148918 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length 48
> >
> > [...]
> >
> >> a bit later on, still broken, a slight odd message:
> >> 07:11:43.079720 IP 172.31.1.129 > 172.31.1.213: GREv0, length 52: IP 192.168.1.129.60446 > 192.168.1.213.179: tcp 12 [bad hdr length 16 - too short, < 20]
> >> 07:11:44.210794 IP 172.31.1.129 > 172.31.1.203: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.203.4800: UDP, length 52
> >> 07:11:44.210831 IP 172.31.1.129 > 172.31.1.213: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.213.4800: UDP, length 52
> >>
> >> Now this really is odd, I don't recognise either of those MAC addresses, though the SQL shown is used on this machine (
> >> 07:12:13.054393 45:43:54:20:41:63 > 00:00:03:53:45:4c, ethertype Unknown (0x6374), length 60:
> >>     0x0000:  556e 6971 7565 4964 2046 524f 4d20 7261  UniqueId.FROM.ra
> >>     0x0010:  6461 6363 7420 2057 4845 5245 2043 616c  dacct..WHERE.Cal
> >>     0x0020:  6c69 6e67 5374 6174 696f 6e49 6420       lingStationId.
> >
> > Hmm, it seems you're using really complex setup. It's very hard to
> > narrow down guilty ones under these environments. Could you setup
> > simple network configuration that reproduces the issue? One of
> > possible cause would be wrong(garbled) data might be passed up to
> > upper stack. But I have no idea why you see GRE packets with
> > truncated TCP header(172.31.1.129 > 172.31.1.213).
> > How about disabling TX/RX checksum offloading as well as TSO?
> >
> > [...]
> >
> >>
> >> I then restarted the interface (nfe down/up, route restart)
> >>
> >> From dmesg at the time (slight obfuscated)
> >> Sep 3 07:10:19 manch2 bgpd[89612]: neighbor XX: received notification: HoldTimer expired, unknown subcode 0
> >> Sep 3 07:10:49 manch2 bgpd[89612]: neighbor XX connect: Host is down
> >> # at this point I took the interface down & up and reloaded the routing tables
> >> Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
> >> Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
> >> Sep 3 07:12:07 manch2 kernel: nfe0: link state changed to DOWN
> >> Sep 3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
> >> Sep 3 07:12:11 manch2 kernel: nfe0: link state changed to UP
> >> Sep 3 07:12:11 manch2 kernel: carp0: link state changed to DOWN
> >> Sep 3 07:12:14 manch2 kernel: carp0: link state changed to UP
> >
> > Hmm, it does not look right, carp0 showed link DOWN message four
> > times in a row.
> > By the way, are you using IPMI on MCP55? nfe(4) is not ready to
> > handle MAC operation with IPMI.
> >
>
> Turning off tx & rc checksum offloading seems to have resolved the problem:
>
> ifconfig nfe0 -txcsum -rxcsum
>
> Seems to have stopped both the corruption and the interface hanging. I ran it for about 16 hours on the FreeBSD 8 box. It also appears to have fixed the problem on my FreeBSD 7 machine as well.
>

Hmm, could you try the patch at the following URL?
http://people.freebsd.org/~yongari/nfe/nfe.mcp55.txcsum.patch

The patch ensures that the first fragment of the mbuf chain holds the
ip/tcp/udp header including options. If this patch fixes the issue
then it means there is an issue in TX checksum offloading on MCP55.
But I'm still not sure whether it makes any difference because
there was no report on broken TX checksum offloading on nfe(4). At
least I don't remember that kind of report so far.
Note, the patch was not tested at all. I no longer have access to
nfe(4) controllers, so please make sure to test it first before
applying the patch.

> I didn't try turning off TSO.
>

Ok, your tcpdump shows garbled packets for non-TSO frames, so the
patch above does no special handling for the TSO case.

> Thank you for your suggestion & help!
> Mel
>
>
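
For readers following the thread, the requirement stated above (the first fragment must hold the IP and TCP/UDP headers, options included) can be modeled in user space. This is only a sketch under stated assumptions, not the contents of the patch and not kernel code: fragments are plain byte strings rather than mbufs, only untagged IPv4 over Ethernet is handled, and the names `header_length` and `pullup_headers` are hypothetical.

```python
import struct
from typing import List

ETHER_HDR_LEN = 14      # Ethernet header without a VLAN tag (assumption)
ETHERTYPE_IP = 0x0800
IPPROTO_TCP = 6
IPPROTO_UDP = 17


def header_length(frame: bytes) -> int:
    """Return the size of the Ethernet + IPv4 + TCP/UDP headers, options included."""
    if len(frame) < ETHER_HDR_LEN + 20:
        raise ValueError("frame too short for Ethernet + IPv4 headers")
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IP:
        raise ValueError("not an IPv4 frame")
    ip_hlen = (frame[ETHER_HDR_LEN] & 0x0F) * 4   # IHL counts 32-bit words
    if ip_hlen < 20:
        raise ValueError("bad IPv4 header length")
    proto = frame[ETHER_HDR_LEN + 9]
    l4_off = ETHER_HDR_LEN + ip_hlen
    if proto == IPPROTO_UDP:
        return l4_off + 8                         # UDP header is fixed size
    if proto == IPPROTO_TCP:
        if len(frame) < l4_off + 13:
            raise ValueError("frame too short for TCP header")
        tcp_hlen = (frame[l4_off + 12] >> 4) * 4  # data offset, 32-bit words
        if tcp_hlen < 20:
            raise ValueError("bad TCP header length")
        return l4_off + tcp_hlen
    return l4_off                                 # other protocols: IP header only


def pullup_headers(fragments: List[bytes]) -> List[bytes]:
    """Return a fragment list whose first element contains every header byte."""
    frame = b"".join(fragments)
    hlen = header_length(frame)
    if hlen > len(frame):
        raise ValueError("headers extend past end of frame")
    if len(fragments[0]) >= hlen:
        return list(fragments)                    # headers already contiguous
    first = bytearray(fragments[0])
    rest = list(fragments[1:])
    # Move bytes from the following fragments until the first one is big enough.
    while len(first) < hlen:
        need = hlen - len(first)
        head = rest[0]
        first += head[:need]
        if len(head) > need:
            rest[0] = head[need:]
        else:
            rest.pop(0)
    return [bytes(first)] + rest
```

The payload bytes are never changed here, only where the fragment boundaries fall. In a driver the analogous step would be done on the mbuf chain before the descriptor with the checksum-offload flags is set up.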