From: Bakul Shah
To: Matthew Dillon
Cc: freebsd-current@freebsd.org, Julian Elischer
Subject: Re: tcp over slow links broken?
In-reply-to: Your message of "Sun, 11 May 2008 12:07:34 PDT." <200805111907.m4BJ7YE7005447@apollo.backplane.com>
Date: Sun, 11 May 2008 23:56:25 -0700
Message-Id: <20080512065626.6858F5B75@mail.bitblocks.com>

On Sun, 11 May 2008 12:07:34 PDT Matthew Dillon wrote:
> Hmm. It looks like C has gone deaf, not B. B is retransmitting from
> sequence 4744 which is the last sequence that C acked. C is then not
> acking any further packets.

Yes indeed.

> 14:22:42.411144 IP B.55535 > C.ssh: . 7664:9124(1460) ack 2016 win 65535
> 14:22:42.411259 IP B.55535 > C.ssh: . 9124:10584(1460) ack 2016 win 65535
> 14:22:42.468350 IP C.ssh > B.55535: . ack 4744 win 65535
> 14:22:42.490556 IP C.ssh > B.55535: . ack 4744 win 65535
> 14:22:42.830171 IP B.55535 > C.ssh: . 4744:6204(1460) ack 2016 win 65535
...
>
> This sounds like a packet filter state issue. My guess is that
> PF running on B is getting confused. Either PF is getting confused,
> or the packet is getting munged somehow to the point where PF refuses
> to bridge it.

I had already tried this.

> The A->C path (the one that is working) is going through PF's NAT rules.
> The B->C path is probably going through a different set of PF rules.
>
> I suggest capturing a trace on C to see if C is actually receiving
> B's retransmissions.

This evening, thanks to my friend Rob Warnock's help, this finally
got narrowed down quite a bit. We captured a trace on C and saw that
it was not seeing the packet carrying the [4744:6204) data range, or
any of its retransmits. But this was a perfectly valid packet on B
(verified with tcpdump -v plus manual header checksumming).

Then Rob recalled having run across mbuf alignment issues in the
past, so to check for that I swapped NICs around. The problem stayed
with the NIC, an old DEC 21140 card! So this was not related to pf or
to a slow link; it is most likely due to mbuf misalignment (IIRC de
requires aligned mbufs).

There was just one commit to if_de.c this past April. Perhaps this
is a side effect of that commit (bpf is not given a packet during
device attach), or perhaps of some change elsewhere.

Thanks for your and Julian's help!
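
For anyone wanting to repeat the manual header check mentioned above, here is a minimal sketch, assuming the standard Internet checksum (RFC 1071) applied to an IPv4 header. The function names and the sample header bytes are illustrative only; they are not taken from the trace in this thread.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """Return the RFC 1071 checksum: the one's complement of the
    one's-complement sum of the data taken as 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header_ok(header: bytes) -> bool:
    """A header whose checksum field is correct checksums to zero."""
    return internet_checksum(header) == 0


# Illustrative 20-byte IPv4 header (hypothetical, not from the trace).
# The checksum field (bytes 10-11) holds 0xb861.
sample = bytes.fromhex("450000730000400040 11b861c0a80001c0a800c7")
print(ipv4_header_ok(sample))
```

The same routine covers the TCP checksum if the pseudo-header (source address, destination address, protocol, TCP length) is prepended to the segment before summing.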