From owner-freebsd-stable@FreeBSD.ORG Thu Nov 11 21:37:23 2010
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2A21D106564A;
	Thu, 11 Nov 2010 21:37:23 +0000 (UTC)
	(envelope-from oberman@es.net)
Received: from mailgw.es.net (mail1.es.net [IPv6:2001:400:201:1::2])
	by mx1.freebsd.org (Postfix) with ESMTP id 149648FC17;
	Thu, 11 Nov 2010 21:37:23 +0000 (UTC)
Received: from ptavv.es.net (ptavv.es.net [IPv6:2001:400:910::29])
	by mailgw.es.net (8.14.3/8.14.3) with ESMTP id oABLbM8p009216
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT);
	Thu, 11 Nov 2010 13:37:22 -0800
Received: from ptavv.es.net (localhost [127.0.0.1])
	by ptavv.es.net (Tachyon Server) with ESMTP id 55BC91CC12;
	Thu, 11 Nov 2010 13:37:22 -0800 (PST)
To: pyunyh@gmail.com
In-reply-to: Your message of "Thu, 11 Nov 2010 13:04:36 PST."
	<20101111210436.GD17566@michelle.cdnetworks.com>
Date: Thu, 11 Nov 2010 13:37:22 -0800
From: "Kevin Oberman"
Message-Id: <20101111213722.55BC91CC12@ptavv.es.net>
Cc: freebsd-stable@freebsd.org, Kirill Yelizarov, net@freebsd.org
Subject: Re: icmp packets on em larger than 1472 [SEC=UNCLASSIFIED]
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
X-List-Received-Date: Thu, 11 Nov 2010 21:37:23 -0000

> From: Pyun YongHyeon
> Date: Thu, 11 Nov 2010 13:04:36 -0800
>
> On Thu, Nov 11, 2010 at 08:10:57AM -0800, Kevin Oberman wrote:
> > > Date: Wed, 10 Nov 2010 23:49:56 -0800 (PST)
> > > From: Kirill Yelizarov
> > >
> > > --- On Thu, 11/11/10, Kevin Oberman wrote:
> > >
> > > > From: Kevin Oberman
> > > > Subject: Re: icmp packets on em larger than 1472 [SEC=UNCLASSIFIED]
> > > > To: "Wilkinson, Alex"
> > > > Cc: freebsd-stable@freebsd.org
> > > > Date: Thursday, November 11, 2010,
> > > > 8:26 AM
> > > > > Date: Thu, 11 Nov 2010 13:01:26 +0800
> > > > > From: "Wilkinson, Alex"
> > > > > Sender: owner-freebsd-stable@freebsd.org
> > > > >
> > > > >     0n Wed, Nov 10, 2010 at 04:21:12AM -0800, Kirill Yelizarov wrote:
> > > > >
> > > > >     >All my em cards running 8.1 stable don't reply to icmp echo requests packets larger than 1472 bytes.
> > > > >     >
> > > > >     >On stable 7.2 the same hardware works as expected:
> > > > >     ># ping -s 1500 192.168.64.99
> > > > >     >PING 192.168.64.99 (192.168.64.99): 1500 data bytes
> > > > >     >1508 bytes from 192.168.64.99: icmp_seq=0 ttl=63 time=1.249 ms
> > > > >     >1508 bytes from 192.168.64.99: icmp_seq=1 ttl=63 time=1.158 ms
> > > > >     >
> > > > >     >Here is the dump on em interface
> > > > >     >15:06:31.452043 IP 192.168.66.65 > *****: ICMP echo request, id 28729, seq 5, length 1480
> > > > >     >15:06:31.452047 IP 192.168.66.65 > ****: icmp
> > > > >     >15:06:31.452069 IP **** > 192.168.66.65: ICMP echo reply, id 28729, seq 5, length 1480
> > > > >     >15:06:31.452071 IP *** > 192.168.66.65: icmp
> > > > >     >
> > > > >     >Same ping from same source (it's a 8.1 stable with fxp interface) to em card running 8.1 stable
> > > > >     >#pciconf -lv
> > > > >     >em0@pci0:3:4:0:    class=0x020000 card=0x10798086 chip=0x10798086 rev=0x03 hdr=0x00
> > > > >     >    vendor     = 'Intel Corporation'
> > > > >     >    device     = 'Dual Port Gigabit Ethernet Controller (82546EB)'
> > > > >     >    class      = network
> > > > >     >    subclass   = ethernet
> > > > >     >
> > > > >     ># ping -s 1472 192.168.64.200
> > > > >     >PING 192.168.64.200 (192.168.64.200): 1472 data bytes
> > > > >     >1480 bytes from 192.168.64.200: icmp_seq=0 ttl=63 time=0.848 ms
> > > > >     >^C
> > > > >     >
> > > > >     ># ping -s 1473 192.168.64.200
> > > > >     >PING 192.168.64.200 (192.168.64.200): 1473 data bytes
> > > > >     >^C
> > > > >     >--- 192.168.64.200 ping statistics ---
> > > > >     >4 packets transmitted, 0 packets received, 100.0% packet loss
> > > > >
> > > > > works fine for me:
> > > > >
> > > > > FreeBSD 8.1-STABLE #0 r213395
> > > > >
> > > > > em0@pci0:0:25:0:    class=0x020000 card=0x3035103c chip=0x10de8086 rev=0x02 hdr=0x00
> > > > >     vendor     = 'Intel Corporation'
> > > > >     device     = 'Intel Gigabit network connection (82567LM-3 )'
> > > > >     class      = network
> > > > >     subclass   = ethernet
> > > > >
> > > > > #ping -s 1473 host
> > > > > PING host(192.168.1.1): 1473 data bytes
> > > > > 1481 bytes from 192.168.1.1: icmp_seq=0 ttl=253 time=31.506 ms
> > > > > 1481 bytes from 192.168.1.1: icmp_seq=1 ttl=253 time=31.493 ms
> > > > > 1481 bytes from 192.168.1.1: icmp_seq=2 ttl=253 time=31.550 ms
> > > > > ^C
> > > >
> > > > The reason the '-s 1500' worked was that the packets were fragmented. If
> > > > I add the '-D' option, '-s 1473' fails on v7 and v8. Are the v8 systems
> > > > where you see it failing without the '-D' on the same network segment?
> > > > If not, it is likely that an intervening device is refusing to fragment
> > > > the packet. (Some routers deliberately don't fragment ICMP Echo Request
> > > > packets.)
> > >
> > > If I set -D -s 1473, the sender side refuses to ping, and that is
> > > correct. All the machines mentioned above are behind the same router and
> > > switch. The same hardware running v7 is working while v8 is not. And I
> > > never saw such problems before. Also, correct me if I'm wrong, but the
> > > dump shows that the packet arrived.
> > > I'll try the driver from head and will post the results here.
> >
> > I did a bit more looking at this today and I see that something bogus is
> > going on and it MAY be the em driver.
> >
> > I tried 1473 data byte pings without the DF flag. I then captured the
> > packets on both ends (the sending system has a bge (Broadcom GE) card
> > and the responding end has an em (Intel) card).
> >
> > What I saw was the fragmented IP packets all being received by the
> > system with the em interface and an ICMP Echo Reply being sent back,
> > again fragmented. I saw the reply on both ends, so both interfaces were
> > able to fragment an over-sized packet, transmit the two pieces, and
> > receive the two pieces. The em device could re-assemble them properly,
> > but the bge device does not seem to re-assemble them correctly or else
> > has a problem with ICMP packets bigger than the MTU size.
> >
> > When I send from the em system, I see the packets and fragments all
> > arrive in good form, but the system never sends out a reply. Since this
> > is a kernel function, it may be a driver, but I suspect that it is in
> > the IP stack since I am seeing the problem with a Broadcom card and I
> > see the data all arriving.
> >
>
> Most ethernet controllers, including bge(4), have a function to
> specify how much RX buffer space is allocated to receive a frame.
> When the controller receives a frame larger than the size specified
> for the RX buffer, it drops the frame. Because the oversized frame
> is silently dropped in the driver layer, the upper stack has no
> chance to send back an ICMP "fragmentation needed" response for
> frames that have the don't-fragment bit set. This is where correct
> MTU configuration plays an important role in the driver layer. If
> you want to handle oversized frames, you also have to set the
> correct MTU on the interface.
> However, all controllers should be able to receive standard
> MTU-sized frames including a VLAN tag, so no special configuration
> is needed when you handle standard MTU-sized frames. Some old
> controllers can't handle VLAN oversized frames, so you would have
> no way to send or receive them.
>
> em(4) controllers have different receive logic that allows
> chaining multiple oversized frames into a single frame. So up to a
> certain point, which depends on the jumbo frame size the controller
> supports, em(4) can receive these oversized frames regardless of
> MTU configuration with the help of the driver. The chaining is done
> in the driver layer and that adds overhead (chaining + multiple
> mbuf allocation), but it has its own advantages.
>
> I was not able to reproduce the issue with em(4)/bge(4) on
> CURRENT and these drivers worked as expected.

I don't have any systems running CURRENT at the moment, so I can't check
it out. I hope it is fixed there, but it needs to be fixed in STABLE.

Not fragmenting packets that will not fit in a standard frame is a very
serious issue as, when the frame is dropped, the source re-transmits the
same over-sized frame. Of course, this should not happen if the
interface is set to an MTU of 1500, as the higher layers should never
pass a block of data larger than 1480 bytes to the IP layer. That's the
only reason this had not already been noticed.
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman@es.net			Phone: +1 510 486-8634
Key fingerprint:059B 2DDF 031C 9BA3 14A4  EADA 927D EBB3 987B 3751
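
For reference, the arithmetic behind the 1472-byte boundary in this thread can be sketched roughly as below. This is only an illustration, assuming IPv4 without options (a 20-byte header) and the 8-byte ICMP echo header. The function names are invented for this sketch and are not taken from ping(8) or any FreeBSD source.

```python
IP_HEADER = 20    # bytes; assumes an IPv4 header with no options
ICMP_HEADER = 8   # bytes; ICMP echo request/reply header


def max_unfragmented_ping_payload(mtu):
    """Largest 'ping -s' value that still fits in one IP packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def fragment_sizes(ping_payload, mtu):
    """IP payload length carried by each fragment of one echo request."""
    remaining = ping_payload + ICMP_HEADER   # ICMP header + data
    max_payload = mtu - IP_HEADER            # room left after the IP header
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = max_payload // 8 * 8
    sizes = []
    while remaining > max_payload:
        sizes.append(per_fragment)
        remaining -= per_fragment
    sizes.append(remaining)
    return sizes


if __name__ == "__main__":
    print(max_unfragmented_ping_payload(1500))  # 1472
    print(fragment_sizes(1472, 1500))           # [1480]
    print(fragment_sizes(1473, 1500))           # [1480, 1]
    print(fragment_sizes(1500, 1500))           # [1480, 28]
```

With an MTU of 1500, '-s 1472' is the largest ping that fits in a single packet, and '-s 1473' already needs a second fragment carrying one byte. The 1480-byte first fragment matches the "length 1480" seen in the tcpdump output quoted above.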