From: Bill Paul
Message-Id: <199809220231.WAA25016@skynet.ctr.columbia.edu>
Subject: Strange behavior with ARP and IP fragmentation
To: current@FreeBSD.ORG, freebsd-net@FreeBSD.ORG
Cc: wollman@FreeBSD.ORG
Date: Mon, 21 Sep 1998 22:31:13 -0400 (EDT)

Hello:

For those who don't know, I've been working on yet another fast
ethernet driver lately for the RealTek 8139 chip. This chip sucks, but
that's not why I'm writing. Today, while running some tests, I noticed
some odd IP fragmentation behavior which I thought was due to a bug in
my driver code, but I've since been able to duplicate the problem on
another machine with a 3c509 card using the ep driver. This has me a
little confused.

Here's the deal: one of the tests I do involves sending ICMP datagrams
with ping using various payload sizes (using the -s flag). By using a
packet size larger than 1500 bytes, I can get the system to queue up a
small number of ethernet frames fairly quickly and observe the result.
This lets me see if the driver is transmitting rapidly queued sequences
of frames correctly. I use the -c flag with ping to limit the number of
packets so that I can check short bursts of frames rather than a huge
stream.
(Watching a massive bunch of frames fly through tcpdump at 100Mbps
makes it hard to spot glitches.)

One thing I do a lot is this:

# ifconfig 10.0.0.2 netmask 0xffffff00 up
# ping -c 1 -s 4096 10.0.0.1

10.0.0.1 is another machine attached to the interface under test using
a crossover cable. I run tcpdump on this host to monitor traffic from
the first machine so I can see what the NIC is sending. Assuming the
system has just been booted, the 10.0.0.2 host will not yet have an ARP
entry for the 10.0.0.1 host, so the sequence should go something like
this:

10.0.0.2: sends an ARP request for 10.0.0.1
10.0.0.1: sends an ARP reply to 10.0.0.2
10.0.0.2: sends the first fragment of an ICMP echo request which should
          be about 1514 bytes long. The ICMP packet is fragmented since
          4096 bytes is larger than the interface MTU of 1500 bytes.
10.0.0.2: sends the next fragment, also of 1514 bytes
10.0.0.2: sends the last fragment, somewhere in the neighborhood of
          1068 bytes
10.0.0.1: sends the first fragment of an ICMP echo reply. Again, the
          fragmentation occurs because the reply is also 4096 bytes.
10.0.0.1: sends the next frag
10.0.0.1: sends the last frag

At this point, ping reports that the reply was received and all is
happy and there is much rejoicing.

Not. What I observed is that the ARP request and ARP reply proceed as
expected, but the first portion of the ICMP packet transmitted is in
fact the last fragment. The first two fragments have vanished into the
void. Since the ICMP echo request is contained in the first fragment,
the host on the other side discards the fragment and never sends a
reply. The result is that 'ping -c 1 -s 4096 10.0.0.1' just sits there
and no reply is ever received. On the other hand, sending a second ICMP
request immediately after the first does work.

Below is a tcpdump capture of an actual exchange between two machines.
Harpsichord is a Micron Pentium Pro 200MHz machine with a 3Com 3c509
ethernet adapter running FreeBSD 2.2.6.
Sax is an IBM RS/6000 model 390 running AIX 4.1.4.

First, I run tcpdump on harpsichord to capture the session:

[/homes/rwpaul]:harpsichord{1}# tcpdump -n -e -i ep0 host sax and harpsichord
tcpdump: listening on ep0

Now I type 'ping -c 1 -s 4096 sax' on harpsichord. Note: there is no
ARP entry for sax on harpsichord at this point. The resulting exchange
is shown below:

21:41:03.105011 0:60:97:6c:6f:b0 ff:ff:ff:ff:ff:ff 0806 42: arp who-has 128.59.68.56 tell 128.59.68.72
21:41:03.105338 10:0:5a:fa:4e:9e 0:60:97:6c:6f:b0 0806 60: arp reply 128.59.68.56 is-at 10:0:5a:fa:4e:9e
21:41:03.105970 0:60:97:6c:6f:b0 10:0:5a:fa:4e:9e 0800 1178: 128.59.68.72 > 128.59.68.56: (frag 15401:1144@2960)

Note that the only part of the ICMP datagram to make it out the door is
the final fragment. This fails to elicit a response from the RS/6000,
so the ping times out.

Now I issue the same ping command to send another 4096 byte ICMP
request. This time, an ARP entry for sax exists on harpsichord, so no
ARP packets are sent, and everything looks normal:

21:41:19.647643 0:60:97:6c:6f:b0 10:0:5a:fa:4e:9e 0800 1514: 128.59.68.72 > 128.59.68.56: icmp: echo request (frag 15424:1480@0+)
21:41:19.648423 0:60:97:6c:6f:b0 10:0:5a:fa:4e:9e 0800 1514: 128.59.68.72 > 128.59.68.56: (frag 15424:1480@1480+)
21:41:19.649053 0:60:97:6c:6f:b0 10:0:5a:fa:4e:9e 0800 1178: 128.59.68.72 > 128.59.68.56: (frag 15424:1144@2960)
21:41:19.652758 10:0:5a:fa:4e:9e 0:60:97:6c:6f:b0 0800 1514: 128.59.68.56 > 128.59.68.72: icmp: echo reply (frag 12732:1480@0+)
21:41:19.654060 10:0:5a:fa:4e:9e 0:60:97:6c:6f:b0 0800 1514: 128.59.68.56 > 128.59.68.72: (frag 12732:1480@1480+)
21:41:19.655099 10:0:5a:fa:4e:9e 0:60:97:6c:6f:b0 0800 1178: 128.59.68.56 > 128.59.68.72: (frag 12732:1144@2960)

I originally observed this behavior on a 3.0CAM snapshot with my not
quite complete (but largely functional) RealTek driver; however, it
appears to manifest itself on 2.2.x too.
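
As a cross-check on the fragment sizes in the captures above, here is a
minimal Python sketch of the fragmentation arithmetic. It is an
illustration only, not kernel code. The header sizes are assumptions
(20-byte IP header with no options, 8-byte ICMP header, 14-byte
ethernet header, frame check sequence not counted), and the function
names are made up for this example.

```python
# Sketch: how a ping payload splits into IP fragments on ethernet.
# Assumed sizes (not stated in the message): IP header without options,
# ICMP echo header, ethernet header; the trailing FCS is not counted.
ETHER_HDR = 14
IP_HDR = 20
ICMP_HDR = 8


def fragments(ping_payload, mtu=1500):
    """Return a list of (offset, data_len, more_fragments, frame_len)."""
    total = ping_payload + ICMP_HDR      # bytes carried inside the IP datagram
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_frag = (mtu - IP_HDR) // 8 * 8
    result = []
    offset = 0
    while offset < total:
        data_len = min(per_frag, total - offset)
        more = offset + data_len < total
        frame_len = data_len + IP_HDR + ETHER_HDR
        result.append((offset, data_len, more, frame_len))
        offset += data_len
    return result


def describe(ping_payload, mtu=1500):
    """Format fragments tcpdump-style: len@offset, with '+' if more follow."""
    return ["%d@%d%s (%d-byte frame)" % (n, off, "+" if more else "", frame)
            for off, n, more, frame in fragments(ping_payload, mtu)]


if __name__ == "__main__":
    for line in describe(4096):
        print(line)
```

For 'ping -s 4096' this yields 1480@0+, 1480@1480+ and 1144@2960, with
frame lengths of 1514, 1514 and 1178 bytes, which agrees with the
working exchange in the capture. The lone packet seen in the failing
exchange (1144@2960) is therefore the third and final fragment.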

I'm at a loss to explain what's going on here, but something's clearly
wrong. For a while I was convinced that my driver was at fault, but
after adding some debug code I realized that the transmit start routine
was only being called with one fragment, so the other fragments weren't
even making it to the device driver stage. This is further evidenced by
the fact that I can reproduce the problem on 2.2.6 with a totally
different driver. I have no idea if this behavior goes all the way back
to 2.1.x.

Note that larger ICMP datagram sizes will also trigger the behavior: on
FreeBSD 3.0, I was able to specify a size of 8100 bytes without ping
complaining, but again only the last fragment of the first datagram
gets transmitted (subsequent datagrams sent after the ARP request/reply
exchange are sent properly).

If anybody has any insights on this, I'd love to hear them. I really
don't want to wade through TCP/IP Illustrated Vol. II trying to track
this down.

-Bill

--
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home:  wpaul@skynet.ctr.columbia.edu | Columbia University, New York City
=============================================================================
 "It is not I who am crazy; it is I who am mad!" - Ren Hoek, "Space Madness"
=============================================================================