From: Bakul Shah
To: Peter Jeremy
Cc: freebsd-net@freebsd.org
Date: Thu, 03 Jul 2008 12:05:13 -0700
Subject: Re: arplookup x.x.x.x failed: host is not on local network
In-reply-to: Your message of "Thu, 03 Jul 2008 21:52:43 +1000." <20080703115243.GR29380@server.vk2pj.dyndns.org>
Message-Id: <20080703190513.5CD5D5B4C@mail.bitblocks.com>
List-Id: Networking and TCP/IP with FreeBSD

> Possibly, I'm seeing packet leakage from the switches and that is
> confusing FreeBSD - definitely the first packet above should not be
> visible.

Even if the switch broadcasts on all ports (effectively becoming a hub), that should not cause the symptom you are seeing. If the switch had sent the ARP response to the wrong port, you would still have seen this ARP request, at least on the sending machine. There is no such packet (for .26) in your tcpdump output. That means either there was no such packet or you failed to capture it!

You said you see the problem with different FreeBSD versions. Did you boot different versions on the same hardware, or do you mean different versions are running on different hosts?
If the latter, specific FreeBSD versions are not ruled out. You might want to capture many more "arplookup ... failed" messages to see if there is a pattern.

Earlier you had wondered if resource exhaustion was to blame. The arplookup failed message rules that out: its reason ("host is not on local network") says the route found for that address goes to a gateway, whereas an allocation failure would have been reported with a different reason.

We don't see any ARP request for .26, so this likely means .26 is not the one doing the ARP lookup (on receiving a request) and the arplookup failed message is on .111, right? We see packets flowing from .26 to .111 but not the other way around. What does netstat -nr look like on .111?

If all the clocks are synchronized, you might want to capture tcpdump on *all* the machines! Since syslog timestamps have a granularity of 1 sec, you want to look at packets from a second before to a second after the logged time.

BTW, your picture is nice but it doesn't jibe with anything in the tcpdump output you attached!

>                             Corp Network
>        192.168.10.0/24            |            192.168.12.0/24
> +-----+-------------+---------|   |   |---------+-------------+-----+
>     .1|           .2|     .254|   |   |.254   .3|           .4|
>   +-------+     +-------+     +-------+     +-------+     +-------+
>   |       |     |       |     |       |     |       |     |       |
>   | host1 |     | host2 |     |  NAT  |     | host3 |     | host4 |
>   |       |     |       |     |       |     |       |     |       |
>   +-------+     +-------+     +-------+     +-------+     +-------+
>     .1|           .2|     .254|       |.254   .3|           .4|
> +-----+-------------+---------|       |---------+-------------+-----+
>        192.168.11.0/24                         192.168.13.0/24
>
> The errors appear to be randomly spread across hosts and subnets. It
> does not appear consistently and seems to correlate with load (I am
> getting significant numbers at present and the NAT host is routing
> about 90Kpps and 100MBps if netstat can be believed). The problem
> also shows up on another interior routing host that has visibility
> across the internal networks so it isn't related to NAT or directly
> related to host load (that host is only seeing about 3.5Kpps - but is
> also a much slower host).
>
> I have managed to capture a tcpdump across the error.
> syslog reported:
> Jul 3 21:28:30 xxxx kernel: arplookup 192.168.169.26 failed: host is not on local network
> and the packets for that host during that second are:
> 21:28:30.320340 00:0b:cd:d6:66:26 > 00:03:ba:ab:6f:ef, ethertype 802.1Q (0x8100), length 64: vlan 169, p 0, ethertype IPv4, IP (tos 0x0, ttl 64, id 29304, offset 0, flags [none], length: 28) 192.168.169.26 > 192.168.169.111: icmp 8: echo request seq 35079
> 21:28:30.320429 00:d0:b7:20:8f:ee > 00:03:ba:ab:6f:ef, ethertype 802.1Q (0x8100), length 46: vlan 168, p 0, ethertype IPv4, IP (tos 0x0, ttl 63, id 29304, offset 0, flags [none], length: 28) 192.168.169.26 > 192.168.169.111: icmp 8: echo request seq 35079
> 21:28:30.320445 00:0b:cd:d6:66:26 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 169, p 0, ethertype ARP, arp who-has 192.168.169.250 tell 192.168.169.26
> 21:28:30.320459 00:0b:cd:d6:66:26 > 00:d0:b7:20:8f:ee, ethertype 802.1Q (0x8100), length 64: vlan 169, p 0, ethertype IPv4, IP (tos 0x0, ttl 64, id 29307, offset 0, flags [none], length: 28) 192.168.169.26 > 192.168.169.250: icmp 8: echo request seq 35079
> 21:28:30.320493 00:d0:b7:20:8f:ee > 00:0b:cd:d6:66:e4, ethertype 802.1Q (0x8100), length 46: vlan 168, p 0, ethertype IPv4, IP (tos 0x0, ttl 64, id 15305, offset 0, flags [none], length: 28) 192.168.169.250 > 192.168.169.26: icmp 8: echo reply seq 35079
> 21:28:30.320531 00:d0:b7:20:8f:ee > 00:0b:cd:d6:66:26, ethertype 802.1Q (0x8100), length 46: vlan 169, p 0, ethertype ARP, arp reply 192.168.169.250 is-at 00:d0:b7:20:8f:ee
> (this was captured MAC 00:d0:b7:20:8f:ee).
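On the resource-exhaustion question, here is my understanding of arplookup() in sys/netinet/if_ether.c from the 6.x/7.x era. It picks the reason text from the route entry the lookup returns, and it checks the gateway flag first. Below is a rough Python model of that decision; it is not the kernel code. RouteLookup, log_line and the flag values are made up for illustration. The two reasons other than "host is not on local network" are quoted from memory, so check the source for the release you actually run.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-ins for the kernel's RTF_GATEWAY / RTF_LLINFO route flags.
# The values are arbitrary; the real ones live in FreeBSD's <net/route.h>.
RTF_GATEWAY = 0x1
RTF_LLINFO = 0x2


@dataclass
class RouteLookup:
    """Hypothetical summary of what the route lookup for an address returned."""
    found: bool               # did the lookup return any route entry at all?
    flags: int = 0            # bitwise OR of the flag stand-ins above
    gw_is_link: bool = False  # is the entry's gateway a link-layer (AF_LINK) address?


def arplookup_reason(rt: RouteLookup) -> Optional[str]:
    """Reason text that would follow 'failed:' in the log, or None if nothing is logged."""
    if not rt.found:
        return None  # no entry at all: the lookup fails without this message
    if rt.flags & RTF_GATEWAY:
        return "host is not on local network"  # the route points at a gateway
    if not (rt.flags & RTF_LLINFO):
        return "could not allocate llinfo"  # no link-layer info attached
    if not rt.gw_is_link:
        return "gateway route is not ours"
    return None  # usable ARP entry, no failure


def log_line(addr: str, rt: RouteLookup) -> Optional[str]:
    """Build a message in the same shape as the syslog line quoted above."""
    why = arplookup_reason(rt)
    return None if why is None else f"arplookup {addr} failed: {why}"


if __name__ == "__main__":
    print(log_line("192.168.169.26", RouteLookup(found=True, flags=RTF_GATEWAY)))
    # arplookup 192.168.169.26 failed: host is not on local network
```

The point it illustrates is that "host is not on local network" is produced only when the route entry found has the gateway flag set. A failed allocation would show up under a different reason, or not at all.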
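To correlate the logs with the captures, a small script can do three things. It can pull out every arplookup failure, count failures per address and reason, and print the tcpdump lines for that address from one second before to one second after the logged second. This is only a sketch. It assumes the syslog line format quoted above and tcpdump's default HH:MM:SS.ffffff timestamps, the script and file names are placeholders, and it ignores midnight wrap-around.

```python
import re
import sys
from collections import Counter
from typing import Iterable, List, Tuple

# Kernel message as it appears in the syslog line quoted above.
ARP_FAIL = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}) \S+ kernel: arplookup (\S+) failed: (.+)$")
# tcpdump's default per-packet timestamp, e.g. "21:28:30.320340 ".
TCPDUMP_TS = re.compile(r"^(\d{2}):(\d{2}):(\d{2}\.\d+) ")


def parse_failures(syslog_lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (second_of_day, address, reason) for every arplookup failure."""
    out = []
    for line in syslog_lines:
        m = ARP_FAIL.search(line)
        if m:
            h, mi, s, addr, why = m.groups()
            out.append((int(h) * 3600 + int(mi) * 60 + int(s), addr, why.strip()))
    return out


def tally(failures: Iterable[Tuple[int, str, str]]) -> Counter:
    """Count failures per (address, reason) to look for a pattern."""
    return Counter((addr, why) for _, addr, why in failures)


def packets_near(second: int, dump_lines: Iterable[str], addr: str) -> List[str]:
    """tcpdump lines mentioning addr, from one second before to one second after.

    syslog keeps only whole seconds, so the window is [second - 1, second + 2).
    """
    # Keep 192.168.169.2 from matching inside 192.168.169.26.
    addr_re = re.compile(r"(?<![\d.])" + re.escape(addr) + r"(?!\d)")
    hits = []
    for line in dump_lines:
        m = TCPDUMP_TS.match(line)
        if not m:
            continue
        h, mi, s = m.groups()
        t = int(h) * 3600 + int(mi) * 60 + float(s)
        if second - 1 <= t < second + 2 and addr_re.search(line):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python arpwin.py /var/log/messages dump.txt
    with open(sys.argv[1]) as f_log, open(sys.argv[2]) as f_dump:
        failures = parse_failures(f_log)
        dump = f_dump.readlines()
    for (addr, why), n in tally(failures).most_common():
        print(f"{n:6d}  {addr}  {why}")
    for sec, addr, _ in failures:
        print("\n".join(packets_near(sec, dump, addr)))
```

The dump file would be the text output of something like `tcpdump -n -r capture.pcap`. If the clocks really are in sync, you can run the same script against captures from each machine.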