From owner-freebsd-net@FreeBSD.ORG  Thu Jul  3 19:23:31 2008
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 94DAA1065688
	for <freebsd-net@freebsd.org>; Thu,  3 Jul 2008 19:23:31 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (mail.bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id 6546A8FC16
	for <freebsd-net@freebsd.org>; Thu,  3 Jul 2008 19:23:31 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id 5CD5D5B4C;
	Thu,  3 Jul 2008 12:05:13 -0700 (PDT)
To: Peter Jeremy <peterjeremy@optushome.com.au>
In-reply-to: Your message of "Thu, 03 Jul 2008 21:52:43 +1000."
	<20080703115243.GR29380@server.vk2pj.dyndns.org> 
Date: Thu, 03 Jul 2008 12:05:13 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20080703190513.5CD5D5B4C@mail.bitblocks.com>
Cc: freebsd-net@freebsd.org
Subject: Re: arplookup x.x.x.x failed: host is not on local network 
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 03 Jul 2008 19:23:31 -0000

> Possibly, I'm seeing packet leakage from the switches and that is
> confusing FreeBSD - definitely the first packet above should not be
> visible.

Even if the switch broadcasts on all ports (effectively
becoming a hub), that should not cause the symptom you are
seeing.  If the switch had sent an ARP response to the wrong
port, you would have seen the corresponding ARP request at
least on the sending machine.  There is no such packet (for
.26) in your tcpdump output.  That means either there was no
such packet or you failed to capture it!

You said you see the problem with different FreeBSD versions.
Did you boot different versions on the same hardware, or do
you mean different versions are running on different hosts?
If the latter, specific FreeBSD versions are not ruled out.
You might want to capture many more arplookup failed messages
to see if there is a pattern.
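
A rough sketch of how one might tally those messages, assuming the
syslog lines look like the one you quoted.  The names ARP_FAILED and
tally are made up for illustration:

```python
import re
from collections import Counter

# Matches syslog lines of the form quoted in this thread, e.g.
# Jul  3 21:28:30 xxxx kernel: arplookup 192.168.169.26 failed: host is not on local network
ARP_FAILED = re.compile(
    r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+(?P<host>\S+)\s+kernel:\s+"
    r"arplookup\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+failed:\s+(?P<reason>.*)$"
)

def tally(lines):
    """Count arplookup failures per looked-up IP and per logging host."""
    by_ip = Counter()
    by_host = Counter()
    for line in lines:
        m = ARP_FAILED.match(line.strip())
        if m:
            by_ip[m.group("ip")] += 1
            by_host[m.group("host")] += 1
    return by_ip, by_host
```

Feeding it the lines of /var/log/messages from each host and comparing
the two counters should show whether the failures cluster on particular
addresses or particular machines.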

Earlier you had wondered if resource exhaustion was to blame.
That is ruled out by the arplookup failed message, since the
reason it gives indicates the route goes to a gateway.
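
To illustrate the reasoning, here is a toy model (not the kernel code)
of a lookup that reports this reason when the best matching route goes
via a gateway.  The routing table entries, the /24 prefix and the
function names are all assumptions for the example:

```python
import ipaddress

# Hypothetical routing table: (prefix, route_goes_via_gateway).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), True),          # default route
    (ipaddress.ip_network("192.168.169.0/24"), False),  # directly connected
]

def best_route(addr, routes):
    """Longest-prefix match; returns None when nothing matches."""
    ip = ipaddress.ip_address(addr)
    matches = [r for r in routes if ip in r[0]]
    if not matches:
        return None
    return max(matches, key=lambda r: r[0].prefixlen)

def arplookup_result(addr, routes):
    """Model only the case discussed here: a gateway route gives this reason."""
    route = best_route(addr, routes)
    if route is None:
        return "no route"
    if route[1]:
        return "host is not on local network"
    return "ok"
```

With both routes present, 192.168.169.26 resolves via the connected
network.  If the connected route is missing at the moment of lookup,
the default route wins and the model reports the gateway reason.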

We don't see any ARP request for .26, so this likely means .26
is not the one doing the ARP lookup (on receiving a request)
and the arplookup failed message is on .111, right?  We see
packets flowing from .26 to .111 but not the other way
around.  What does netstat -nr look like on .111?
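
One way to check the "one direction only" observation on a larger
capture is to count IPv4 flows per (source, destination) pair in the
tcpdump text output.  This is a sketch that assumes the output format
in your mail; flows and one_way are made-up names:

```python
import re
from collections import Counter

# Matches "a.b.c.d > e.f.g.h:" as printed for IPv4 packets.
# MAC addresses and ARP lines do not match and are skipped.
FLOW = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}) > (\d{1,3}(?:\.\d{1,3}){3}):")

def flows(lines):
    """Count packets per (source IP, destination IP) pair."""
    counts = Counter()
    for line in lines:
        m = FLOW.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

def one_way(counts):
    """Return the pairs that have no packet in the reverse direction."""
    return sorted(pair for pair in counts if (pair[1], pair[0]) not in counts)
```

Any pair returned by one_way is a candidate for a closer look at the
routing table on the destination host.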

If all the clocks are synchronized, you might want to capture
tcpdump on *all* the machines!  Since the syslog timestamp has
a granularity of 1 sec, you want to look at packets within a
second before and a second after.
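
A small sketch of that filtering step, assuming tcpdump's default
HH:MM:SS.ffffff timestamps at the start of each line and a syslog time
of the form HH:MM:SS.  The function names are invented, and it ignores
captures that wrap past midnight:

```python
from datetime import datetime, timedelta

def parse_time(text):
    """Parse a time of day with or without a fractional part."""
    fmt = "%H:%M:%S.%f" if "." in text else "%H:%M:%S"
    return datetime.strptime(text, fmt)

def near(lines, syslog_time, slack=timedelta(seconds=1)):
    """Keep lines stamped from one second before the syslog second
    until one second after that second has ended."""
    start = parse_time(syslog_time) - slack
    end = parse_time(syslog_time) + timedelta(seconds=1) + slack
    kept = []
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue
        try:
            stamp = parse_time(fields[0])
        except ValueError:
            continue  # not a packet line
        if start <= stamp < end:
            kept.append(line)
    return kept
```

For the message logged at 21:28:30 this keeps packets stamped from
21:28:29.000000 up to, but not including, 21:28:32.000000.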

BTW, your picture is nice but it doesn't jibe with anything
in the tcpdump output you attached!

> 			       Corp Network
>          192.168.10.0/24       	     |             192.168.12.0/24
>   +------+-------------+----------|  |  |----------+-------------+-----+
>        .1|           .2|      .254|  |  |.254    .3|           .4|
>      +-------+     +-------+     +-------+     +-------+     +-------+
>      |       |     |       |     |       |     |       |     |       |
>      | host1 |     | host2 |     |  NAT  |     | host3 |     | host4 |
>      |       |     |       |     |       |     |       |     |       |
>      +-------+     +-------+     +-------+     +-------+     +-------+
>        .1|           .2|      .254|     |.254    .3|           .4|
>   +------+-------------+----------|     |----------+-------------+-----+
>          192.168.11.0/24       	                   192.168.13.0/24
> 
> The errors appear to be randomly spread across hosts and subnets.  It
> does not appear consistently and seems to correlate with load (I am
> getting significant numbers at present and the NAT host is routing
> about 90Kpps and 100MBps if netstat can be believed).  The problem
> also shows up on another interior routing host that has visibility
> across the internal networks so it isn't related to NAT or directly
> related to host load (that host is only seeing about 3.5Kpps - but is
> also a much slower host).
> 
> I have managed to capture a tcpdump across the error.  syslog reported:
> Jul  3 21:28:30 xxxx kernel: arplookup 192.168.169.26 failed: host is not on local network
> and the packets for that host during that second are:
> 21:28:30.320340 00:0b:cd:d6:66:26 > 00:03:ba:ab:6f:ef, ethertype 802.1Q (0x8100), length 64: vlan 169, p 0, ethertype IPv4, IP (tos 0x0, ttl  64, id 29304, offset 0, flags [none], length: 28) 192.168.169.26 > 192.168.169.111: icmp 8: echo request seq 35079
> 21:28:30.320429 00:d0:b7:20:8f:ee > 00:03:ba:ab:6f:ef, ethertype 802.1Q (0x8100), length 46: vlan 168, p 0, ethertype IPv4, IP (tos 0x0, ttl  63, id 29304, offset 0, flags [none], length: 28) 192.168.169.26 > 192.168.169.111: icmp 8: echo request seq 35079
> 21:28:30.320445 00:0b:cd:d6:66:26 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 169, p 0, ethertype ARP, arp who-has 192.168.169.250 tell 192.168.169.26
> 21:28:30.320459 00:0b:cd:d6:66:26 > 00:d0:b7:20:8f:ee, ethertype 802.1Q (0x8100), length 64: vlan 169, p 0, ethertype IPv4, IP (tos 0x0, ttl  64, id 29307, offset 0, flags [none], length: 28) 192.168.169.26 > 192.168.169.250: icmp 8: echo request seq 35079
> 21:28:30.320493 00:d0:b7:20:8f:ee > 00:0b:cd:d6:66:e4, ethertype 802.1Q (0x8100), length 46: vlan 168, p 0, ethertype IPv4, IP (tos 0x0, ttl  64, id 15305, offset 0, flags [none], length: 28) 192.168.169.250 > 192.168.169.26: icmp 8: echo reply seq 35079
> 21:28:30.320531 00:d0:b7:20:8f:ee > 00:0b:cd:d6:66:26, ethertype 802.1Q (0x8100), length 46: vlan 169, p 0, ethertype ARP, arp reply 192.168.169.250 is-at 00:d0:b7:20:8f:ee
> (this was captured MAC 00:d0:b7:20:8f:ee).