From owner-freebsd-net@FreeBSD.ORG  Wed Apr 21 05:52:45 2010
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C9A9B106567F
	for <freebsd-net@freebsd.org>; Wed, 21 Apr 2010 05:52:45 +0000 (UTC)
	(envelope-from spork@bway.net)
Received: from xena.bway.net (xena.bway.net [216.220.96.26])
	by mx1.freebsd.org (Postfix) with ESMTP id 85FDC8FC08
	for <freebsd-net@freebsd.org>; Wed, 21 Apr 2010 05:52:45 +0000 (UTC)
Received: (qmail 23056 invoked by uid 0); 21 Apr 2010 05:26:04 -0000
Received: from unknown (HELO ?10.3.2.41?) (spork@96.57.144.66)
	by smtp.bway.net with (DHE-RSA-AES256-SHA encrypted) SMTP;
	21 Apr 2010 05:26:04 -0000
Date: Wed, 21 Apr 2010 01:26:03 -0400 (EDT)
From: Charles Sprickman <spork@bway.net>
X-X-Sender: spork@hotlap.local
To: freebsd-net@freebsd.org
Message-ID: <alpine.OSX.2.00.1004210108030.1000@hotlap.local>
User-Agent: Alpine 2.00 (OSX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
Subject: 8.0 carp problems
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
X-List-Received-Date: Wed, 21 Apr 2010 05:52:46 -0000

Hello,

I still need to gather more info when I visit the datacenter to reboot one
of the problematic hosts, but I wanted to verify that my basic carp config
here is solid.

I have two hosts running a recursive name server on our internal
network for other servers.  Since failover between multiple entries in
resolv.conf is painfully slow, I decided to use carp to deal with
possible DNS failure by having pairs of boxes set up with carp in
arpbalance mode.  Testing in vmware showed this works well...

However, after about 18 hours, one of the carp hosts (with an fxp
interface) panicked.  After coming back up, it hard locked:
no serial console response and no ping response on either the internal or
external interface.  The other carp host (em interface) continues to run
with no issues.

My config on each box is pretty simple:

carp-1:

(rc.conf)
ifconfig_fxp1="inet 192.168.1.107 netmask 255.255.255.0 media 100baseTX mediaopt full-duplex"
# carp stuff - this sets up two vhids, required for arpbalance
cloned_interfaces="carp0 carp1"
ifconfig_carp0="vhid 1 pass foobar 192.168.1.254/24"
ifconfig_carp1="vhid 2 advskew 100 pass foobar 192.168.1.254/24"

(sysctl.conf)
net.inet.carp.arpbalance=1
net.inet.carp.preempt=1

carp-2:

(rc.conf)
ifconfig_em0="inet 192.168.1.121 netmask 255.255.255.0 media 1000baseTX mediaopt full-duplex"
# carp stuff - this sets up two vhids, required for arpbalance
cloned_interfaces="carp0 carp1"
ifconfig_carp0="vhid 1 advskew 100 pass foobar 192.168.1.254/24"
ifconfig_carp1="vhid 2 pass foobar 192.168.1.254/24"

(sysctl.conf)
net.inet.carp.arpbalance=1
net.inet.carp.preempt=1
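
As a sanity check on the intended split, here is a minimal sketch in Python of which host should end up master for each vhid. It assumes the default advbase of 1 second (neither rc.conf sets it), an advskew of 0 where the option is omitted, and the interval described in carp(4), where advskew is added to advbase in units of 1/256 second. The helper names are made up for illustration and are not part of any carp tooling.

```python
# Hypothetical helper names; advskew values are copied from the rc.conf lines above.
ADVBASE_DEFAULT = 1  # seconds; assumed, since neither rc.conf sets advbase


def advert_interval(advskew, advbase=ADVBASE_DEFAULT):
    """Seconds between CARP advertisements: advbase + advskew/256."""
    return advbase + advskew / 256


# advskew per host for each vhid (0 where the option is omitted)
config = {
    1: {"carp-1": 0, "carp-2": 100},
    2: {"carp-1": 100, "carp-2": 0},
}


def expected_master(skews):
    """The host that advertises most often (smallest interval) should win."""
    return min(skews, key=lambda host: advert_interval(skews[host]))


for vhid, skews in sorted(config.items()):
    print(f"vhid {vhid}: expected master {expected_master(skews)}")
```

With the values above this gives carp-1 for vhid 1 and carp-2 for vhid 2, which is the split arpbalance needs while both hosts are healthy.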

When carp-1 panicked, this is what I saw in carp-2's logs:

Apr 20 22:32:40 h21 kernel: carp0: link state changed to UP
Apr 20 22:39:52 h21 kernel: carp0: incorrect hash
Apr 20 22:39:52 h21 kernel: carp1: incorrect hash
Apr 20 22:39:52 h21 kernel: arp: 00:00:5e:00:01:02 is using my IP address 192.168.1.254 on em0!
Apr 20 22:39:54 h21 kernel: carp0: link state changed to DOWN
Apr 20 22:40:19 h21 kernel: carp0: link state changed to UP
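
For reading the arp line above: CARP virtual MACs use the VRRP range, 00:00:5e:00:01:XX, where XX is the vhid in hex. A minimal Python sketch of the mapping follows; the function names are hypothetical and only illustrate the address format.

```python
PREFIX = "00:00:5e:00:01:"  # VRRP/CARP virtual MAC prefix


def carp_virtual_mac(vhid):
    """Return the virtual MAC for a vhid: 00:00:5e:00:01:<vhid in hex>."""
    if not 1 <= vhid <= 255:
        raise ValueError("vhid must be between 1 and 255")
    return PREFIX + "%02x" % vhid


def vhid_from_mac(mac):
    """Return the vhid encoded in a CARP virtual MAC, or None if it is not one."""
    mac = mac.lower()
    if len(mac) != 17 or not mac.startswith(PREFIX):
        return None
    return int(mac[len(PREFIX):], 16)


print(vhid_from_mac("00:00:5e:00:01:02"))  # 2
```

So the address in the log is the virtual MAC for vhid 2 (carp1 in the configs above), not the physical address of either NIC.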

carp-1 managed to squirt this out on the console before locking up:

fxp1: discard frame w/o leading ethernet header (len 4294967294 pkt len 4294967294)
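
One note on that number: 4294967294 is 2^32 - 2, which is what -2 looks like when printed as an unsigned 32-bit value. The short Python check below only shows the arithmetic; whether the driver really ended up with a negative length is an assumption, not something the message itself proves.

```python
import struct


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


print(as_signed32(4294967294))  # -2
```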

I'm bringing this up here since I've seen some traffic on -stable lately 
regarding issues with some of the intel nics.  I figure carp is probably 
doing some "interesting" things to create virtual macs and the like.

The two nics I'm using are as follows.

carp-1:

fxp1: <Intel 82559 Pro/100 Ethernet> port 0xd000-0xd03f mem 0xfe9fd000-0xfe9fdfff,0xfe600000-0xfe6fffff irq 21 at device 5.0 on pci0
miibus1: <MII bus> on fxp1
inphy1: <i82555 10/100 media interface> PHY 1 on miibus1
inphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp1: Ethernet address: 00:e0:81:03:b0:13
fxp1: [ITHREAD]

carp-2:

em0: <Intel(R) PRO/1000 Network Connection 6.9.14> port 0x3800-0x381f mem 0xfc220000-0xfc23ffff,0xfc200000-0xfc21ffff irq 31 at device 4.0 on pci3
em0: [FILTER]
em0: Ethernet address: 00:30:48:12:2d:60

I've not fiddled with any settings on either nic beyond forcing media and 
duplex - so if checksum offloading is enabled/disabled by default, that's 
what I'd be using.

I can supply more information if needed.  I need to boot the locked box, 
enable dumps, and get more info on the revision of the fxp nics on it.

Thanks,

Charles