Date: Fri, 30 Apr 2010 02:44:24 -0400 (EDT)
From: Charles Sprickman
To: freebsd-net@freebsd.org
Subject: Re: 8.0 carp problems

On Wed, 21 Apr 2010, Charles Sprickman wrote:

> Hello,
>
> I still need to gather more info when I visit the datacenter to reboot one of
> the problematic hosts, but I wanted to verify my basic carp config here was
> solid.

Said machine has been booted and is also on a remote power switch now. This
keeps happening. The other host running carp+dnscache has not had any
problems. It has the same config, same pf.conf rules (both the internal
interface and carp interfaces are skipped - "set skip on ..."). The other
host has an em interface.

More info inline...

> I have two hosts that are running a recursing name server on our internal
> network for other servers.
> Since failover from multiple entries in
> resolv.conf is painfully slow, I decided to start using carp to deal with
> possible dns failure by having pairs of boxes set up with carp in arpbalance
> mode. Testing in vmware proved this works well...
>
> However, after about 18 hours, one of the carp hosts (with an fxp interface)
> panicked. Then after coming back up after that, it hard locked - no serial
> console response, no ping response on either internal or external interfaces.
> The other carp host (em interface) continues to run with no issues.
>
> My config on each box is pretty simple:
>
> carp-1:
>
> (rc.conf)
> ifconfig_fxp1="inet 192.168.1.107 netmask 255.255.255.0 media 100baseTX mediaopt full-duplex"
> # carp stuff - this sets up two vhids, required for arpbalance
> cloned_interfaces="carp0 carp1"
> ifconfig_carp0="vhid 1 pass foobar 192.168.1.254/24"
> ifconfig_carp1="vhid 2 advskew 100 pass foobar 192.168.1.254/24"
>
> (sysctl.conf)
> net.inet.carp.arpbalance=1
> net.inet.carp.preempt=1
>
> carp-2:
>
> (rc.conf)
> ifconfig_em0="inet 192.168.1.121 netmask 255.255.255.0 media 1000baseTX mediaopt full-duplex"
> # carp stuff - this sets up two vhids, required for arpbalance
> cloned_interfaces="carp0 carp1"
> ifconfig_carp0="vhid 1 advskew 100 pass foobar 192.168.1.254/24"
> ifconfig_carp1="vhid 2 pass foobar 192.168.1.254/24"
>
> (sysctl.conf)
> net.inet.carp.arpbalance=1
> net.inet.carp.preempt=1
>
> When carp-1 panicked, this is what I saw in carp-2's logs:
>
> Apr 20 22:32:40 h21 kernel: carp0: link state changed to UP
> Apr 20 22:39:52 h21 kernel: carp0: incorrect hash
> Apr 20 22:39:52 h21 kernel: carp1: incorrect hash
> Apr 20 22:39:52 h21 kernel: arp: 00:00:5e:00:01:02 is using my IP address 192.168.1.254 on em0!
> Apr 20 22:39:54 h21 kernel: carp0: link state changed to DOWN
> Apr 20 22:40:19 h21 kernel: carp0: link state changed to UP
>
> carp-1 managed to squirt this out on the console before locking up:
>
> fxp1: discard frame w/o leading ethernet header (len 4294967294 pkt len 4294967294)
>
> I'm bringing this up here since I've seen some traffic on -stable lately
> regarding issues with some of the intel nics. I figure carp is probably
> doing some "interesting" things to create virtual macs and the like.
>
> The two nics I'm using are as follows.
>
> carp-1:
>
> fxp1: port 0xd000-0xd03f mem 0xfe9fd000-0xfe9fdfff,0xfe600000-0xfe6fffff irq 21 at device 5.0 on pci0
> miibus1: on fxp1
> inphy1: PHY 1 on miibus1
> inphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
> fxp1: Ethernet address: 00:e0:81:03:b0:13
> fxp1: [ITHREAD]

pciconf:

fxp1@pci0:0:5:0: class=0x020000 card=0x100c8086 chip=0x12298086 rev=0x08 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82550/1/7/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
    class      = network
    subclass   = ethernet

> carp-2:
>
> em0: port 0x3800-0x381f mem 0xfc220000-0xfc23ffff,0xfc200000-0xfc21ffff irq 31 at device 4.0 on pci3
> em0: [FILTER]
> em0: Ethernet address: 00:30:48:12:2d:60

pciconf:

em0@pci0:3:4:0: class=0x020000 card=0x100d8086 chip=0x100d8086 rev=0x02 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Gigabit Ethernet Controller (LOM) (82544GC)'
    class      = network
    subclass   = ethernet

> I've not fiddled with any settings on either nic beyond forcing media and
> duplex - so if checksum offloading is enabled/disabled by default, that's
> what I'd be using.

fxp1 on carp-1 supports and enables "rxcsum" by default. Disabling it has
not stopped the panics.

> I can supply more information if needed. I need to boot the locked box,
> enable dumps, and get more info on the revision of the fxp nics on it.

Dumps have been enabled, but don't seem to be working very well.
One did write out a "core.txt" file which has some decent information:

panic: page fault
cpuid = 1
Uptime: 4d19h43m50s
Physical memory: 2035 MB
Dumping 269 MB: 254 238 222 206 190 174 158 142 126 110 94 78 62 46 30 14

[reading symbols from zfs.ko, opensolaris.ko, geom_mirror.ko, sym.ko, pflog.ko, pf.ko]

#0  doadump () at pcpu.h:246
246     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) #0  doadump () at pcpu.h:246
#1  0x80674117 in boot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:416
#2  0x80674409 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:579
#3  0x808f3a5c in trap_fatal (frame=0xda1b9895, eva=0)
    at /usr/src/sys/i386/i386/trap.c:933
#4  0x808f3cc0 in trap_pfault (frame=0xda1b9895, usermode=0, eva=0)
    at /usr/src/sys/i386/i386/trap.c:846
#5  0x808f4679 in trap (frame=0xda1b9895)
    at /usr/src/sys/i386/i386/trap.c:528
#6  0x808d786b in calltrap () at /usr/src/sys/i386/i386/exception.s:165
#7  0x86ec7905 in pf_test (dir=2, ifp=0x85694400, m0=0xda1b9a3c, eh=0x0,
    inp=0x86d83c08) at mbuf.h:997
#8  0x86ecf77c in pf_check_out (arg=0x0, m=0xda1b9a3c, ifp=0x85694400, dir=2,
    inp=0x86d83c08)
    at /usr/src/sys/modules/pf/../../contrib/pf/net/pf_ioctl.c:3686
#9  0x807264d8 in pfil_run_hooks (ph=Cannot access memory at address 0x401b4)
    at /usr/src/sys/net/pfil.c:81
Previous frame inner to this frame (corrupt stack?)
(kgdb)

[...]
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x0
fault code              = supervisor write, page not present
instruction pointer     = 0x20:0x86ec7905
stack pointer           = 0x28:0xda1b98d5
frame pointer           = 0x28:0xda1b99dc
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 761 (dnscache)
trap number             = 12
panic: page fault
cpuid = 1
Uptime: 4d19h43m50s
Physical memory: 2035 MB
Dumping 269 MB: 254 238 222 206 190 174 158 142 126 110 94 78 62 46 30 14

This file also contains ps -axl, vmstat, netstat, and a whole mess of other
stuff. I can supply any of that.

Looking for any recommendations on how to troubleshoot this. I have no idea
if this is carp, fxp, or pf causing the issue. I'm leaning towards the fxp
driver since the other host is not having any issues.

Tonight while getting this info together and trying to run kgdb (/usr/src
and /usr/obj are nfs mounted off of fxp1), I watched performance degrade
horribly on that interface (nfs timeouts, ssh connection stalling). There
were also occasional hangs on the serial console session when I attempted to
run tcpdump on fxp1. Eventually connectivity failed on fxp1 until a reboot.
pf was showing no blocked packets, netstat and the switch stats showed no
errors or drops on the interface. Arp resolution started failing and nothing
but a reboot brought the network back. Bringing the interface up/down,
turning on/off rxcsum did not bring anything back.

Can anyone point me in the right direction? Should I give any particular
-stable snapshot a spin? An 8.1-beta (not sure if that was tagged yet)?

I'd really appreciate any input on this...

Thanks,

Charles

> Thanks,
>
> Charles
>
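
For anyone reading the logs above, two of the raw numbers can be decoded into something more recognizable. The Python sketch below is only an illustration: the function names are made up for this note and are not part of any FreeBSD tool. It relies on CARP deriving its IPv4 virtual MAC as 00:00:5e:00:01:<vhid>, so the address in the "is using my IP address" message maps back to vhid 2. It also reinterprets the length that fxp1 printed as a signed 32-bit value. What a negative length means for the driver is not something the sketch can answer.

```python
def carp_virtual_mac(vhid: int) -> str:
    """Return the IPv4 virtual MAC CARP derives from a vhid (hypothetical helper)."""
    if not 1 <= vhid <= 255:
        raise ValueError("vhid must be in 1..255")
    # The last octet of the virtual MAC is simply the vhid.
    return "00:00:5e:00:01:%02x" % vhid


def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


if __name__ == "__main__":
    # MAC seen in carp-2's arp warning; matches vhid 2 from the rc.conf above.
    print(carp_virtual_mac(2))      # 00:00:5e:00:01:02
    # Length printed by fxp1 in "discard frame w/o leading ethernet header".
    print(as_signed32(4294967294))  # -2
```

Read this way, the console message reports a frame length of -2 stored in an unsigned field, which at least suggests a length that underflowed rather than a genuinely huge frame.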