From owner-freebsd-sparc64@FreeBSD.ORG Tue Sep 2 21:00:07 2008 Return-Path: Delivered-To: freebsd-sparc64@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BA1A1106568B for ; Tue, 2 Sep 2008 21:00:07 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id C7EF08FC27 for ; Tue, 2 Sep 2008 21:00:06 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id m82L06Kj089111 for ; Tue, 2 Sep 2008 21:00:06 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.2/8.14.1/Submit) id m82L06HT089110; Tue, 2 Sep 2008 21:00:06 GMT (envelope-from gnats) Resent-Date: Tue, 2 Sep 2008 21:00:06 GMT Resent-Message-Id: <200809022100.m82L06HT089110@freefall.freebsd.org> Resent-From: FreeBSD-gnats-submit@FreeBSD.org (GNATS Filer) Resent-To: freebsd-sparc64@FreeBSD.org Resent-Reply-To: FreeBSD-gnats-submit@FreeBSD.org, Paulo Afonso Graner Fessel Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7A6581065671 for ; Tue, 2 Sep 2008 20:58:57 +0000 (UTC) (envelope-from nobody@FreeBSD.org) Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21]) by mx1.freebsd.org (Postfix) with ESMTP id 67C6A8FC26 for ; Tue, 2 Sep 2008 20:58:57 +0000 (UTC) (envelope-from nobody@FreeBSD.org) Received: from www.freebsd.org (localhost [127.0.0.1]) by www.freebsd.org (8.14.2/8.14.2) with ESMTP id m82KwuL4012460 for ; Tue, 2 Sep 2008 20:58:56 GMT (envelope-from nobody@www.freebsd.org) Received: (from nobody@localhost) by www.freebsd.org (8.14.2/8.14.1/Submit) id m82KwuWm012459; Tue, 2 Sep 2008 20:58:56 GMT (envelope-from nobody) Message-Id: <200809022058.m82KwuWm012459@www.freebsd.org> Date: Tue, 2 Sep 2008 20:58:56 GMT From: Paulo Afonso Graner Fessel To: freebsd-gnats-submit@FreeBSD.org X-Send-Pr-Version: www-3.1 Cc: Subject: sparc64/127051: hme interfaces "pause" with the message "device timeout" on FreeBSD 7.0/sparc64 on an Enterprise 220R X-BeenThere: freebsd-sparc64@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Porting FreeBSD to the Sparc List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 02 Sep 2008 21:00:07 -0000 >Number: 127051 >Category: sparc64 >Synopsis: hme interfaces "pause" with the message "device timeout" on FreeBSD 7.0/sparc64 on an Enterprise 220R >Confidential: no >Severity: serious >Priority: high >Responsible: freebsd-sparc64 >State: open >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Tue Sep 02 21:00:05 UTC 2008 >Closed-Date: >Last-Modified: >Originator: Paulo Afonso Graner Fessel >Release: 7.0 >Organization: Virtus TI >Environment: FreeBSD vtsfprfw01.virtus-ti.com.br 7.0-RELEASE FreeBSD 7.0-RELEASE #7: Fri Jun 20 19:29:52 BRT 2008 root@vtsuprfw01.dedalusprime.com.br:/usr/obj/usr/src/sys/VIRTUSFW sparc64 >Description: We have a pair of UltraSparc II servers configured as HA-firewall with carp and pfsync. I noticed that even with an advskew of zero on the primary firewall (vtsfprfw01) the carp interfaces end up migrating to the backup firewall, which has an advskew of 200. Here's ifconfig for the primary firewall (master): hme0: flags=8843 metric 0 mtu 1500 options=b ether 08:00:20:d0:c3:dd inet 192.168.0.1 netmask 0xffffff00 broadcast 192.168.0.255 media: Ethernet 100baseTX status: active hme1: flags=8b43 metric 0 mtu 1500 options=b ether 08:00:20:bc:a6:b4 inet 200.215.183.101 netmask 0xfffffff0 broadcast 200.215.183.111 media: Ethernet 100baseTX status: active hme2: flags=8b43 metric 0 mtu 1500 options=b ether 08:00:20:bc:a6:b5 inet 200.143.2.2 netmask 0xffffff00 broadcast 200.143.2.255 media: Ethernet 100baseTX status: active hme3: flags=8802 metric 0 mtu 1500 options=b ether 08:00:20:bc:a6:b6 media: Ethernet autoselect hme4: flags=8802 metric 0 mtu 1500 options=b ether 08:00:20:bc:a6:b7 media: Ethernet autoselect pflog0: flags=141 metric 0 mtu 33160 lo0: flags=8049 metric 0 mtu 16384 inet 127.0.0.1 netmask 0xff000000 pfsync0: flags=41 metric 0 mtu 1460 pfsync: syncdev: hme0 syncpeer: 192.168.0.2 maxupd: 128 carp0: flags=49 metric 0 mtu 1500 inet 200.215.183.100 netmask 0xfffffff0 carp: BACKUP vhid 1 advbase 1 advskew 0 carp1: flags=49 metric 0 mtu 1500 inet 200.143.2.1 netmask 0xffffff00 carp: BACKUP vhid 2 advbase 1 advskew 0 And the same, for the second firewall (backup): vtsfprfw02# ifconfig -a hme0: flags=8843 metric 0 mtu 1500 options=b ether 08:00:20:e7:39:31 inet 192.168.0.2 netmask 0xffffff00 broadcast 192.168.0.255 media: Ethernet 100baseTX status: active hme1: flags=8b43 metric 0 mtu 1500 options=b ether 08:00:20:bc:a3:a0 inet 200.215.183.102 netmask 0xfffffff0 broadcast 200.215.183.111 media: Ethernet 100baseTX status: active hme2: flags=8b43 metric 0 mtu 1500 options=b ether 08:00:20:bc:a3:a1 inet 200.143.2.3 netmask 0xffffff00 broadcast 200.143.2.255 media: Ethernet 100baseTX status: active hme3: flags=8802 metric 0 mtu 1500 options=b ether 08:00:20:bc:a3:a2 media: Ethernet autoselect hme4: flags=8802 metric 0 mtu 1500 options=b ether 08:00:20:bc:a3:a3 media: Ethernet autoselect pflog0: flags=141 metric 0 mtu 33160 lo0: flags=8049 metric 0 mtu 16384 inet 127.0.0.1 netmask 0xff000000 pfsync0: flags=41 metric 0 mtu 1460 pfsync: syncdev: hme0 syncpeer: 192.168.0.1 maxupd: 128 carp0: flags=49 metric 0 mtu 1500 inet 200.215.183.100 netmask 0xfffffff0 carp: MASTER vhid 1 advbase 1 advskew 200 carp1: flags=49 metric 0 mtu 1500 inet 200.143.2.1 netmask 0xffffff00 carp: MASTER vhid 2 advbase 1 advskew 200 After noticing this, I also saw that "local-mac-address?" on the first firewall was set to "false", what caused all the interface ports to show the same MAC address. I've fixed this and rebooted the server, to investigate if this had something to do with the issue. Everything was alright during approximate 30 minutes, when the firewall has changed to the secondary machine. Here's an excerpt from /var/log/messages from the primary firewall: Sep 2 10:18:51 vtsfprfw01 ftp-proxy[929]: listening on 127.0.0.1 port 8021 Sep 2 10:18:51 vtsfprfw01 getty[943]: open /dev/ttyv2: No such file or directory Sep 2 10:18:51 vtsfprfw01 getty[945]: open /dev/ttyv4: No such file or directory Sep 2 10:18:51 vtsfprfw01 getty[941]: open /dev/ttyv0: No such file or directory Sep 2 10:18:51 vtsfprfw01 getty[948]: open /dev/ttyv7: No such file or directory Sep 2 10:18:51 vtsfprfw01 getty[946]: open /dev/ttyv5: No such file or directory Sep 2 10:18:51 vtsfprfw01 getty[944]: open /dev/ttyv3: No such file or directory Sep 2 10:18:51 vtsfprfw01 getty[942]: open /dev/ttyv1: No such file or directory Sep 2 10:18:51 vtsfprfw01 getty[947]: open /dev/ttyv6: No such file or directory Sep 2 10:18:58 vtsfprfw01 login: ROOT LOGIN (root) ON ttyu0 Sep 2 10:44:43 vtsfprfw01 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 2 10:44:43 vtsfprfw01 kernel: hme2: device timeout Sep 2 10:44:48 vtsfprfw01 kernel: carp0: BACKUP -> MASTER (preempting a slower master) Sep 2 10:44:48 vtsfprfw01 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 2 11:11:48 vtsfprfw01 kernel: hme2: device timeout Sep 2 11:35:19 vtsfprfw01 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 2 11:35:19 vtsfprfw01 kernel: hme2: device timeout Sep 2 11:35:24 vtsfprfw01 kernel: carp0: BACKUP -> MASTER (preempting a slower master) Sep 2 11:35:24 vtsfprfw01 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 2 13:37:16 vtsfprfw01 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received) Sep 2 13:37:16 vtsfprfw01 kernel: hme1: device timeout Sep 2 13:37:17 vtsfprfw01 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 2 16:11:06 vtsfprfw01 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK) Sep 2 16:11:14 vtsfprfw01 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received) As it can be seen from the logs, there's a number of messages like "hmeX: device timeout". When this happens, carp0 and carp1 see this as a disconnection and are forced to the secondary firewall. The most interesting is part is that I don't lose communication with either machines: I'm able to ping the first firewall normally after the event, and I don't get messages neither in the OS or on the switch pointing to link loss. The secondary server also shows the same problem (hmeX device timeout): Sep 1 15:58:00 vtsfprfw02 kernel: hme0: device timeout Sep 1 16:35:00 vtsfprfw02 kernel: hme0: device timeout Sep 1 16:35:40 vtsfprfw02 last message repeated 2 times Sep 1 16:45:33 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received) Sep 1 16:45:34 vtsfprfw02 kernel: hme1: device timeout Sep 1 16:45:35 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 1 16:45:47 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 1 16:45:47 vtsfprfw02 kernel: carp1: BACKUP -> MASTER (preempting a slower master) Sep 1 16:45:47 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK) Sep 1 16:45:48 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 1 16:45:48 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master) Sep 1 16:45:48 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 1 17:12:16 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received) Sep 1 17:12:16 vtsfprfw02 kernel: hme1: device timeout Sep 1 17:12:17 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 1 17:44:33 vtsfprfw02 kernel: arp: 200.215.183.110 is on hme1 but got reply from 00:14:d1:38:92:ba on hme2 Sep 1 17:45:05 vtsfprfw02 last message repeated 21 times Sep 1 17:45:27 vtsfprfw02 last message repeated 15 times Sep 1 18:55:48 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 1 18:55:48 vtsfprfw02 kernel: carp1: BACKUP -> MASTER (preempting a slower master) Sep 1 18:55:48 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK) Sep 1 18:55:48 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 1 18:55:49 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master) Sep 1 18:55:49 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 1 19:24:43 vtsfprfw02 sshd[34638]: error: PAM: authentication error for pfessel from 201.20.234.104 Sep 1 19:24:46 vtsfprfw02 sshd[34641]: error: ssh_msg_send: write Sep 2 10:19:31 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 2 10:19:50 vtsfprfw02 kernel: arp: 200.143.2.2 moved from 08:00:20:d0:c3:dd to 08:00:20:bc:a6:b5 on hme2 Sep 2 10:19:51 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received) Sep 2 10:45:15 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK) Sep 2 10:45:15 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master) Sep 2 10:45:15 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 2 10:45:20 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 2 10:45:26 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received) Sep 2 11:12:19 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK) Sep 2 11:12:32 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received) Sep 2 11:35:51 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK) Sep 2 11:35:51 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master) Sep 2 11:35:51 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 2 11:35:56 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 2 11:36:06 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received) Sep 2 13:37:47 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 2 13:37:48 vtsfprfw02 kernel: carp1: BACKUP -> MASTER (preempting a slower master) Sep 2 13:37:48 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK) Sep 2 13:37:48 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received) Sep 2 13:37:49 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master) Sep 2 13:37:49 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK) Sep 2 14:01:06 vtsfprfw02 sshd[37221]: error: PAM: authentication error for root from 200.143.2.125 Sep 2 14:01:08 vtsfprfw02 sshd[37221]: error: PAM: authentication error for root from 200.143.2.125 Sep 2 16:07:37 vtsfprfw02 sshd[37559]: error: PAM: authentication error for root from 200.143.2.125 Sep 2 16:08:17 vtsfprfw02 last message repeated 3 times Sep 2 16:08:35 vtsfprfw02 sshd[37559]: error: PAM: authentication error for root from 200.143.2.125 Sep 2 16:08:38 vtsfprfw02 sshd[37566]: error: ssh_msg_send: write Sep 2 16:08:50 vtsfprfw02 sshd[37567]: error: PAM: authentication error for root from 200.143.2.125 Sep 2 16:09:00 vtsfprfw02 last message repeated 5 times Sep 2 16:11:40 vtsfprfw02 kernel: hme2: device timeout This log is even more interesting, as it's shown here there's the first transition of the carp interfaces from MASTER to BACKUP from when I first rebooted the master firewall to fix the eeprom issue; but there are also other transitions from BACKUP to MASTER to BACKUP to MASTER again, which haven't been logged in the master firewall! Now, some background information about network topology. * carp0 is connected to hme1 in both servers (master and backup), and points to our default internet gateway; * carp1 is connected to hme2 in both servers (master and backup), and points to our data center's front-end network (200.143.2.0/24) * pfsync0 is associated to hme0 in both servers. We don't use a back-to-back connection via crossover cable, but instead we use a dedicated VLAN on one of our switches. Because of this I've chosen to specify syncpeers on both servers, with the master firewall pointing to the backup firewall and vice-versa, always using hme0 as the physical device for pfsync0. I've checked that VHIDs from other firewalls which use this VLAN don't coincide with the ones we use in this particular configuration. Finally, hardware configuration follows: * Master firewall: E220R with 2 USII 450 MHz processors, 2 SUN18G SCSI hard disks, 2 GB RAM, one Sun QFE PCI interface (hme1-hme4 ports) plus the onboard HME interface (hme0). * Backup firewall: E420R with 2 USII 450 MHz processors, 2 SUN18G SCSI hard disks, 2 GB RAM, one Sun QFE PCI interface (hme1-hme4 ports) plus the onboard HME interface (hme0). As you can see, both machines are nearly identical. My feeling is that this behaviour has something to do with something specific to sparc64 architecture, as I do have other 2 Sun QFE PCI boards running on Intel architecture (Dell Poweredges 1650 and 1750 respectively) and there are no such issues whatsoever. These i386 servers have already an uptime of 11 days and there is no "hmeX: device timeout" in either server. >How-To-Repeat: Just configure the networking in the servers as shown and then is just a question of time for the problem to appear. It can take days, hours, or minutes, but it will show up after some time. >Fix: >Release-Note: >Audit-Trail: >Unformatted: