Date: Thu, 14 Apr 2011 15:38:51 -0500 (CDT)
From: Thomas Johnson <tom@claimlynx.com>
To: FreeBSD-gnats-submit@FreeBSD.org
Cc: jpaetzel@freebsd.org, , root@claimlynx.com
Subject: amd64/156408: Routing failure when using VLANs vs. Physical ethernet interfaces.
Message-ID: <20110414203851.4AC2611F863@jaguar-2.claimlynx.com>
Resent-Message-ID: <201104142100.p3EL0JTq098378@freefall.freebsd.org>

>Number:         156408
>Category:       amd64
>Synopsis:       Routing failure when using VLANs vs. Physical ethernet interfaces.
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-amd64
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Apr 14 21:00:19 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator:     Thomas Johnson
>Release:        FreeBSD 8.2-RELEASE amd64
>Organization:
ClaimLynx, Inc.
>Environment:
System: FreeBSD jaguar-2.claimlynx.com 8.2-RELEASE FreeBSD 8.2-RELEASE #8: Sat Feb 26 21:23:00 CST 2011 root@jaguar-2.claimlynx.com:/usr/obj/usr/src/sys/GENERIC-CARP amd64
>Description:
I have discovered some odd routing behavior that seems to occur when VLANs
are used as members of a bridge. Specifically, it seems that static routes
do not function correctly.

Here is some background on the situation I have. I am building a new host
to replace our aging (running 8.0) firewall. The new machine I am building
has a single ethernet interface (re driver, but over the course of
troubleshooting I've used sk and igb ethernet adapters), so I am using
VLANs to segment traffic. The 'LAN' VLAN on my setup uses interface
vlan500, with the 'WAN' on vlan200. The firewall also has an OpenVPN tunnel
to our data center, operating in bridged mode on interface tap0. vlan500
and tap0 are both members of bridge0, allowing the LANs at our office and
data center to talk on the same subnet, 172.31.0.0/16. In this
configuration, I am able to connect from the office lan to hosts on the
data center lan. The openvpn server at the datacenter (separate host from
the firewall) pushes out a route for the dc production subnet upon connect.

The logical configuration looks something like this:

(office lan)<->[vlan500|bridge0|tap0]<-vpn->(dc lan)<->[dc firewall]<->(dc production subnet)
               [      firewall      ]
[      common 172.31.0.0/16 subnet throughout      ]                   [ 100.100.100.128/26 ]

For the sake of reference, here are the relevant IP addresses:

    172.31.0.252 - local firewall vlan500
    172.31.0.254 - local firewall lan carp
    172.31.5.1   - data center firewall

The problem seems to exist with the route to the production subnet at the
data center. When the openvpn connection comes up, the route is installed
in the routing table as expected. However, attempts to connect to hosts on
this network result in instantaneous failure; not even a host unreachable.
For example:

    ~-> ping hostfoo
    PING hostfoo.claimlynx.com (100.100.100.149): 56 data bytes
    ping: sendto: Invalid argument

Here is the output of 'netstat -rn' on this host:

root@shawshank-1:~-> netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags   Refs      Use  Netif Expire
default            10.8.20.1          UGS        4   124778 vlan20
172.31.0.0/16      link#12            U          3    56103 vlan50
172.31.0.252       link#12            UHS        0        0    lo0
172.31.0.254       link#13            UH         0        0 carp10
172.31.3.5         link#8             UHS        0        0    lo0
10.8.20.0/24       link#9             U          0       33 vlan20
10.8.20.252        link#9             UHS        0        0    lo0
10.8.20.254        link#14            UH         0        0 carp20
10.8.30.0/24       link#10            U          0        0 vlan30
10.8.30.252        link#10            UHS        0        0    lo0
10.8.30.254        link#15            UH         0        0 carp30
10.8.40.0/24       link#11            U          0        0 vlan40
10.8.40.252        link#11            UHS        0        0    lo0
127.0.0.1          link#7             UH         0        0    lo0
100.100.100.128/26 172.31.5.1         UGS        0    21466   tap0

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::1                               ::1                           UH           lo0
fe80::%lo0/64                     link#7                        U            lo0
fe80::1%lo0                       link#7                        UHS          lo0
ff01:7::/32                       fe80::1%lo0                   U            lo0
ff02::%lo0/32                     fe80::1%lo0                   U            lo0

As you can see, the routing table shows the 172.31.0.0/16 subnet route on
the vlan500 interface, and puts the 100.100.100.128/26 production subnet
route on the tap0 interface.

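To make the interface mismatch in this table concrete, here is a minimal
Python sketch of a longest-prefix lookup over a subset of the IPv4 routes
listed above. It is only a model: the helper names (lookup,
egress_interfaces) are made up for illustration, it does not reproduce the
FreeBSD kernel's route or ARP handling, and whether this mismatch is what
actually produces the "Invalid argument" error is an assumption, not
something the netstat output proves.

```python
import ipaddress

# Subset of the IPv4 table above as (destination, gateway, netif).
# gateway=None marks a directly connected (link#N) route.
ROUTES = [
    ("0.0.0.0/0", "10.8.20.1", "vlan20"),          # "default"
    ("172.31.0.0/16", None, "vlan50"),
    ("10.8.20.0/24", None, "vlan20"),
    ("100.100.100.128/26", "172.31.5.1", "tap0"),
]


def lookup(addr, routes):
    """Return the most specific (longest-prefix) route that covers addr."""
    ip = ipaddress.ip_address(addr)
    matches = [r for r in routes if ip in ipaddress.ip_network(r[0])]
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


def egress_interfaces(addr, routes):
    """Return (interface of the route to addr, interface used to reach its gateway)."""
    _, gateway, netif = lookup(addr, routes)
    if gateway is None:
        return netif, netif  # on-link destination, no next hop to resolve
    _, _, gw_netif = lookup(gateway, routes)
    return netif, gw_netif


if __name__ == "__main__":
    # Route says tap0, but the gateway 172.31.5.1 resolves via the /16 on vlan50.
    print(egress_interfaces("100.100.100.149", ROUTES))

    # Hypothetical variant: a more specific host route for the gateway on tap0.
    with_host_route = ROUTES + [("172.31.5.1/32", None, "tap0")]
    print(egress_interfaces("100.100.100.149", with_host_route))
```

In this model the production route points at tap0 while its gateway is
matched by the 172.31.0.0/16 route on vlan50, so the two interfaces
disagree. A more specific host route for the gateway on tap0 makes them
agree.
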
While troubleshooting this, my hunch was that perhaps the system was
choking because the next-hop for the production route was on a network
(172.31.0.0/16) that is not reachable via tap0 (in actuality it is). To
test this, I inserted a host route for the next hop:

    route add 172.31.5.1 -interface tap0

Adding this route resolves the condition, but it seems like a hacky fix.

In comparison, the firewall that I am replacing uses the same
lan/bridge/tap setup, but the machine has physical ethernet interfaces for
all segments, rather than the vlans that my new setup uses. The existing
setup works fine, without the need to add a host route. Here is the routing
table for the existing firewall:

tom@shawshank:~-> netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags   Refs      Use  Netif Expire
default            74.95.66.26        UGS        7  5043426   fxp2
172.31.0.0/16      link#2             U          4 70728235   fxp1
172.31.0.1         link#2             UHS        0  3870772    lo0
172.31.3.4         link#8             UHS        0        0    lo0
74.95.66.24/30     link#3             U          0     1243   fxp2
74.95.66.25        link#3             UHS        0        9    lo0
127.0.0.1          link#6             UH         0  1140570    lo0
192.168.50.0/24    link#1             U          0        0   fxp0
192.168.50.4       link#1             UHS        0        0    lo0
100.100.100.128/26 172.31.5.1         UGS        0    19877   fxp1

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::1                               ::1                           UH           lo0
fe80::%lo0/64                     link#6                        U            lo0
fe80::1%lo0                       link#6                        UHS          lo0
ff01:6::/32                       fe80::1%lo0                   U            lo0
ff02::%lo0/32                     fe80::1%lo0                   U            lo0

The noteworthy difference between the two routing tables is that the
production route on the old firewall is put on the LAN interface (fxp1).
>How-To-Repeat:
This situation occurs every time this host is booted.
>Fix:
The workaround I have found is to add a host route for the next-hop to the
tap0 interface. This seems to work alright, but I want to make sure that
this isn't a symptom of a bug in the vlan driver or elsewhere.
>Release-Note:
>Audit-Trail:
>Unformatted: