Date: Fri, 20 Dec 2019 11:22:01 +0000
From: Marko Zec <zec@fer.hr>
To: "Patrick M. Hausen" <hausen@punkt.de>
Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, Kristof Provost <kp@eurobsdcon.org>
Subject: Re: Continuing problems in a bridged VNET setup
Message-ID: <20191220122256.76942c07@x23>
In-Reply-To: <BD4018F8-0BB7-4EA9-A726-F6383E9AC892@punkt.de>
References: <BD4018F8-0BB7-4EA9-A726-F6383E9AC892@punkt.de>

Perhaps you could ditch if_bridge(4) and epair(4), and try ng_eiface(4)
with ng_bridge(4) instead? Works rock-solid 24/7 here on 11.2 / 11.3.

Marko

On Fri, 20 Dec 2019 11:19:24 +0100
"Patrick M. Hausen" <hausen@punkt.de> wrote:

> Hi all,
>
> we still experience occasional network outages in production,
> yet have not been able to find the root cause.
>
> We run around 50 servers with VNET jails, some of them with
> a handful, the busiest ones with 50 or more jails each.
>
> Every now and then the jails are no longer reachable over the
> net. The server itself is up and running, all jails are
> up and running, one can ssh to the server, but none of the
> jails can communicate over the network.
>
> There seems to be no pattern to the time of occurrence except
> that more jails on one system make it "more likely".
> Also, having more than one bridge, e.g. for private networks
> between jails, seems to increase the probability.
> When a server shows the problem it tends to get into the state
> rather frequently, with a couple of hours in between. Then again,
> most servers run for weeks without exhibiting the problem.
> That's what makes it so hard to reproduce. The last couple of
> days one system was failing regularly until we reduced the number
> of jails from around 80 to around 50. Now it seems stable again.
>
> I have a test system with lots of jails that I work with gatling
> that did not show a single failure so far :-(
>
>
> Setup:
>
> All jails are iocage jails with VNET interfaces. They are
> connected to at least one bridge that starts with the
> physical external interface as a member and gets jails'
> epair interfaces added as they start up. All jails are managed
> by iocage.
>
> ifconfig_igb0="-rxcsum -rxcsum6 -txcsum -txcsum6 -vlanhwtag -vlanhwtso up"
> cloned_interfaces="bridge0"
> ifconfig_bridge0_name="inet0"
> ifconfig_inet0="addm igb0 up"
> ifconfig_inet0_ipv6="inet6 <host-address>/64 auto_linklocal"
>
> $ iocage get interfaces vpro0087
> vnet0:inet0
>
> $ ifconfig inet0
> inet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>         ether 90:1b:0e:63:ef:51
>         inet6 fe80::921b:eff:fe63:ef51%inet0 prefixlen 64 scopeid 0x4
>         inet6 <host-address> prefixlen 64
>         nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
>         groups: bridge
>         id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
>         maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
>         root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
>         member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>                 ifmaxaddr 0 port 7 priority 128 path cost 2000
>         member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>                 ifmaxaddr 0 port 6 priority 128 path cost 2000
>         member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>                 ifmaxaddr 0 port 1 priority 128 path cost 2000000
>
>
> What we tried:
>
> At first we suspected the bridge to become "wedged" somehow. This was
> corroborated by talking to various people at devsummits and EuroBSDCon,
> with Kristof Provost specifically suggesting that if_bridge was
> still under giant lock and there might be a problem here that the
> lock is not released under some race condition and then the entire
> bridge subsystem would be stalled. That sounds plausible given the
> random occurrence.
>
> But I think we can rule out that one, because:
>
> - ifconfig up/down does not help
> - the host is still communicating fine over the same bridge interface
> - tearing down the bridge, kldunload (!) of if_bridge.ko followed by
>   a new kldload and reconstructing the members with `ifconfig addm`
>   does not help, either
> - only a host reboot restores function
>
> Finally I created a jail that is not managed by iocage on the problem
> host. Please ignore the `iocage` in the path, I used it to populate the
> root directory. But it is not started by iocage at boot time and
> the manual config is this:
>
> testjail {
>         host.hostname = "testjail";                 # hostname
>         path = "/iocage/jails/testjail/root";       # root directory
>         exec.clean;
>         exec.system_user = "root";
>         exec.jail_user = "root";
>         vnet;
>         vnet.interface = "epair999b";
>         exec.prestart += "ifconfig epair999 create; ifconfig epair999a inet6 2A00:B580:8000:8000::1/64 auto_linklocal";
>         exec.poststop += "sleep 2; ifconfig epair999a destroy; sleep 2";
>         # Standard stuff
>         exec.start += "/bin/sh /etc/rc";
>         exec.stop = "/bin/sh /etc/rc.shutdown";
>         exec.consolelog = "/var/log/jail_testjail_console.log";
>         mount.devfs;          #mount devfs
>         allow.raw_sockets;    #allow ping-pong
>         devfs_ruleset="4";    #devfs ruleset for this jail
> }
>
> $ cat /iocage/jails/testjail/root/etc/rc.conf
> hostname="testjail"
>
> ifconfig_epair999b_ipv6="inet6 2A00:B580:8000:8000::2/64 auto_linklocal"
>
> When I do `service jail onestart testjail` I can then ping6 the jail
> from the host and the host from the jail. As you can see, the
> if_bridge is not involved in this traffic.
>
> When the host is in the wedged state and I start this testjail the
> same way, no communication across the epair interface is possible.
>
> To me this seems to indicate that not the bridge but all epair
> interfaces stop working at the very same time.
>
>
> OS is RELENG_11_3, hardware and specifically network adapters vary,
> we have igb, ix, ixl, bnxt ...
>
>
> Does anyone have a suggestion what diagnostic measures could help to
> pinpoint the culprit? The random occurrence and the fact that the
> problem seems to prefer the production environment only make this a
> real pain ...
>
>
> Thanks and kind regards,
> Patrick
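
For readers who want to try the ng_eiface(4) / ng_bridge(4) route suggested at the top of this message, here is a minimal sketch of what the netgraph plumbing might look like. It is not taken from either poster's configuration. The function name ng_bridge_setup, the node name ngbr0 and the RUN variable are hypothetical, and the sketch assumes that ng_ether, ng_bridge and ng_eiface are loaded and that igb0 is not also a member of an if_bridge. By default it only prints the ngctl(8) commands.

```shell
#!/bin/sh
# Sketch only. By default every command is printed, not executed.
# Run with RUN= (empty) as root on a FreeBSD host to actually execute.
RUN="${RUN-echo}"

# ng_bridge_setup PHYS BRIDGE NJAILS   (hypothetical helper, not a system tool)
ng_bridge_setup() {
    phys=$1 bridge=$2 njails=$3

    # Create an ng_bridge node on the NIC's "lower" hook (the wire side).
    $RUN ngctl mkpeer "${phys}:" bridge lower link0
    $RUN ngctl name "${phys}:lower" "${bridge}"
    # Attach the host's own stack via the "upper" hook so the host
    # stays reachable through the same NIC.
    $RUN ngctl connect "${phys}:" "${bridge}:" upper link1
    # The NIC has to accept frames for other MAC addresses and must not
    # overwrite the source address of frames sent on behalf of jails.
    $RUN ngctl msg "${phys}:" setpromisc 1
    $RUN ngctl msg "${phys}:" setautosrc 0

    # One ng_eiface node per jail; link0 and link1 are already in use.
    i=0
    while [ "$i" -lt "$njails" ]; do
        $RUN ngctl mkpeer "${bridge}:" eiface "link$((i + 2))" ether
        i=$((i + 1))
    done
}

ng_bridge_setup igb0 ngbr0 2
```

Each eiface node shows up as an ngethN interface on the host, which could then be handed to a jail with `ifconfig ngethN vnet <jailname>` in place of the epair "b" side. The jng script under /usr/share/examples/jails in the FreeBSD source tree is a more complete treatment of the same idea.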
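
On the question of diagnostic measures: nobody in this message names a specific check, so the following is only a guess at a reasonable starting point, not a known procedure for this problem. The sketch gathers a few read-only views (mbuf usage, netisr queues, per-interface drop counters, UMA zones, epair sysctls) that could be captured once on a healthy host and once on a wedged one, then diffed. The function name snapshot and the RUN variable are hypothetical. By default the commands are printed, not run.

```shell
#!/bin/sh
# Sketch only. Prints the command list by default; run with RUN= (empty)
# on the affected FreeBSD host to execute the commands instead.
RUN="${RUN-echo}"

snapshot() {
    $RUN date -u                 # timestamp for later comparison
    $RUN netstat -m              # mbuf and cluster usage, denied requests
    $RUN netstat -Q              # netisr(9) queue lengths and drops
    $RUN netstat -i -d -n        # per-interface errors and drops
    $RUN vmstat -z               # UMA zone usage and allocation failures
    $RUN sysctl net.link.epair   # epair(4) tunables as present on 11.x
    $RUN ifconfig -a             # interface flags and bridge membership
}

snapshot
```

A possible use, assuming the script is saved as snapshot.sh: `RUN= sh snapshot.sh > /var/tmp/healthy.txt` while the host works, the same into a second file once the jails stop responding, then `diff` the two files.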