From: "Patrick M. Hausen" <hausen@punkt.de>
Subject: Continuing problems in a bridged VNET setup
Message-Id: <BD4018F8-0BB7-4EA9-A726-F6383E9AC892@punkt.de>
Date: Fri, 20 Dec 2019 11:19:24 +0100
Cc: Kristof Provost <kp@eurobsdcon.org>
To: freebsd-net@freebsd.org

Hi all,

we still experience occasional network outages in production,
yet have not been able to find the root cause.

We run around 50 servers with VNET jails, some of them with
a handful, the busiest ones with 50 or more jails each.

Every now and then the jails are no longer reachable over the
network. The server itself is up and running, all jails are
up and running, one can ssh to the server but none of the
jails can communicate over the network.

There seems to be no pattern to the time of occurrence except
that more jails on one system make it "more likely".
Also, having more than one bridge, e.g. for private networks
between jails, seems to increase the probability.
When a server shows the problem it tends to get into the state
rather frequently, with only a couple of hours in between. Then again,
most servers run for weeks without exhibiting the problem.
That's what makes it so hard to reproduce. The last couple of
days one system was failing regularly until we reduced the number
of jails from around 80 to around 50. Now it seems stable again.

I have a test system with lots of jails that I put under load with
gatling, and it has not shown a single failure so far :-(


Setup:

All jails are VNET jails managed by iocage. Each one is
connected to at least one bridge that starts with the
physical external interface as a member and gets the jails'
epair interfaces added as they start up.

ifconfig_igb0="-rxcsum -rxcsum6 -txcsum -txcsum6 -vlanhwtag -vlanhwtso up"
cloned_interfaces="bridge0"
ifconfig_bridge0_name="inet0"
ifconfig_inet0="addm igb0 up"
ifconfig_inet0_ipv6="inet6 <host-address>/64 auto_linklocal"

$ iocage get interfaces vpro0087
vnet0:inet0

$ ifconfig inet0
inet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	ether 90:1b:0e:63:ef:51
	inet6 fe80::921b:eff:fe63:ef51%inet0 prefixlen 64 scopeid 0x4
	inet6 <host-address> prefixlen 64
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
	groups: bridge
	id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
	maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
	root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
	member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 7 priority 128 path cost 2000
	member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 6 priority 128 path cost 2000
	member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 1 priority 128 path cost 2000000
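
To compare the member list of a bridge between a healthy and a wedged
host, something like the following could be used. It is only a sketch:
`bridge_members` is a made-up helper name, and it assumes the `member:`
line format shown in the ifconfig output above.

```shell
#!/bin/sh
# bridge_members: reads `ifconfig <bridge>` output on stdin and prints
# one member interface name per line.
# Assumption: member lines look like "member: <ifname> flags=...".
bridge_members() {
    awk '$1 == "member:" { print $2 }'
}
```

Usage would be `ifconfig inet0 | bridge_members`; saving the result on
both kinds of host and running diff on the two files shows whether any
epair was dropped from the bridge.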


What we tried:

At first we suspected that the bridge was becoming "wedged" somehow. This was
supported by conversations with various people at devsummits and EuroBSDCon,
with Kristof Provost specifically suggesting that if_bridge is
still under a giant lock and that, under some race condition, the lock might
not be released, which would stall the entire bridge subsystem.
That sounds plausible given the random occurrence.

But I think we can rule that one out, because:

- ifconfig up/down does not help
- the host is still communicating fine over the same bridge interface
- tearing down the bridge, kldunload (!) of if_bridge.ko followed by
  a new kldload and reconstructing the members with `ifconfig addm`
  does not help, either
- only a host reboot restores function

Finally I created a jail on the problem host that is not managed by iocage.
Please ignore the `iocage` in the path; I only used it to populate the
root directory. The jail is not started by iocage at boot time, and
the manual config is this:

testjail {
        host.hostname = "testjail";   # hostname
        path = "/iocage/jails/testjail/root";     # root directory
        exec.clean;
        exec.system_user = "root";
        exec.jail_user = "root";
        vnet;
        vnet.interface = "epair999b";
        exec.prestart += "ifconfig epair999 create; ifconfig epair999a inet6 2A00:B580:8000:8000::1/64 auto_linklocal";
        exec.poststop += "sleep 2; ifconfig epair999a destroy; sleep 2";

        # Standard stuff
        exec.start += "/bin/sh /etc/rc";
        exec.stop = "/bin/sh /etc/rc.shutdown";
        exec.consolelog = "/var/log/jail_testjail_console.log";
        mount.devfs;          #mount devfs
        allow.raw_sockets;    #allow ping-pong
        devfs_ruleset="4";    #devfs ruleset for this jail
}

$ cat /iocage/jails/testjail/root/etc/rc.conf
hostname="testjail"

ifconfig_epair999b_ipv6="inet6 2A00:B580:8000:8000::2/64 auto_linklocal"

When I do `service jail onestart testjail` I can then ping6 the jail from
the host and the host from the jail. As you can see, if_bridge is not
involved in this traffic.

When the host is in the wedged state and I start this testjail the same
way, no communication across the epair interface is possible.

To me this seems to indicate that it is not the bridge but all epair
interfaces that stop working at the very same time.
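
The manual check with the test jail could be turned into a small probe
that runs periodically and records when the epair path stops answering.
This is a minimal sketch, not something we run today: `probe`,
`check_once` and the `PROBE_CMD` override are hypothetical names, and
the target would be the jail address from the config above.

```shell
#!/bin/sh
# Default probe: a single ICMPv6 echo request, output discarded.
probe() {
    ping6 -c 1 "$1" >/dev/null 2>&1
}

# check_once TARGET
# Prints one line: UTC timestamp, "ok" or "FAIL", and the target.
# PROBE_CMD (hypothetical) can replace the default probe, e.g. for testing.
check_once() {
    target=$1
    if ${PROBE_CMD:-probe} "$target"; then
        status=ok
    else
        status=FAIL
    fi
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $status $target"
}
```

Called from cron on the host as `check_once 2A00:B580:8000:8000::2`
with the output appended to a log file, the first FAIL line would give
the time at which the epair stopped passing traffic.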


The OS is RELENG_11_3; hardware and specifically network adapters vary, we have
igb, ix, ixl, bnxt ...


Does anyone have a suggestion as to what diagnostic measures could help to
pinpoint the culprit? The random occurrence and the fact that the problem
seems to show up only in the production environment make this a real pain ...
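
One thing we could do ourselves is capture the same set of counters on
a healthy and on a wedged host and diff them afterwards. A sketch only:
`snapshot` is a hypothetical helper, and which commands are worth
capturing is exactly the open question.

```shell
#!/bin/sh
# snapshot OUTDIR LABEL CMD [ARGS...]
# Runs CMD and stores its stdout and stderr in OUTDIR/LABEL.txt.
# Returns the exit status of CMD (or 1 if OUTDIR cannot be created).
snapshot() {
    outdir=$1
    label=$2
    shift 2
    mkdir -p "$outdir" || return 1
    "$@" > "$outdir/$label.txt" 2>&1
}
```

My guesses for a first round (assumptions, not a known-good list) would
be `snapshot /var/tmp/wedged mbufs netstat -m`,
`snapshot /var/tmp/wedged netisr netstat -Q` and
`snapshot /var/tmp/wedged zones vmstat -z`, with the directory name
chosen freely.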


Thanks and kind regards,
Patrick
-- 
punkt.de GmbH
Patrick M. Hausen
.infrastructure

Kaiserallee 13a
76133 Karlsruhe

Tel. +49 721 9109500

https://infrastructure.punkt.de
info@punkt.de

AG Mannheim 108285
Managing Directors: Jürgen Egeling, Daniel Lienert, Fabian Stein