Date:      Fri, 20 Dec 2019 11:31:59 +0000
From:      Goran Mekić <meka@tilda.center>
To:        freebsd-net@freebsd.org, Marko Zec <zec@fer.hr>, "Patrick M. Hausen" <hausen@punkt.de>
Cc:        Kristof Provost <kp@eurobsdcon.org>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject:   Re: Continuing problems in a bridged VNET setup
Message-ID:  <1AB8ACD6-0FF0-487C-963D-3A1B05288FD9@tilda.center>
In-Reply-To: <20191220122256.76942c07@x23>
References:  <BD4018F8-0BB7-4EA9-A726-F6383E9AC892@punkt.de> <20191220122256.76942c07@x23>

On December 20, 2019 11:22:01 AM UTC, Marko Zec <zec@fer.hr> wrote:
>Perhaps you could ditch if_bridge(4) and epair(4), and try ng_eiface(4)
>with ng_bridge(4) instead?  Works rock-solid 24/7 here on 11.2 / 11.3.
>
>Marko
>
>On Fri, 20 Dec 2019 11:19:24 +0100
>"Patrick M. Hausen" <hausen@punkt.de> wrote:
>
>> Hi all,
>>
>> we still experience occasional network outages in production,
>> yet have not been able to find the root cause.
>>
>> We run around 50 servers with VNET jails, some of them with
>> a handful, the busiest ones with 50 or more jails each.
>>
>> Every now and then the jails are not reachable over the net
>> anymore. The server itself is up and running, all jails are
>> up and running, one can ssh to the server but none of the
>> jails can communicate over the network.
>>
>> There seems to be no pattern to the time of occurrence except
>> that more jails on one system make it "more likely".
>> Also having more than one bridge, e.g. for private networks
>> between jails, seems to increase the probability.
>> When a server shows the problem it tends to get into the state
>> rather frequently, a couple of hours in between. Then again
>> most servers run for weeks without exhibiting the problem.
>> That's what makes it so hard to reproduce. The last couple of
>> days one system was failing regularly until we reduced the number
>> of jails from around 80 to around 50. Now it seems stable again.
>>
>> I have a test system with lots of jails that I work with gatling
>> that did not show a single failure so far :-(
>>
>>
>> Setup:
>>
>> All jails are iocage jails with VNET interfaces. They are
>> connected to at least one bridge that starts with the
>> physical external interface as a member and gets jails'
>> epair interfaces added as they start up. All jails are managed
>> by iocage.
>>
>> ifconfig_igb0="-rxcsum -rxcsum6 -txcsum -txcsum6 -vlanhwtag -vlanhwtso up"
>> cloned_interfaces="bridge0"
>> ifconfig_bridge0_name="inet0"
>> ifconfig_inet0="addm igb0 up"
>> ifconfig_inet0_ipv6="inet6 <host-address>/64 auto_linklocal"
>>
>> $ iocage get interfaces vpro0087
>> vnet0:inet0
>>
>> $ ifconfig inet0
>> inet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>> 	ether 90:1b:0e:63:ef:51
>> 	inet6 fe80::921b:eff:fe63:ef51%inet0 prefixlen 64 scopeid 0x4
>> 	inet6 <host-address> prefixlen 64
>> 	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
>> 	groups: bridge
>> 	id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
>> 	maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
>> 	root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
>> 	member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>> 	        ifmaxaddr 0 port 7 priority 128 path cost 2000
>> 	member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>> 	        ifmaxaddr 0 port 6 priority 128 path cost 2000
>> 	member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>> 	        ifmaxaddr 0 port 1 priority 128 path cost 2000000
>>
>>
>> What we tried:
>>
>> At first we suspected the bridge to become "wedged" somehow. This was
>> corroborated by talking to various people at devsummits and EuroBSDCon,
>> with Kristof Provost specifically suggesting that if_bridge was
>> still under giant lock and there might be a problem here that the
>> lock is not released under some race condition and then the entire
>> bridge subsystem would be stalled. That sounds plausible given the
>> random occurrence.
>>
>> But I think we can rule out that one, because:
>>
>> - ifconfig up/down does not help
>> - the host is still communicating fine over the same bridge interface
>> - tearing down the bridge, kldunload (!) of if_bridge.ko followed by
>>   a new kldload and reconstructing the members with `ifconfig addm`
>>   does not help, either
>> - only a host reboot restores function
>>
>> Finally I created a jail not managed by iocage on the problem host.
>> Please ignore the `iocage` in the path, I used it to populate the
>> root directory. But it is not started by iocage at boot time and
>> the manual config is this:
>>
>> testjail {
>>         host.hostname = "testjail";   # hostname
>>         path = "/iocage/jails/testjail/root";     # root directory
>>         exec.clean;
>>         exec.system_user = "root";
>>         exec.jail_user = "root";
>>         vnet;
>>         vnet.interface = "epair999b";
>>         exec.prestart += "ifconfig epair999 create; ifconfig epair999a inet6 2A00:B580:8000:8000::1/64 auto_linklocal";
>>         exec.poststop += "sleep 2; ifconfig epair999a destroy; sleep 2";
>>         # Standard stuff
>>         exec.start += "/bin/sh /etc/rc";
>>         exec.stop = "/bin/sh /etc/rc.shutdown";
>>         exec.consolelog = "/var/log/jail_testjail_console.log";
>>         mount.devfs;          #mount devfs
>>         allow.raw_sockets;    #allow ping-pong
>>         devfs_ruleset="4";    #devfs ruleset for this jail
>> }
>>
>> $ cat /iocage/jails/testjail/root/etc/rc.conf
>> hostname="testjail"
>>
>> ifconfig_epair999b_ipv6="inet6 2A00:B580:8000:8000::2/64 auto_linklocal"
>>
>> When I do `service jail onestart testjail` I can then ping6 the jail
>> from the host and the host from the jail. As you can see the
>> if_bridge is not involved in this traffic.
>>
>> When the host is in the wedged state and I start this testjail the
>> same way, no communication across the epair interface is possible.
>>
>> To me this seems to indicate that not the bridge but all epair
>> interfaces stop working at the very same time.
>>
>>
>> OS is RELENG_11_3, hardware and specifically network adapters vary,
>> we have igb, ix, ixl, bnxt ...
>>
>>
>> Does anyone have a suggestion what diagnostic measures could help to
>> pinpoint the culprit? The random occurrence and the fact that the
>> problem seems to prefer the production environment only makes this a
>> real pain ...
>>
>>
>> Thanks and kind regards,
>> Patrick
>

Does it work with pf?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
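
Editorial aside on the diagnostics question in the quoted message: one low-effort measure is to record which interfaces are bridge members before and after an outage, so the two states can be compared. The following is a minimal sketch, assuming POSIX sh and awk. The function name `bridge_members` is hypothetical and not from the thread, and the sample text is the decoded `ifconfig inet0` listing from Patrick's message, shortened to the relevant lines.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the member interfaces
# of a bridge, given `ifconfig <bridge>` output on standard input.
bridge_members() {
    # Member lines look like: "member: igb0 flags=143<LEARNING,...>"
    awk '$1 == "member:" { print $2 }'
}

# Sample input taken from the listing in the quoted message (no root needed).
sample='inet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	groups: bridge
	member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 7 priority 128 path cost 2000
	member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 6 priority 128 path cost 2000
	member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
	        ifmaxaddr 0 port 1 priority 128 path cost 2000000'

# Prints one member name per line.
printf '%s\n' "$sample" | bridge_members
```

On a live host the same function could be fed with `ifconfig inet0 | bridge_members`. This only lists the members and says nothing about whether the epair interfaces behind them still pass traffic.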

Date:      Fri, 20 Dec 2019 11:43:28 +0000
From:      Marko Zec <zec@fer.hr>
To:        Goran Mekić <meka@tilda.center>
Cc:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, "Patrick M. Hausen" <hausen@punkt.de>, Kristof Provost <kp@eurobsdcon.org>
Subject:   Re: Continuing problems in a bridged VNET setup
Message-ID:  <20191220124422.11c03f5c@x23>
In-Reply-To: <1AB8ACD6-0FF0-487C-963D-3A1B05288FD9@tilda.center>
References:  <BD4018F8-0BB7-4EA9-A726-F6383E9AC892@punkt.de> <20191220122256.76942c07@x23> <1AB8ACD6-0FF0-487C-963D-3A1B05288FD9@tilda.center>

On Fri, 20 Dec 2019 11:31:59 +0000
Goran Mekić <meka@tilda.center> wrote:

> On December 20, 2019 11:22:01 AM UTC, Marko Zec <zec@fer.hr> wrote:
> >Perhaps you could ditch if_bridge(4) and epair(4), and try
> >ng_eiface(4) with ng_bridge(4) instead?  Works rock-solid 24/7 here
> >on 11.2 / 11.3.
>
> Does it work with pf?

In the particular production setup I was referring to we use ipfw, so
can't share any 1st-hand experiences with pf.

Marko
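
Editorial aside: the thread does not show what the suggested ng_bridge(4) / ng_eiface(4) setup looks like. The following is a rough sketch only, not Marko's actual configuration. It prints the ngctl(8) commands instead of running them, so it can be reviewed without root. The command sequence follows the pattern of the `ether.bridge` example shipped with FreeBSD as far as I recall it; the function name `ng_bridge_plan`, the node name `ngbr0` and the link numbers are assumptions, and everything should be checked against ng_bridge(4), ng_ether(4) and ng_eiface(4) before use.

```shell
#!/bin/sh
# Sketch only: print (do not run) ngctl(8) commands that would attach an
# ng_bridge(4) node to a physical NIC and create one ng_eiface(4) on it.
# Function name, bridge node name and link numbers are hypothetical.
ng_bridge_plan() {
    nic=$1      # physical interface, e.g. igb0 (assumes ng_ether is loaded)
    bridge=$2   # name to give the new ng_bridge node
    cat <<EOF
ngctl mkpeer ${nic}: bridge lower link0
ngctl name ${nic}:lower ${bridge}
ngctl connect ${nic}: ${bridge}: upper link1
ngctl msg ${nic}: setpromisc 1
ngctl msg ${nic}: setautosrc 0
ngctl mkpeer ${bridge}: eiface link2 ether
EOF
}

# Show the plan for the NIC named in the original message.
ng_bridge_plan igb0 ngbr0
```

The last printed command would create a virtual Ethernet interface (by default named along the lines of `ngeth0`), which could then be handed to a jail with `ifconfig <interface> vnet <jail>` in place of the epair half. Each further jail would need its own eiface on a fresh `linkN` hook.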


