Date: Fri, 20 Dec 2019 11:31:59 +0000
From: Goran Mekić <meka@tilda.center>
To: freebsd-net@freebsd.org, Marko Zec <zec@fer.hr>, "Patrick M. Hausen" <hausen@punkt.de>
Cc: Kristof Provost <kp@eurobsdcon.org>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject: Re: Continuing problems in a bridged VNET setup
Message-ID: <1AB8ACD6-0FF0-487C-963D-3A1B05288FD9@tilda.center>
In-Reply-To: <20191220122256.76942c07@x23>
References: <BD4018F8-0BB7-4EA9-A726-F6383E9AC892@punkt.de> <20191220122256.76942c07@x23>

On December 20, 2019 11:22:01 AM UTC, Marko Zec <zec@fer.hr> wrote:
>Perhaps you could ditch if_bridge(4) and epair(4), and try ng_eiface(4)
>with ng_bridge(4) instead?  Works rock-solid 24/7 here on 11.2 / 11.3.
>
>Marko
>
>On Fri, 20 Dec 2019 11:19:24 +0100
>"Patrick M. Hausen" <hausen@punkt.de> wrote:
>
>> Hi all,
>>
>> we still experience occasional network outages in production,
>> yet have not been able to find the root cause.
>>
>> We run around 50 servers with VNET jails, some of them with
>> a handful, the busiest ones with 50 or more jails each.
>>
>> Every now and then the jails are not reachable over the net
>> anymore. The server itself is up and running, all jails are
>> up and running, one can ssh to the server but none of the
>> jails can communicate over the network.
>>
>> There seems to be no pattern to the time of occurrence except
>> that more jails on one system make it "more likely".
>> Also having more than one bridge, e.g. for private networks
>> between jails, seems to increase the probability.
>> When a server shows the problem it tends to get into the state
>> rather frequently, a couple of hours in between. Then again
>> most servers run for weeks without exhibiting the problem.
>> That's what makes it so hard to reproduce. The last couple of
>> days one system was failing regularly until we reduced the number
>> of jails from around 80 to around 50. Now it seems stable again.
>>
>> I have a test system with lots of jails that I work with gatling
>> that did not show a single failure so far :-(
>>
>>
>> Setup:
>>
>> All jails are iocage jails with VNET interfaces. They are
>> connected to at least one bridge that starts with the
>> physical external interface as a member and gets jails'
>> epair interfaces added as they start up. All jails are managed
>> by iocage.
>>
>> ifconfig_igb0="-rxcsum -rxcsum6 -txcsum -txcsum6 -vlanhwtag
>> -vlanhwtso up"
>> cloned_interfaces="bridge0"
>> ifconfig_bridge0_name="inet0"
>> ifconfig_inet0="addm igb0 up"
>> ifconfig_inet0_ipv6="inet6 <host-address>/64 auto_linklocal"
>>
>> $ iocage get interfaces vpro0087
>> vnet0:inet0
>>
>> $ ifconfig inet0
>> inet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0
>> mtu 1500 ether 90:1b:0e:63:ef:51
>> inet6 fe80::921b:eff:fe63:ef51%inet0 prefixlen 64 scopeid 0x4
>> inet6 <host-address> prefixlen 64
>> nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
>> groups: bridge
>> id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
>> maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
>> root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
>> member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>> ifmaxaddr 0 port 7 priority 128 path cost 2000
>> member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>> ifmaxaddr 0 port 6 priority 128 path cost 2000
>> member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>> ifmaxaddr 0 port 1 priority 128 path cost 2000000
>>
>>
>> What we tried:
>>
>> At first we suspected the bridge to become "wedged" somehow. This was
>> corroborated by talking to various people at devsummits and EuroBSDCon
>> with Kristof Provost specifically suggesting that if_bridge was
>> still under giant lock and there might be a problem here that the
>> lock is not released under some race condition and then the entire
>> bridge subsystem would be stalled. That sounds plausible given the
>> random occurrence.
>>
>> But I think we can rule out that one, because:
>>
>> - ifconfig up/down does not help
>> - the host is still communicating fine over the same bridge interface
>> - tearing down the bridge, kldunload (!) of if_bridge.ko followed by
>> a new kldload and reconstructing the members with `ifconfig addm`
>> does not help, either
>> - only a host reboot restores function
>>
>> Finally I created a not iocage managed jail on the problem host.
>> Please ignore the `iocage` in the path, I used it to populate the
>> root directory. But it is not started by iocage at boot time and
>> the manual config is this:
>>
>> testjail {
>> host.hostname = "testjail"; # hostname
>> path = "/iocage/jails/testjail/root"; # root directory
>> exec.clean;
>> exec.system_user = "root";
>> exec.jail_user = "root";
>> vnet;
>> vnet.interface = "epair999b";
>> exec.prestart += "ifconfig epair999 create; ifconfig
>> epair999a inet6 2A00:B580:8000:8000::1/64 auto_linklocal";
>> exec.poststop += "sleep 2; ifconfig epair999a destroy; sleep 2";
>> # Standard stuff
>> exec.start += "/bin/sh /etc/rc";
>> exec.stop = "/bin/sh /etc/rc.shutdown";
>> exec.consolelog = "/var/log/jail_testjail_console.log";
>> mount.devfs; #mount devfs
>> allow.raw_sockets; #allow ping-pong
>> devfs_ruleset="4"; #devfs ruleset for this jail
>> }
>>
>> $ cat /iocage/jails/testjail/root/etc/rc.conf
>> hostname="testjail"
>>
>> ifconfig_epair999b_ipv6="inet6 2A00:B580:8000:8000::2/64
>> auto_linklocal"
>>
>> When I do `service jail onestart testjail` I can then ping6 the jail
>> from the host and the host from the jail. As you can see the
>> if_bridge is not involved in this traffic.
>>
>> When the host is in the wedged state and I start this testjail the
>> same way, no communication across the epair interface is possible.
>>
>> To me this seems to indicate that not the bridge but all epair
>> interfaces stop working at the very same time.
>>
>>
>> OS is RELENG_11_3, hardware and specifically network adapters vary,
>> we have igb, ix, ixl, bnxt ...
>>
>>
>> Does anyone have a suggestion what diagnostic measures could help to
>> pinpoint the culprit? The random occurrence and the fact that the
>> problem seems to prefer the production environment only makes this a
>> real pain ...
>>
>>
>> Thanks and kind regards,
>> Patrick
>
>_______________________________________________
>freebsd-net@freebsd.org mailing list
>https://lists.freebsd.org/mailman/listinfo/freebsd-net
>To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

Does it work with pf?

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

From owner-freebsd-net@freebsd.org Fri Dec 20 11:43:32 2019
From: Marko Zec <zec@fer.hr>
To: Goran Mekić <meka@tilda.center>
CC: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, "Patrick M. Hausen" <hausen@punkt.de>, Kristof Provost <kp@eurobsdcon.org>
Subject: Re: Continuing problems in a bridged VNET setup
Date: Fri, 20 Dec 2019 11:43:28 +0000
Message-ID: <20191220124422.11c03f5c@x23>
In-Reply-To: <1AB8ACD6-0FF0-487C-963D-3A1B05288FD9@tilda.center>
References: <BD4018F8-0BB7-4EA9-A726-F6383E9AC892@punkt.de> <20191220122256.76942c07@x23> <1AB8ACD6-0FF0-487C-963D-3A1B05288FD9@tilda.center>

On Fri, 20 Dec 2019 11:31:59 +0000
Goran Mekić <meka@tilda.center> wrote:

> On December 20, 2019 11:22:01 AM UTC, Marko Zec <zec@fer.hr> wrote:
> >Perhaps you could ditch if_bridge(4) and epair(4), and try
> >ng_eiface(4) with ng_bridge(4) instead?  Works rock-solid 24/7 here
> >on 11.2 / 11.3.
>
> Does it work with pf?

In the particular production setup I was referring to we use ipfw, so
can't share any 1st-hand experiences with pf.

Marko
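
[Editor's sketch, not part of the thread] The ng_eiface(4) / ng_bridge(4)
alternative suggested in this thread could look roughly like the following.
This is a minimal dry-run sketch under stated assumptions: the NIC name igb0
is taken from the original post, the node name jailbr0 and the link numbers
are hypothetical, and ng_ether(4) is assumed to be loaded so that the igb0:
node exists. The script only prints the ngctl(8) commands; it is not the
setup Marko actually runs, and whether it interacts well with pf is exactly
the question left open above.

```shell
#!/bin/sh
# Dry-run sketch: print the ngctl(8) commands that would put a physical NIC
# behind an ng_bridge(4) node and create one ng_eiface(4) interface on it.
# Nothing is executed here; review the output before running it on a host.

NIC="${NIC:-igb0}"           # physical interface, as in the original post
BRIDGE="${BRIDGE:-jailbr0}"  # hypothetical netgraph node name (assumption)

emit_bridge_cmds() {
    # Create the bridge on the NIC's "lower" hook (frames to/from the wire).
    echo "ngctl mkpeer ${NIC}: bridge lower link0"
    echo "ngctl name ${NIC}:lower ${BRIDGE}"
    # Attach the NIC's "upper" hook so the host stack stays connected.
    echo "ngctl connect ${NIC}: ${BRIDGE}: upper link1"
    # Let the NIC pass frames carrying the jails' MAC addresses unchanged.
    echo "ngctl msg ${NIC}: setpromisc 1"
    echo "ngctl msg ${NIC}: setautosrc 0"
}

emit_eiface_cmds() {
    # $1 = unused bridge link number (hypothetical choice by the caller).
    echo "ngctl mkpeer ${BRIDGE}: eiface link$1 ether"
}

emit_bridge_cmds
emit_eiface_cmds 2
```

Each eiface node shows up on the host as an ngethN interface, which would
then be handed to a jail in place of the epair "b" side, for example via
vnet.interface in jail.conf. The N is assigned by the kernel, so the sketch
does not guess it.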