From: Harry Schmalzbauer
Organization: OmniLAN
Date: Fri, 17 Nov 2017 17:56:45 +0100
To: freebsd-net@freebsd.org
Subject: netmap/vale periodic deadlock
Message-ID: <5A0F14CD.3040407@omnilan.de>
List-Id: Networking and TCP/IP with FreeBSD

Hello,

sorry to bother you with another question/problem.

I'm using netmap's vale (on stable/11) for a bhyve(8) virtio-net backed SDN. The guests (unfortunately already in production) lose network service, i.e. they can no longer send or receive any packets, after about 2 days. This happens repeatedly and is most likely not load related, since there is no significant load.

Each guest otherwise runs fine, the host also runs without any other problem, and there is no network problem elsewhere (different NICs; I use one dedicated NIC with vlan(4) children, each child connected to one vale switch).

At some point the complete netmap subsystem seems to deadlock: 'vale-ctl' hangs uninterruptibly. Trying to attach tcpdump to a vale switch also hangs uninterruptibly. Stopping bhyve guests (shutting them down from inside) works up to the point where the vale port should be destroyed. I could continue the list of symptoms, but I guess that wouldn't help in any way.

My question is: where can I start finding out what is happening in the netmap subsystem? There were no kernel messages right before or during the deadlock! The only userland tool I'm familiar with (vale-ctl) isn't usable at all in that situation. Any hints on what to try?

Here's an excerpt of the processes running on the host with the locked-up netmap after all guests had been shut down, just before I rebooted.
Snipped a lot; the interesting ones are those in state "netmap_g":

…
0 14213     1 0  20 0    5864       0 wait     IW   3     0:00,00 (sh)
0 14214 14213 0 -92 0 5358120 3586232 nm_kn_lo TC   3   148:02,02 bhyve: kallisto (bhyve)
0 14976  2522 0  20 0    6976       0 wait     IW   3     0:00,00 su
0 14981 14976 0  20 0    8256       0 pause    IW   3     0:00,00 _su (csh)
0 61615 14981 0  20 0    5864       0 wait     IW   3     0:00,00 (sh)
0 61616 61615 0  52 0 2180648 1973252 netmap_g DEC  3   286:11,91 bhyve: preed (bhyve)
0 62845 14981 0  20 0   11624    3328 bdg lock L+   3     0:00,01 tcpdump -n -e -s 150 -i vale1:test
…
0  1390  1388 0 -92 0 2330024  767756 nm_kn_lo TC   v0-  94:01,90 bhyve: styx0 (bhyve)
0  1401     1 0  52 0    5784       0 wait     IW   v0-   0:00,00 (sh)
0  1403  1401 0  20 0  368328   43444 -        TC   v0-   3:35,66 bhyve: korso (bhyve)
…
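
In case it helps anyone sift through a longer listing, here is a minimal Python sketch that picks out the rows sleeping on the netmap-related wait channels seen above. It assumes the column order of the excerpt (UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND, as I believe FreeBSD's long `ps` format prints it); the function names and the `WATCHED` set are only illustrative, and the sample rows are copied from the excerpt.

```python
import re

# Rows copied from the excerpt above; the column layout is an assumption:
# UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
SAMPLE = """\
0 14214 14213 0 -92 0 5358120 3586232 nm_kn_lo TC 3 148:02,02 bhyve: kallisto (bhyve)
0 61616 61615 0 52 0 2180648 1973252 netmap_g DEC 3 286:11,91 bhyve: preed (bhyve)
0 62845 14981 0 20 0 11624 3328 bdg lock L+ 3 0:00,01 tcpdump -n -e -s 150 -i vale1:test
0 1403 1401 0 20 0 368328 43444 - TC v0- 3:35,66 bhyve: korso (bhyve)
"""

# Wait channels that appear on the hung processes in the excerpt.
WATCHED = {"netmap_g", "nm_kn_lo", "bdg lock"}

# TIME looks like 148:02,02 (or 148:02.02, depending on locale).
TIME_RE = re.compile(r"^\d+:\d\d[.,]\d\d$")


def parse_row(line):
    """Return (pid, wchan, stat, command) or None if the row does not fit."""
    tokens = line.split()
    # 8 numeric columns, then at least MWCHAN, STAT, TT before TIME.
    for i in range(11, len(tokens)):
        if TIME_RE.match(tokens[i]):
            # MWCHAN may contain a space ("bdg lock"), so it is whatever
            # sits between RSS and the STAT/TT pair preceding TIME.
            wchan = " ".join(tokens[8:i - 2])
            stat = tokens[i - 2]
            command = " ".join(tokens[i + 1:])
            return int(tokens[1]), wchan, stat, command
    return None


def stuck(text, watched=WATCHED):
    """List (pid, wchan, command) for rows whose wait channel is watched."""
    hits = []
    for line in text.splitlines():
        row = parse_row(line)
        if row and row[1] in watched:
            hits.append((row[0], row[1], row[3]))
    return hits


if __name__ == "__main__":
    for pid, wchan, command in stuck(SAMPLE):
        print(pid, wchan, command)
```

Locating the TIME column first and working backwards is a deliberate choice here, since splitting purely on whitespace would misread the two-word "bdg lock" channel.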