From owner-freebsd-net@freebsd.org Thu Mar 29 13:09:17 2018
From: "Kristof Provost"
To: "Reshad Patuck"
Cc: freebsd-net@freebsd.org, "Bjoern A. Zeeb", "Reshad Patuck"
Subject: Re: [vnet] [epair] epair interface stops working after some time
Date: Thu, 29 Mar 2018 15:09:13 +0200
Message-ID: <10647168-66DF-48CD-9121-9CC2B00848D4@sigsegv.be>
In-Reply-To: <97945712-B53E-4CF6-B20E-6001CF40CDFC@gmail.com>
List-Id: Networking and TCP/IP with FreeBSD

On 29 Mar 2018, at 14:48, Reshad Patuck wrote:

> pulling the 'net.link.epair.netisr_maxqlen' down does seem to make
> this occur faster.

Good, I think my hypothesis about where the issue lies is correct then.
You should be able to avoid (or at least reduce the frequency of) the
issue by increasing the value on your system(s).

> When I dropped it to 2 like Kristof did and I have the same symptoms
> on a box which was not exhibiting the problems manually began to have
> the same symptoms.
> Bumping it back up to 2100 did not restore the functionality (I don't
> know if it should).

It’s good to know this. It doesn’t surprise me that it doesn’t fix
things. Something’s wrong in the code which handles an overflow of the
netisr queue in the epair driver.
Once that happens the IFF_DRV_OACTIVE flag gets set, and we keep
enqueuing outside the netisr queue. Somehow we never end up back in
epair_nh_drainedcpu(), so the flag never gets cleared and the driver
never recovers.

> I will create a PR for this later today with all the information I
> have gathered so that we can have it all in one place.

Thanks. Please cc me on it. I’ll see if I can figure out what the
problem is, but we might need someone smarter, so cc Bjoern too.

Regards,
Kristof
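
The failure mode described in this message (an overflow sets
IFF_DRV_OACTIVE, and the flag is only cleared from the drain callback)
can be illustrated with a toy model. This is a minimal sketch and not
the if_epair.c code: the struct, its fields and the toy_* functions are
all hypothetical names invented for illustration, and the real driver's
locking, per-CPU state and queueing are left out. It only shows why
raising the queue limit after the overflow would not help if the drain
callback is never reached.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; every name here is hypothetical. */
struct toy_epair {
    size_t netisr_len;  /* packets in the modelled netisr queue */
    size_t netisr_max;  /* stands in for net.link.epair.netisr_maxqlen */
    size_t backlog_len; /* packets parked outside the netisr queue */
    bool   oactive;     /* stands in for IFF_DRV_OACTIVE */
};

/* Transmit one packet. Returns true if it reached the netisr queue. */
bool toy_transmit(struct toy_epair *sc)
{
    if (sc->oactive) {
        /* Flag already set: keep enqueuing outside the netisr queue. */
        sc->backlog_len++;
        return false;
    }
    if (sc->netisr_len >= sc->netisr_max) {
        /* Overflow: set the flag and park the packet. */
        sc->oactive = true;
        sc->backlog_len++;
        return false;
    }
    sc->netisr_len++;
    return true;
}

/* The netisr consumer takes one packet off the queue. */
void toy_netisr_consume(struct toy_epair *sc)
{
    if (sc->netisr_len > 0)
        sc->netisr_len--;
}

/*
 * Stand-in for the drain callback. In this model it is the only place
 * that clears the flag, so if it is never called the flag stays set.
 */
void toy_drained(struct toy_epair *sc)
{
    while (sc->backlog_len > 0 && sc->netisr_len < sc->netisr_max) {
        sc->backlog_len--;
        sc->netisr_len++;
    }
    if (sc->backlog_len == 0)
        sc->oactive = false;
}
```

In this model, once toy_transmit() has hit the limit, emptying the
queue and raising netisr_max changes nothing: every later packet still
goes to the backlog until toy_drained() runs. That matches the
observation that bumping the sysctl back up to 2100 did not restore
functionality, though whether the real driver gets stuck for the same
reason is exactly what remains to be found out.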