From: "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>
To: "Kristof Provost"
Cc: "Reshad Patuck", "FreeBSD Net"
Subject: Re: [vnet] [epair] epair interface stops working after some time
Date: Tue, 27 Mar 2018 14:48:29 +0000
Message-ID: <2D15ABDE-0C25-4C97-AEA6-0098459A2795@lists.zabbadoz.net>
In-Reply-To: <7202AFF2-A314-41FE-BD13-C4C77A95E106@sigsegv.be>
References: <71B1A1BD-6FCF-47BB-9523-CCAAC03799A5@sigsegv.be> <1563563.7DUcjoHYMp@reshadlaptop.patuck.net> <1D6101CD-BCB4-4206-838B-1A75152ACCC4@sigsegv.be> <38C78C2B-87D2-4225-8F4B-A5EA48BA5D17@patuck.net> <5803CAA2-DC4A-4E49-B715-6DE472088DDD@sigsegv.be> <9CAB4522-0B0A-42BF-B9A4-BF36AFC60286@patuck.net> <7202AFF2-A314-41FE-BD13-C4C77A95E106@sigsegv.be>
List-Id: Networking and TCP/IP with FreeBSD

On 27 Mar 2018, at 14:40, Kristof Provost wrote:

> (Re-cc freebsd-net, because this is useful information)
>
> On 27 Mar 2018, at 13:07, Reshad Patuck wrote:
>> The epair crash occurred again today running the epair module code
>> with the added dtrace sdt providers.
>>
>> Running the same command as last time, 'dtrace -n ::epair\*:' returns
>> the following:
>> ```
>> CPU ID FUNCTION:NAME
>> …
>> 0 66499 epair_transmit_locked:enqueued
>> ```
>>
>> Looks like it’s filled up a queue somewhere and is dropping
>> connections after that.
>>
>> The value of 'error' is 55. I can see both the ifp and m structs
>> but don't know what to look for in them.
>>
> That’s useful. Error 55 is ENOBUFS, which in IFQ_ENQUEUE() means
> we’re hitting _IF_QFULL().
> There don’t seem to be counters for that drop, though, which makes
> it hard to diagnose without these extra probe points.
> It also explains why you don’t really see any drop counters
> incrementing.
>
> The fact that this queue is full presumably means that the other side
> is no longer reading packets off it.
> That’s supposed to happen in epair_start_locked() (look for the
> IFQ_DEQUEUE() calls).
>
> It’s not at all clear to me how, but it looks like the receive side
> is not doing its work.
>
> It looks like the IFQ code is already a fallback for when the netisr
> queue is full.
> That code might be broken, or there might be a different issue that
> will just mean you’ll always end up in the same situation,
> regardless of queue size.
>
> It’s probably worth experimenting with
> ‘net.route.netisr_maxqlen’. I’d recommend *lowering* it, to see
> if the problem happens more frequently that way. If it does, that’ll
> be helpful in reproducing and trying to fix this. If it doesn’t, the
> full queue is probably a consequence rather than a cause/trigger.
> (Of course, once you’ve confirmed that lowering netisr_maxqlen
> makes the problem more frequent, go ahead and increase it again.)

netstat -Q will be useful