Date: Tue, 27 Mar 2018 20:34:14 +0200
From: "Kristof Provost" <kristof@sigsegv.be>
To: "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>
Cc: "Reshad Patuck" <reshad@patuck.net>, "FreeBSD Net" <freebsd-net@freebsd.org>
Subject: Re: [vnet] [epair] epair interface stops working after some time
Message-ID: <87E6B8EC-F92B-49F4-A540-04E084FA0A33@sigsegv.be>
In-Reply-To: <2D15ABDE-0C25-4C97-AEA6-0098459A2795@lists.zabbadoz.net>
References: <CADaJeD2LZy=RU0vtqD7+dkZkUs0GKW+7duGDQkZ19GR-_cS=MQ@mail.gmail.com>
 <71B1A1BD-6FCF-47BB-9523-CCAAC03799A5@sigsegv.be>
 <1563563.7DUcjoHYMp@reshadlaptop.patuck.net>
 <C162AFB2-FF80-4640-BDC8-23B30CC22873@sigsegv.be>
 <1D6101CD-BCB4-4206-838B-1A75152ACCC4@sigsegv.be>
 <AB52ED81-F97F-471B-A1BA-F3221152A586@patuck.net>
 <F382A5B4-6941-43C0-9686-4B108034EBF1@patuck.net>
 <FDCE9FAA-1289-4E15-9239-1B6FD98B589C@sigsegv.be>
 <38C78C2B-87D2-4225-8F4B-A5EA48BA5D17@patuck.net>
 <5803CAA2-DC4A-4E49-B715-6DE472088DDD@sigsegv.be>
 <9CAB4522-0B0A-42BF-B9A4-BF36AFC60286@patuck.net>
 <7202AFF2-A314-41FE-BD13-C4C77A95E106@sigsegv.be>
 <2D15ABDE-0C25-4C97-AEA6-0098459A2795@lists.zabbadoz.net>
On 27 Mar 2018, at 16:48, Bjoern A. Zeeb wrote:
> On 27 Mar 2018, at 14:40, Kristof Provost wrote:
>
>> (Re-cc freebsd-net, because this is useful information)
>>
>> On 27 Mar 2018, at 13:07, Reshad Patuck wrote:
>>> The epair crash occurred again today running the epair module code
>>> with the added dtrace sdt providers.
>>>
>>> Running the same command as last time, 'dtrace -n ::epair\*:'
>>> returns the following:
>>> ```
>>> CPU     ID                    FUNCTION:NAME
>> …
>>>   0  66499       epair_transmit_locked:enqueued
>>> ```
>>
>>> Looks like it has filled up a queue somewhere and is dropping
>>> connections after that.
>>>
>>> The value of 'error' is 55. I can see both the ifp and m structs
>>> but don't know what to look for in them.
>>>
>> That's useful. Error 55 is ENOBUFS, which in IFQ_ENQUEUE() means
>> we're hitting _IF_QFULL().
>> There don't seem to be counters for that drop though, which makes
>> it hard to diagnose without these extra probe points.
>> It also explains why you don't really see any drop counters
>> incrementing.
>>
>> The fact that this queue is full presumably means that the other side
>> is no longer reading packets off it.
>> That's supposed to happen in epair_start_locked() (look for the
>> IFQ_DEQUEUE() calls).
>>
>> It's not at all clear to me how, but it looks like the receive side
>> is not doing its work.
>>
>> It looks like the IFQ code is already a fallback for when the netisr
>> queue is full.
>> That code might be broken, or there might be a different issue that
>> means you'll always end up in the same situation, regardless of
>> queue size.
>>
>> It's probably worth playing with 'net.link.epair.netisr_maxqlen'.
>> I'd recommend *lowering* it, to see if the problem happens more
>> frequently that way. If it does, that will be helpful in reproducing
>> and trying to fix this. If it doesn't, the full queue is probably a
>> consequence rather than a cause/trigger.
>> (Of course, once you've confirmed that lowering netisr_maxqlen
>> makes the problem more frequent, go ahead and increase it.)
>
> netstat -Q will be useful

Reshad included that in his e-mail to me:

> On the system with the bug 'netstat -Q' seems to have queue drops for
> epair.
> ```
> # netstat -Q
> Configuration:
> Setting                        Current        Limit
> Thread count                         1            1
> Default queue limit                256        10240
> Dispatch policy                 direct          n/a
> Threads bound to CPUs         disabled          n/a
>
> Protocols:
> Name   Proto QLimit Policy Dispatch Flags
> ip         1    256   flow  default   ---
> igmp       2    256 source  default   ---
> rtsock     3    256 source  default   ---
> arp        4    256 source  default   ---
> ether      5    256 source   direct   ---
> ip6        6    256   flow  default   ---
> epair      8   2100    cpu  default   CD-
>
> Workstreams:
> WSID CPU   Name     Len WMark    Disp'd  HDisp'd  QDrops     Queued    Handled
>    0   0     ip       0    30  11150458        0       0   13092275   24242558
>    0   0   igmp       0     0         0        0       0          0          0
>    0   0 rtsock       0     1         0        0       0         42         42
>    0   0    arp       0     0  56380919        0       0          0   56380919
>    0   0  ether       0     0 108761357        0       0          0  108761357
>    0   0    ip6       0    10  34999359        0       0    4091259   39090613
>    0   0  epair       0  2100         0        0  210972  303785724  303785724
> ```
>
> I also noticed that the values for 'epair' in the 'Workstreams'
> section, including drops, do not change, while all others increase
> after some time.

I think I've triggered this problem by setting
net.link.epair.netisr_maxqlen to an absurdly low value (2 in my case).
It looks like there's an issue with the handling of an overflow of the
"hardware" queue, but I don't really understand that code.

Regards,
Kristof