From: Lev Serebryakov
Reply-To: lev@FreeBSD.org
Date: Thu, 12 Jan 2012 13:31:12 +0400
To: freebsd-current@freebsd.org, freebsd-net@freebsd.org
Cc: avg@FreeBSD.org, jhb@FreeBSD.org
Message-ID: <1379921442.20120112133112@serebryakov.spb.ru>
Subject: SCHED_ULE / NetGraph interaction broken somewhere between r227874 and r229818

Hello, Freebsd-current.

I have a router that connects to an upstream ISP over PPPoE using mpd5 from ports. I've used SCHED_ULE for a long time without any problems. Under heavy network load (the router is not the fastest one: a 500 MHz Geode CPU) the main CPU consumer was the "intr{swi1: netisr 0}" thread. But it never consumed more than 75%, and even when the upstream channel was completely saturated the router remained accessible and responsive.

The latest revision I'm sure is "good" is about r227874 (yes, from November 2011; I didn't update the router's system for a long time). But revision r229818 behaves completely differently: under network load, 100% of the CPU is consumed by the "ng_queue" thread (which never consumed any CPU on the old system). The system is unresponsive: DNS served from this system returns timeouts, and I cannot log in via SSH or the serial console (the pause between the login and password prompts is so long that it leads to timeouts). The load average jumps to 20+, and a pre-started `top' updates its screen once every 3-4 minutes.

Switching to 4BSD helps. 4BSD works as usual: all CPU time goes to interrupts and the network thread, the system is responsive under the heaviest load, and DNS, DHCP and hostapd operate normally.

There were NO significant changes in netgraph (svn log -r 227874:229818 sys/netgraph) and three changes (r229429, r228960, r228718) in the kern/sched_*.c files. But I'm not sure that these are the only changes that could affect this behavior.

Now I'm trying to find the "bad" revision by binary search, but it is very hard to do: the old mpd5 doesn't work on the new kernel and vice versa, so I need to rebuild the whole world, update my build box, rebuild ports with the new world, and only after that build a NanoBSD image for my router. It takes about 5 hours per iteration and there are more than 512 revisions, so it is about 10 iterations :(

I could provide any debug information from the old and new systems.

--
// Black Lion AKA Lev Serebryakov
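
For the binary search described above, the cost can be estimated with a small Python sketch. The function name `bisect_plan` and its parameters are hypothetical. It assumes every revision number between the known-good and known-bad ones is a candidate, which gives an upper bound, since many SVN revisions in that range may not touch the kernel at all.

```python
def bisect_plan(good, bad, hours_per_iteration):
    """Estimate the worst-case cost of bisecting between two revisions.

    Hypothetical helper: `good` is the last known-good revision, `bad` the
    first known-bad one. Returns (iterations, total_hours, first_midpoint).
    """
    # Revisions good+1 .. bad could each be the first bad one.
    candidates = bad - good
    # ceil(log2(candidates)) computed with integers only.
    iterations = (candidates - 1).bit_length()
    # First revision to build and test: the middle of the range.
    midpoint = good + candidates // 2
    return iterations, iterations * hours_per_iteration, midpoint


iterations, hours, first = bisect_plan(227874, 229818, 5)
print(iterations, hours, first)
```

With the revision numbers and the 5-hour build time from this report, the sketch gives a worst case of 11 builds, in line with the rough estimate above. Restricting the candidates to revisions that actually changed sys/ would shrink the range.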