From owner-freebsd-stable@FreeBSD.ORG Sun Mar 29 15:06:59 2009 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 25754106564A; Sun, 29 Mar 2009 15:06:59 +0000 (UTC) (envelope-from kamikaze@bsdforen.de) Received: from mail.bsdforen.de (bsdforen.de [212.204.60.79]) by mx1.freebsd.org (Postfix) with ESMTP id BB1548FC17; Sun, 29 Mar 2009 15:06:58 +0000 (UTC) (envelope-from kamikaze@bsdforen.de) Received: from mobileKamikaze.norad (unknown [88.130.212.151]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.bsdforen.de (Postfix) with ESMTP id 550868A00D1; Sun, 29 Mar 2009 17:06:56 +0200 (CEST) Message-ID: <49CF8E8D.1080604@bsdforen.de> Date: Sun, 29 Mar 2009 17:06:53 +0200 From: Dominic Fandrey User-Agent: Thunderbird 2.0.0.21 (X11/20090321) MIME-Version: 1.0 To: Alexander Motin References: <1238293386.00093672.1238281804@10.7.7.3> <49CF0803.1070505@FreeBSD.org> <49CF2F8D.6000905@bsdforen.de> <49CF4EB9.60108@FreeBSD.org> <49CF49F5.6010800@bsdforen.de> <49CF615A.6050304@FreeBSD.org> <49CF595A.30805@bsdforen.de> <49CF6B28.2080400@FreeBSD.org> <49CF60AB.4040709@bsdforen.de> <49CF6899.2060002@bsdforen.de> In-Reply-To: <49CF6899.2060002@bsdforen.de> X-Enigmail-Version: 0.95.7 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-stable@freebsd.org Subject: Re: powerd broken X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 29 Mar 2009 15:06:59 -0000 Dominic Fandrey wrote: > Dominic Fandrey wrote: >> Alexander Motin wrote: >>> Dominic Fandrey wrote: >>>> Alexander Motin wrote: >>>>> It means that one of your CPUs spent most of it's time in interrupt >>>>> processing and so far from idle. What does `top -P` shows you? Where >>>>> have you seen that ~6% CPU load? >>>> That is the load shown by the e17 CPU module. It's display has always >>>> been in sync with top in the past, no longer though, it appears. >>>> >>>> # top -PIS >>>> last pid: 68235; load averages: 0.09, 0.16, >>>> 0.17 >>>> up 0+05:17:29 13:05:10 >>>> 137 processes: 4 running, 117 sleeping, 16 waiting >>>> CPU 0: 1.1% user, 0.0% nice, 1.1% system, 61.4% interrupt, 36.3% idle >>>> CPU 1: 9.0% user, 0.0% nice, 3.4% system, 0.0% interrupt, 87.6% idle >>>> Mem: 419M Active, 415M Inact, 416M Wired, 3752K Cache, 183M Buf, 716M >>>> Free >>>> Swap: 4096M Total, 4096M Free >>>> >>>> PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU >>>> COMMAND >>>> 11 root 1 171 ki31 0K 16K RUN 1 286:42 93.46% >>>> idle: cpu1 >>>> 23 root 1 -80 - 0K 16K RUN 0 126:54 59.96% >>>> irq16: hdac0 uhci+ >>>> 12 root 1 171 ki31 0K 16K RUN 0 179:27 40.09% >>>> idle: cpu0 >>>> 1318 root 1 46 0 439M 312M select 1 12:55 1.76% Xorg >>>> 4361 musicpd 4 44 0 91164K 14412K ucond 1 1:57 0.78% mpd >>>> >>>> Some things strike me as odd. The difference between the load reported >>>> by powerd and top is still very significant and of course the high >>>> interrupt load. >>> powerd now reports/uses summary load of all CPUs (it can be bigger then >>> 100%), while top shows average. >>> >>>> I've got a mouse with a 1khz report rate (the only connected USB >>>> device), but unplugging it doesn't change the load. Neither does >>>> stopping moused (I'm running the system without HAL). There also >>>> is a fingerprint reader, but it is only detected by ugen. >>> I would start from identifying all devices sharing that IRQ and trying >>> to disable them (or unload their drivers) one by one. `systat -vm 1` >>> will show you how much interrupts actually happens there per second. >>> >>> On most of modern systems you can make hdac0 to not share that IRQ by >>> enabling MSI there with hint.hdac.0.msi=1 in loader.conf. >>> >> I already did unplug all devices to no avail. Afterwards I tried >> to 'kldunload -f' all usb devices, but most refused. However during >> this I recognized that /boot/modules holds an old u3g.ko, back from >> the time when it was not in stable. Apparently modules in >> /boot/modules have preference over those in /boot/kernels. I deleted >> the old u3g.ko and rebooted (because I couldn't get the system to >> unload it). Now everything is fine, the interrupt load is gone. >> >> I'll monitor this and come back here if it turns out this has just >> been coincidence. >> >> Thank you for all that help. >> >> Regards > > This has turned out to be an early call. I just happened to start > skype_devel and the whole thing started over, so I think it's > pretty certain, now, that this is a hdac issue. > However after a reboot I tried to reproduce this and now skype > and mpd are both running without causing an irq race. > > If the irq race reappears I will use the device hint you suggested > to rule out that this is an interrupt sharing issue. Even after setting the device hint the problem has reoccurred. It simply starts after a while, I cannot make out any cause. As it turned out uhci0 is the violent interrupt with an interrupt rate between 130k and 190k. This is so incredible, I wonder why the interrupt throttling doesn't kick in. 7 users Load 0.16 0.27 0.28 29 Mar 17:02 Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER Tot Share Tot Share Free in out in out Act 805472 76800 1252404 123696 615176 count All 1062992 97920 14120720 152840 pages Proc: Interrupts r p d s w Csw Trp Sys Int Sof Flt cow 182k total 6 84 360k 2770 14k 178k 224 26 12 zfod atkbd0 1 ozfod acpi0 irq9 1.1%Sys 25.9%Intr 0.4%User 0.0%Nice 72.6%Idle %ozfod psm0 irq12 | | | | | | | | | | | daefr ata0 irq14 =+++++++++++++ 15 prcfr 177k uhci0 drm0 9 dtbuf 117 totfr 23 wpi0 uhci1 Namei Name-cache Dir-cache 100000 desvn react ehci0 uhci Calls hits % hits % 7632 numvn pdwak uhci2 ehci 38 38 100 3663 frevn pdpgs 457 uhci3 21 intrn 2008 cpu0: time Disks ad4 cd0 sg0 pass0 473048 wire 188 hdac0 256 KB/t 0.00 0.00 0.00 0.00 653788 act bge0 257 tps 0 0 0 0 274796 inact 2008 cpu1: time MB/s 0.00 0.00 0.00 0.00 20192 cache %busy 0 0 0 0 594984 free I can rule out drm0 as the cause, because uhci0 is the only common presence in all occurrences of this problem.