Date: Wed, 03 Apr 2019 09:17:27 +0100 From: Andrew Gierth <andrew@tao11.riddles.org.uk> To: Ian Lepore <ian@freebsd.org> Cc: Karl Denninger <karl@denninger.net>, ticso@cicely.de, "freebsd-arm\@freebsd.org" <freebsd-arm@freebsd.org> Subject: Re: insanely-high interrupt rates Message-ID: <87bm1nzgot.fsf@news-spur.riddles.org.uk> In-Reply-To: <dc56a8964cae942354cbe2b5b0620f2eebb569bb.camel@freebsd.org> (Ian Lepore's message of "Tue, 02 Apr 2019 20:58:01 -0600") References: <004ddba628b94b80845d8e509ddcb648d21fd6c9.camel@freebsd.org> <C68D7E6E-03C1-448F-8638-8BD1717DBF44@jeditekunum.com> <ac7d434f16f3a89f5ef247678d6becdbeded5c3f.camel@freebsd.org> <CE40E2B5-2244-4EF9-B67F-34A54D71E2E8@jeditekunum.com> <f60ea6d2-b696-d896-7bcb-ac628f41f7b8@denninger.net> <20190319161423.GH57400@cicely7.cicely.de> <52df098fdc0caf5de1879c93239534fffbd49b56.camel@freebsd.org> <40f57de2-2b25-3981-a416-b9958cc97636@denninger.net> <669892ac3fc37b0843a156c0ab102316829103fd.camel@freebsd.org> <663f2566-b035-7011-70eb-4163b41e6e55@denninger.net> <20190325164827.GL57400@cicely7.cicely.de> <3db9cf8a-68ee-e339-67bf-760ee51464fd@denninger.net> <fc17ac0f77832e840b9fffa9b1074561f1e766d8.camel@freebsd.org> <d96c7f42-f01b-8990-a558-ee92d631b51d@denninger.net> <dc56a8964cae942354cbe2b5b0620f2eebb569bb.camel@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
>>>>> "Ian" == Ian Lepore <ian@freebsd.org> writes: Ian> I now have an rpi2b, and I see the same thing you do in the Ian> 'vmstat -i' output, including rates like 500K int/sec on cpu3 Ian> timer0. But while it was behaving like that, the output from Ian> 'vmstat 1' showed a perfectly steady 8800-9200 int/sec, which is Ian> certainly more consistent with top showing under 2% being used for Ian> combined system+interrupt. I'm seeing this too right now. The problem is that whatever this problem is, it's not being accounted for in the %cpu statistics, probably because it takes one CPU completely out of play. So here's what I see. First, top: last pid: 6110; load averages: 1.00, 1.00, 1.00 up 1+19:26:47 08:35:02 107 threads: 6 running, 81 sleeping, 20 waiting CPU: 0.3% user, 0.0% nice, 0.7% system, 0.2% interrupt, 98.8% idle Mem: 10M Active, 611M Inact, 32M Laundry, 168M Wired, 98M Buf, 115M Free Swap: 4096M Total, 4096M Free PID USERNAME PRI NICE SIZE RES STATE C TIME CPU COMMAND 10 root 155 ki31 0 32K CPU2 2 43.3H 99.35% idle{idle: cpu2} 10 root 155 ki31 0 32K RUN 1 43.3H 98.71% idle{idle: cpu1} 10 root 155 ki31 0 32K CPU0 0 43.1H 98.49% idle{idle: cpu0} 1186 andrew 20 0 14M 8728K select 1 0:01 0.90% xterm Notice the load average of 1+, and that while it's reporting 98.8% idle, only three of the four cpu idle processes show up. The fourth idle process shows no accumulated CPU time, as shown here by ps axlH: 0 10 0 0 155 0 0 32 - RNL - 2589:43.57 [idle/idle: cpu0] 0 10 0 0 155 0 0 32 - RNL - 2597:14.64 [idle/idle: cpu1] 0 10 0 0 155 0 0 32 - RNL - 2597:19.11 [idle/idle: cpu2] 0 10 0 0 155 0 0 32 - RNL - 0:00.00 [idle/idle: cpu3] The CPU time spent handling these interrupts is not being accounted for anywhere. Also, to confirm that there is something odd going on, the CPU is running hotter than it would normally do while idle: ~42C rather than the ~38C that I would normally see with this level of activity. Ian> I think the problem here is with vmstat -i, not with actual Ian> interrupts. Something about the stats reporting is wrong on Ian> armv6/v7. There are two obvious issues with vmstat -i: it double- or triple-counts some interrupts, and it reports a false rate once the counters wrap (which happens quite quickly when this issue is in play). But these don't account for the problem described here. Here's lines from vmstat -i taken 10 seconds apart: local_intc0,3: + 3085878928 19685 local_intc0,3: + 3094431425 19738 That's 8552497 interrupts in 10 seconds, which matches systat -vm's report of ~850k interrupts/sec. Since these are 32-bit counters they will wrap in an hour and a half at that rate, so the ~19700/sec rate reported is entirely spurious (since it's just counter/uptime). Looking at the generic_timer driver, there's an obvious problem: it installs one interrupt handler on all the irqs and supposedly enables only one of the timers, but the interrupt handler code simply assumes that the timer causing the interrupt is the one that it enabled; if it gets an interrupt from one of the others, it'll never clear it or mask it, it'll just return with the interrupt unhandled, hence the interrupt storm. On 32-bit arm it is only supposed to enable the timer that's attached to local_intc0,1 and this indeed shows an interrupt rate of ~1100/sec which seems about right for the clock interrupt; the timer on local_intc0,3 (the virtual one) is never supposed to be enabled so far as I can tell. -- Andrew.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?87bm1nzgot.fsf>