From owner-freebsd-arm@freebsd.org Wed Apr 3 08:17:41 2019 Return-Path: Delivered-To: freebsd-arm@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 83C41156384C for ; Wed, 3 Apr 2019 08:17:41 +0000 (UTC) (envelope-from andrew@tao11.riddles.org.uk) Received: from lungold.riddles.org.uk (lungold.riddles.org.uk [82.68.208.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id D22386B212; Wed, 3 Apr 2019 08:17:39 +0000 (UTC) (envelope-from andrew@tao11.riddles.org.uk) Received: from [192.168.127.1] (port=14247 helo=caithnard.riddles.org.uk) by lungold.riddles.org.uk with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.88 (FreeBSD)) (envelope-from ) id 1hBb5M-0004Ls-Rt; Wed, 03 Apr 2019 08:17:28 +0000 Received: from [127.0.0.1] (port=38716 helo=caithnard.riddles.org.uk) by caithnard.riddles.org.uk with esmtp (Exim 4.89 (FreeBSD)) (envelope-from ) id 1hBb5M-0001x2-8C; Wed, 03 Apr 2019 08:17:28 +0000 From: Andrew Gierth To: Ian Lepore Cc: Karl Denninger , ticso@cicely.de, "freebsd-arm\@freebsd.org" Subject: Re: insanely-high interrupt rates In-Reply-To: (Ian Lepore's message of "Tue, 02 Apr 2019 20:58:01 -0600") Message-ID: <87bm1nzgot.fsf@news-spur.riddles.org.uk> References: <004ddba628b94b80845d8e509ddcb648d21fd6c9.camel@freebsd.org> <20190319161423.GH57400@cicely7.cicely.de> <52df098fdc0caf5de1879c93239534fffbd49b56.camel@freebsd.org> <40f57de2-2b25-3981-a416-b9958cc97636@denninger.net> <669892ac3fc37b0843a156c0ab102316829103fd.camel@freebsd.org> <663f2566-b035-7011-70eb-4163b41e6e55@denninger.net> <20190325164827.GL57400@cicely7.cicely.de> <3db9cf8a-68ee-e339-67bf-760ee51464fd@denninger.net> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (berkeley-unix) Date: Wed, 03 Apr 2019 09:17:27 +0100 MIME-Version: 1.0 Content-Type: text/plain X-Rspamd-Queue-Id: D22386B212 X-Spamd-Bar: - Authentication-Results: mx1.freebsd.org X-Spamd-Result: default: False [-1.64 / 15.00]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-0.77)[-0.770,0]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[4]; TO_DN_SOME(0.00)[]; NEURAL_HAM_LONG(-0.89)[-0.889,0]; MIME_GOOD(-0.10)[text/plain]; DMARC_NA(0.00)[riddles.org.uk]; AUTH_NA(1.00)[]; RCVD_COUNT_THREE(0.00)[3]; TO_MATCH_ENVRCPT_SOME(0.00)[]; MX_GOOD(-0.01)[lungold.riddles.org.uk]; NEURAL_HAM_SHORT(-0.62)[-0.620,0]; R_SPF_NA(0.00)[]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+]; ASN(0.00)[asn:13037, ipnet:82.68.0.0/14, country:GB]; RCVD_TLS_LAST(0.00)[]; IP_SCORE(-0.25)[ipnet: 82.68.0.0/14(-0.15), asn: 13037(-1.00), country: GB(-0.09)] X-BeenThere: freebsd-arm@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Porting FreeBSD to ARM processors." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 03 Apr 2019 08:17:41 -0000 >>>>> "Ian" == Ian Lepore writes: Ian> I now have an rpi2b, and I see the same thing you do in the Ian> 'vmstat -i' output, including rates like 500K int/sec on cpu3 Ian> timer0. But while it was behaving like that, the output from Ian> 'vmstat 1' showed a perfectly steady 8800-9200 int/sec, which is Ian> certainly more consistent with top showing under 2% being used for Ian> combined system+interrupt. I'm seeing this too right now. The problem is that whatever this problem is, it's not being accounted for in the %cpu statistics, probably because it takes one CPU completely out of play. So here's what I see. First, top: last pid: 6110; load averages: 1.00, 1.00, 1.00 up 1+19:26:47 08:35:02 107 threads: 6 running, 81 sleeping, 20 waiting CPU: 0.3% user, 0.0% nice, 0.7% system, 0.2% interrupt, 98.8% idle Mem: 10M Active, 611M Inact, 32M Laundry, 168M Wired, 98M Buf, 115M Free Swap: 4096M Total, 4096M Free PID USERNAME PRI NICE SIZE RES STATE C TIME CPU COMMAND 10 root 155 ki31 0 32K CPU2 2 43.3H 99.35% idle{idle: cpu2} 10 root 155 ki31 0 32K RUN 1 43.3H 98.71% idle{idle: cpu1} 10 root 155 ki31 0 32K CPU0 0 43.1H 98.49% idle{idle: cpu0} 1186 andrew 20 0 14M 8728K select 1 0:01 0.90% xterm Notice the load average of 1+, and that while it's reporting 98.8% idle, only three of the four cpu idle processes show up. The fourth idle process shows no accumulated CPU time, as shown here by ps axlH: 0 10 0 0 155 0 0 32 - RNL - 2589:43.57 [idle/idle: cpu0] 0 10 0 0 155 0 0 32 - RNL - 2597:14.64 [idle/idle: cpu1] 0 10 0 0 155 0 0 32 - RNL - 2597:19.11 [idle/idle: cpu2] 0 10 0 0 155 0 0 32 - RNL - 0:00.00 [idle/idle: cpu3] The CPU time spent handling these interrupts is not being accounted for anywhere. Also, to confirm that there is something odd going on, the CPU is running hotter than it would normally do while idle: ~42C rather than the ~38C that I would normally see with this level of activity. Ian> I think the problem here is with vmstat -i, not with actual Ian> interrupts. Something about the stats reporting is wrong on Ian> armv6/v7. There are two obvious issues with vmstat -i: it double- or triple-counts some interrupts, and it reports a false rate once the counters wrap (which happens quite quickly when this issue is in play). But these don't account for the problem described here. Here's lines from vmstat -i taken 10 seconds apart: local_intc0,3: + 3085878928 19685 local_intc0,3: + 3094431425 19738 That's 8552497 interrupts in 10 seconds, which matches systat -vm's report of ~850k interrupts/sec. Since these are 32-bit counters they will wrap in an hour and a half at that rate, so the ~19700/sec rate reported is entirely spurious (since it's just counter/uptime). Looking at the generic_timer driver, there's an obvious problem: it installs one interrupt handler on all the irqs and supposedly enables only one of the timers, but the interrupt handler code simply assumes that the timer causing the interrupt is the one that it enabled; if it gets an interrupt from one of the others, it'll never clear it or mask it, it'll just return with the interrupt unhandled, hence the interrupt storm. On 32-bit arm it is only supposed to enable the timer that's attached to local_intc0,1 and this indeed shows an interrupt rate of ~1100/sec which seems about right for the clock interrupt; the timer on local_intc0,3 (the virtual one) is never supposed to be enabled so far as I can tell. -- Andrew.