Date: Mon, 31 May 2004 00:31:29 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Kris Kennaway <kris@obsecurity.org>
Cc: current@freebsd.org
Subject: Re: stray irq13 at runtime
Message-ID: <20040530233746.Q2376@gamplex.bde.org>
In-Reply-To: <20040530155728.S979@gamplex.bde.org>
References: <20040530043049.GA16224@xor.obsecurity.org> <20040530155728.S979@gamplex.bde.org>

On Sun, 30 May 2004, Bruce Evans wrote:

> On Sat, 29 May 2004, Kris Kennaway wrote:
>
> > Since updating the i386 package machines the other day, they've all
> > experienced the following:
> >
> > May 29 21:24:53 <user.err> gohan28 kernel: stray irq13
> >
> > irq13: npx0 2 0
> > stray irq13 1 0
> >
> > This is not appearing during boot - those machines have been up for
> > hours before the interrupt occurs.
> ...
> I haven't figured out why the APIC case normally delivers both a normal
> (fast) interrupt and stray interrupt when we don't wait for the one
> interrupt that actually occurs.  One is counted as stray because it
> occurs after the bus_teardown_intr(), but both of them seem to occur
> after that.  So there seems to be a race or double counting somewhere.

I have now figured this out.  There is double counting.  Interrupts are
supposed to be counted per-device (more precisely, per group of devices
sharing an interrupt at a given time), with interrupts that have no
handler in effect being counted as for the special "stray" device and
counts being maintained until reboot for all previous combinations of
devices.  This has been broken.  Interrupts are now counted per-vector
and reported as being for the last group of devices using the interrupt
(so history is lost if the combination is changed), and then if there
are no devices already using the interrupt, interrupts are counted
again as "stray".  In this case and some others, the stray interrupts
really did come from the last group of devices causing the interrupt,
but they shouldn't be counted twice.

I can duplicate your counts of 2 and 1 and explain them as follows:
- configure without "device apic" so that the other bug suite doesn't
  complicate things.  This gives initial counts of 1 for npx0 and
  stray irq13.
- run any program that causes an unmasked NPX exception.  This also
  causes an unmasked irq13 (because the recent optimization for edge
  triggering leaves irq13 enabled even when its handler has been torn
  down).  The irq13 is double-counted as for npx0 and stray irq13.

Further unmasked NPX exceptions don't cause further irq13s because the
first one was not properly handled.  The npx0 busy latch remains set,
so further irq13s are masked by that although not by the PIC.

Further irq13s for unmasked NPX exceptions don't happen for the APIC
case, although one wants to happen according to the PIC's IRR.

Summary:
- this bug really was harmless
- statistics for interrupt handling are more broken than I thought.

Bruce

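
The double counting described above can be illustrated with a small
model.  This is only a sketch and not the kernel's interrupt code:
struct irq_model, model_intr_broken() and model_intr_intended() are
hypothetical names invented for the example, and the sequence of one
interrupt with a handler installed followed by one after teardown is an
assumption, not something taken from the report.

```c
#include <assert.h>

#define NIRQ 16

/* Hypothetical model of the counters, not the FreeBSD data structures. */
struct irq_model {
	unsigned long vector_count[NIRQ];	/* shown under the last device name */
	unsigned long stray_count[NIRQ];	/* shown as "stray irqN" */
	int has_handler[NIRQ];			/* nonzero while a handler is set up */
};

/*
 * Broken scheme: every interrupt is counted per vector, and counted a
 * second time as stray if no handler is installed.
 */
static void
model_intr_broken(struct irq_model *m, int irq)
{
	assert(irq >= 0 && irq < NIRQ);
	m->vector_count[irq]++;
	if (!m->has_handler[irq])
		m->stray_count[irq]++;
}

/*
 * Intended scheme: each interrupt is counted exactly once, either for
 * the device group that has a handler or for the "stray" device.
 */
static void
model_intr_intended(struct irq_model *m, int irq)
{
	assert(irq >= 0 && irq < NIRQ);
	if (m->has_handler[irq])
		m->vector_count[irq]++;
	else
		m->stray_count[irq]++;
}
```

Under the assumed sequence, the broken scheme ends with a per-vector
count of 2 and a stray count of 1 for only two interrupts, while the
intended scheme ends with 1 and 1.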
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20040530233746.Q2376>