From owner-freebsd-current@FreeBSD.ORG Wed Nov 5 04:07:49 2003
Date: Wed, 5 Nov 2003 13:07:46 +0100 (CET)
From: Harti Brandt
To: John Baldwin
cc: current@freebsd.org
Subject: RE: New interrupt stuff breaks ASUS 2 CPU system
In-Reply-To: <20031105110813.J72398@beagle.fokus.fraunhofer.de>
Message-ID: <20031105130339.U72398@beagle.fokus.fraunhofer.de>
Reply-To: harti@freebsd.org

On Wed, 5 Nov 2003, Harti Brandt wrote:

HB>On Tue, 4 Nov 2003, John Baldwin wrote:
HB>
HB>JB>
HB>JB>On 04-Nov-2003 Harti Brandt wrote:
HB>JB>> On Tue, 4 Nov 2003, Harti Brandt wrote:
HB>JB>>
HB>JB>> HB>On Tue, 4 Nov 2003, John Baldwin wrote:
HB>JB>> HB>
HB>JB>> HB>JB>
HB>JB>> HB>JB>On 04-Nov-2003 Harti Brandt wrote:
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> Hi,
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> I have an ASUS system with 2 CPUs that I need to run at HZ=1000. This
HB>JB>> HB>JB>> worked until yesterday, but with the new interrupt code it doesn't boot
HB>JB>> HB>JB>> anymore. It works for the standard HZ, but if I set HZ=1000 I get a double
HB>JB>> HB>JB>> fault.
HB>JB>> HB>JB>> I suspect a race condition in the interrupt handling. My config
HB>JB>> HB>JB>> file has
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> options SMP
HB>JB>> HB>JB>> device apic
HB>JB>> HB>JB>> options HZ=1000
HB>JB>> HB>JB>
HB>JB>> HB>JB>Ok, I can try to reproduce.
HB>JB>> HB>JB>
HB>JB>> HB>JB>> Device configuration finished.
HB>JB>> HB>JB>> Timecounter "TSC" frequency 1380009492 Hz quality -100
HB>JB>> HB>JB>> Timecounters cpuid = 0; apic id = 00
HB>JB>> HB>JB>> instruction pointer = 0x8:0xc048995d
HB>JB>> HB>JB>> stack pointer = 0x10:0xc0821bf4
HB>JB>> HB>JB>> frame pointer cpuid = 0; apic id = 00
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> 0xc048995d is in critical_exit. It is the jmp after the popf from
HB>JB>> HB>JB>> cpu_critical_exit.
HB>JB>> HB>JB>
HB>JB>> HB>JB>This is where interrupts are re-enabled, so you are getting an interrupt.
HB>JB>> HB>JB>It might be helpful to figure what type of fault you are actually getting.
HB>JB>> HB>
HB>JB>> HB>tf_err is 0, tf_trapno is 30 (decimal).
HB>JB>>
HB>JB>> More information:
HB>JB>>
HB>JB>> I have replaced all the reserved vectors with individual ones that set
HB>JB>> tf_err to the index (vector number). It appears that the vector number is
HB>JB>> 39 decimal. What does that mean?
HB>JB>
HB>JB>IRQ 7.
HB>JB>Can you post a verbose dmesg? Also, can you try both with and without
HB>JB>ACPI?
HB>
HB>Attached are both dmesgs.
HB>
HB>More datapoints:
HB>
HB>I had the parallel port (irq7) and the second sio disabled in the BIOS.
HB>After enabling both I now get a panic in lapic_handle_intr: Couldn't get
HB>vector from ISR! After fetching the relevant docs from Intel I checked the
HB>registers of the apic pointed to by lapic. The interrupt taken is
HB>Xapic_irq1. isr1 is zero, but irr1 is 0x100 (that was without ACPI). How
HB>may that happen? As I understand it, the ISR holds the interrupts that have
HB>been delivered to the CPU, so if it is interrupted a bit should be set, correct?
HB>
HB>I then have replaced the panic by a printf() followed by a return.
HB>Now the system comes to life, but I get a couple of these warnings. When the
HB>system is idle everything seems fine, but when I start my simulation
HB>application (which normally generates between 20k and 250k interrupts/sec
HB>depending on the MPSAFE setting of the ATM drivers) I get approx 1-2 of
HB>these messages per second (this is with HZ=1000).
HB>
HB>A question while reading the code: what does the global lapic variable
HB>refer to? As I understand every CPU has its local APIC. Does it point to
HB>one of those two? To which?

An additional point. In the above test where I got 1-2 messages per second I
have now disabled a debugging printout in the ATM driver that gave 3-4
messages per second (from the interrupt handler). Now the 'Couldn't get...'
messages have disappeared. So this really looks like a race somewhere. Is it
possible that the bit in the ISR somehow gets cleared between the point where
the interrupt is handed to the processor and the point where Xapic_irq1
actually runs and sees that bit? Perhaps from another Xapic_irq1 instance or
whatever?

harti
-- 
harti brandt,
http://www.fokus.fraunhofer.de/research/cc/cats/employees/hartmut.brandt/private
brandt@fokus.fraunhofer.de, harti@freebsd.org
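
For readers following the numbers in this thread, here is a minimal C sketch of the arithmetic behind "vector 39 is IRQ 7" and of how a value such as irr1 = 0x100 maps to a vector. It is only an illustration, not the kernel's code. The helper names are hypothetical, and the base of 32 is an assumption inferred from the vector 39 / IRQ 7 pairing above. The local APIC's ISR and IRR are 256 bits wide and are read as eight 32-bit words, so word 1 covers vectors 32 through 63.

```c
#include <stdint.h>

/* Assumption: hardware IRQ n arrives on IDT vector 32 + n, which is
 * consistent with "vector 39 -> IRQ 7" in the thread. */
#define IRQ_VECTOR_BASE 32

/* Hypothetical helper: IRQ number for a vector, or -1 if the vector
 * lies below the assumed base. */
static int vector_to_irq(unsigned vector)
{
    if (vector < IRQ_VECTOR_BASE)
        return -1;
    return (int)(vector - IRQ_VECTOR_BASE);
}

/* Hypothetical helper: highest vector whose bit is set in one 32-bit
 * ISR/IRR word. Word `index` (0..7) covers vectors index*32 through
 * index*32 + 31. Returns -1 when no bit is set, which is the case the
 * "Couldn't get vector from ISR!" message complains about. */
static int highest_vector_in_word(unsigned index, uint32_t word)
{
    int bit;

    if (word == 0)
        return -1;
    for (bit = 31; bit >= 0; bit--) {
        if (word & ((uint32_t)1 << bit))
            return (int)(index * 32) + bit;
    }
    return -1;
}
```

With these helpers, vector_to_irq(39) gives 7, highest_vector_in_word(1, 0x100) gives vector 40, and an all-zero isr1 gives -1. Which IRQ vector 40 belongs to depends on how the kernel assigns APIC vectors, so the sketch does not claim a mapping for it.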