From owner-freebsd-current@FreeBSD.ORG  Wed Nov  5 04:07:49 2003
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 516B816A4CE; Wed,  5 Nov 2003 04:07:49 -0800 (PST)
Received: from mailhub.fokus.fraunhofer.de (mailhub.fokus.fraunhofer.de
	[193.174.154.14])	by mx1.FreeBSD.org (Postfix) with ESMTP
	id BF4EE43FCB; Wed,  5 Nov 2003 04:07:47 -0800 (PST)
	(envelope-from brandt@fokus.fraunhofer.de)
Received: from beagle (beagle [193.175.132.100])hA5C7k022534;
	Wed, 5 Nov 2003 13:07:46 +0100 (MET)
Date: Wed, 5 Nov 2003 13:07:46 +0100 (CET)
From: Harti Brandt <brandt@fokus.fraunhofer.de>
To: John Baldwin <jhb@freebsd.org>
In-Reply-To: <20031105110813.J72398@beagle.fokus.fraunhofer.de>
Message-ID: <20031105130339.U72398@beagle.fokus.fraunhofer.de>
References: <XFMail.20031104130232.jhb@FreeBSD.org>
 <20031105110813.J72398@beagle.fokus.fraunhofer.de>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: current@freebsd.org
Subject: RE: New interrupt stuff breaks ASUS 2 CPU system
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
Reply-To: harti@freebsd.org
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 05 Nov 2003 12:07:49 -0000

On Wed, 5 Nov 2003, Harti Brandt wrote:

HB>On Tue, 4 Nov 2003, John Baldwin wrote:
HB>
HB>JB>
HB>JB>On 04-Nov-2003 Harti Brandt wrote:
HB>JB>> On Tue, 4 Nov 2003, Harti Brandt wrote:
HB>JB>>
HB>JB>> HB>On Tue, 4 Nov 2003, John Baldwin wrote:
HB>JB>> HB>
HB>JB>> HB>JB>
HB>JB>> HB>JB>On 04-Nov-2003 Harti Brandt wrote:
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> Hi,
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> I have an ASUS system with 2 CPUs that I need to run at HZ=10000. This
HB>JB>> HB>JB>> worked until yesterday, but with the new interrupt code it doesn't boot
HB>JB>> HB>JB>> anymore. It works for the standard HZ, but if I set HZ=1000 I get a double
HB>JB>> HB>JB>> fault. I suspect a race condition in the interrupt handling. My config
HB>JB>> HB>JB>> file has
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> options SMP
HB>JB>> HB>JB>> device apic
HB>JB>> HB>JB>> options HZ=1000
HB>JB>> HB>JB>
HB>JB>> HB>JB>Ok, I can try to reproduce.
HB>JB>> HB>JB>
HB>JB>> HB>JB>> Device configuration finished.
HB>JB>> HB>JB>> Timecounter "TSC" frequency 1380009492 Hz quality -100
HB>JB>> HB>JB>> Timecounters cpuid = 0; apic id = 00
HB>JB>> HB>JB>> instruction pointer   = 0x8:0xc048995d
HB>JB>> HB>JB>> stack pointer         = 0x10:0xc0821bf4
HB>JB>> HB>JB>> frame pointer        cpuid = 0; apic id = 00
HB>JB>> HB>JB>>
HB>JB>> HB>JB>> 0xc048995d is in critical_exit. It is the jmp after the popf from
HB>JB>> HB>JB>> cpu_critical_exit.
HB>JB>> HB>JB>
HB>JB>> HB>JB>This is where interrupts are re-enabled, so you are getting an interrupt.
HB>JB>> HB>JB>It might be helpful to figure what type of fault you are actually getting.
HB>JB>> HB>
HB>JB>> HB>tf_err is 0, tf_trapno is 30 (decimal).
HB>JB>>
HB>JB>> More information:
HB>JB>>
HB>JB>> I have replaced all the reserved vectors with individual ones, that set
HB>JB>> tf_err to the index (vector number). It appears that the vector number is
HB>JB>> 39 decimal. What does that mean?
HB>JB>
HB>JB>IRQ 7.
HB>JB>Can you post a verbose dmesg?  Also, can you try both with and without
HB>JB>ACPI?
HB>
HB>Attached are both dmesgs.
HB>
HB>More datapoints:
HB>
HB>I had the parallel port (irq7) and the second sio disabled in the BIOS.
HB>After enabling both I now get a panic in lapic_handle_intr: Couldn't get
HB>vector from ISR! After fetching the relevant docs from Intel I checked the
HB>registers of the apic pointed to by lapic. The interrupt taken is
HB>Xapic_irq1. isr1 is zero, but irr1 is 0x100 (that was without ACPI). How
HB>can that happen? As I understand it, the ISR holds the interrupts that have
HB>been delivered to the CPU, so if it is interrupted a bit should be set, correct?
HB>
HB>I then replaced the panic with a printf() followed by a return. Now the
HB>system comes to life, but I get a couple of these warnings. When the
HB>system is idle everything seems fine, but when I start my simulation
HB>application (which normally generates between 20k and 250k interrupts/sec
HB>depending on the MPSAFE setting of the ATM drivers) I get approx 1-2 of
HB>these messages per second (this is with HZ=1000).
HB>
HB>A question while reading the code: what does the global lapic variable
HB>refer to? As I understand every CPU has its local APIC. Does it point to
HB>one of those two? To which?

An additional point. In the above test, where I got 1-2 messages per second,
I have now disabled a debugging printout in the ATM driver that gave 3-4
messages per second (from the interrupt handler). Now the 'Couldn't
get...' messages have disappeared. So this really looks like a race
somewhere. Is it possible that the bit in the ISR somehow gets cleared
after the interrupt is handed to the processor but before Xapic_irq1
actually runs and sees that bit? Perhaps by another Xapic_irq1 instance
or something similar?
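
For reference, a minimal sketch (not the kernel code) of how a register
value such as irr1 = 0x100 maps to a vector number. It assumes the layout
described in the Intel documentation, where the 256-bit ISR and IRR are
split into eight 32-bit registers and bit n of register k stands for
vector 32*k + n. The helper name highest_vector_in_reg is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the FreeBSD sources.
 * Given the index of one 32-bit ISR/IRR register (0..7) and its value,
 * return the highest vector number whose bit is set, or -1 if the
 * register is empty (the "Couldn't get vector" case).
 */
int
highest_vector_in_reg(unsigned reg, uint32_t value)
{
	int bit;

	assert(reg < 8);	/* only eight 32-bit registers exist */
	for (bit = 31; bit >= 0; bit--) {
		if (value & ((uint32_t)1 << bit))
			return ((int)(reg * 32) + bit);
	}
	return (-1);
}
```

Under that assumption, highest_vector_in_reg(1, 0x100) gives 40 (bit 8 of
register 1), while a zero isr1 gives -1.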

harti
-- 
harti brandt,
http://www.fokus.fraunhofer.de/research/cc/cats/employees/hartmut.brandt/private
brandt@fokus.fraunhofer.de, harti@freebsd.org