From owner-freebsd-hackers@FreeBSD.ORG Wed Feb 20 18:34:01 2008
Message-ID: <47BC7287.6000301@icyb.net.ua>
Date: Wed, 20 Feb 2008 20:33:43 +0200
From: Andriy Gapon <avg@icyb.net.ua>
User-Agent: Thunderbird 2.0.0.9 (X11/20071208)
To: freebsd-acpi@freebsd.org, freebsd-hackers@freebsd.org
References: <479F0ED4.9030709@icyb.net.ua> <479F62D9.6080703@root.org> <47A33CCB.3090902@icyb.net.ua> <47B0C10F.6000109@icyb.net.ua> <47B4103A.6090902@icyb.net.ua> <47B4A103.7040801@icyb.net.ua> <47B4B31A.4020605@icyb.net.ua> <47B84E61.3060401@icyb.net.ua> <47BB375C.5010208@icyb.net.ua> <47BB4D5C.9000406@icyb.net.ua>
In-Reply-To: <47BB4D5C.9000406@icyb.net.ua>
Subject: Re: cx_lowest and CPU usage
on 19/02/2008 23:42 Andriy Gapon said the following:
> The last result most probably means that the RTC IRQ was not the interrupt
> that woke the CPU from its sleeping state.
> The first possibility that comes to mind is that on this particular
> hardware the RTC interrupt (IRQ8) is not able to wake the system from the
> C2 state.

So it seems that this was true. Here is a shortcut to the relevant info:
PIIX4E (FW82371EB) specification, DEVACTB — DEVICE ACTIVITY B (FUNCTION 3)
PCI register description, BRLD_EN_IRQ8, bit 5:

$ pciconf -r pci0:0:7:3 0x58
03040c07

Bit 5 is clear in this value, i.e. IRQ8 is not enabled as a break event
that could bring the CPU out of C2.

> Is this possible?
> Are there known instances of such breakage?
> Are there other possible explanations?
>
> on 19/02/2008 22:09 Andriy Gapon said the following:
>> More news about the issue. Finally, I stopped being blind and tried to
>> use KTR.
>> Graphs produced by schedgraph confirmed my suspicion that the system
>> spends most of its time in the idle thread (sleeping).
>> But the RTC interrupt hits, with high precision, the relatively very
>> short intervals when swi4 runs.
>> It seems that the relatively long C2 transition times on this hardware
>> cause this "synchronization", but I am not sure exactly how. It is even
>> harder to say how long it takes the hardware (chipset+CPU) to wake from
>> that sleep on an interrupt.
>> One interesting thing that I immediately noticed is that having DUMMYNET
>> in my kernel made a huge difference.
>> A typical KTR sequence looks like this:
>> 1. we are in the idle thread (apparently sleeping in C2)
>> 2. the clock (IRQ0) interrupt handler is executed (apparently after the
>> CPU is woken, and the idle thread enables interrupts after post-C2
>> processing)
>> 3. hardclock() checks whether there is any timeout to be processed by
>> softclock, and thanks to dummynet we have its callout scheduled every
>> tick
>> 4. hardclock() schedules swi4 to run
>>
>> I am fuzzy about the following steps; I see traces, but I don't know the
>> code, so I write down my guess and knowledgeable people can correct me.
>> 5. near the end of the interrupt handler routine we have a
>> critical_exit() call; it can call mi_switch() if the current thread
>> needs to be preempted by a different thread; in this case the interrupt
>> code is executed in the context of the system idle thread, and the idle
>> thread is "owing preemption" to swi4 (see step 4)
>> 6. so apparently mi_switch() happens in the idle thread and execution
>> jumps to the swi4 thread, which now exits from mi_switch() in the soft
>> interrupt loop routine
>> 7. so far, starting from step 2, execution has been with interrupts
>> disabled; now swi4 unlocks a spin mutex and that enables interrupts (see
>> the previous message quoted below)
>> 8. "immediately", the RTC (IRQ8) interrupt handler is executed;
>> apparently it has been pending for some time by now
>> 9. statclock is called, and because we are on a swi thread this RTC tick
>> is charged to "interrupts"
>>
>> So, here we are. But I am still not sure why the above happens so
>> "deterministically": an RTC interrupt can happen anywhere between two
>> clock interrupts. So we have a 1000us range for RTC and a 90us C2 exit
>> delay; that should not be enough to always put IRQ0 and IRQ8 "together".
>>
>> Anyway, I excluded DUMMYNET from the kernel config, but kept SCHED_ULE
>> and HZ=1000. Now with C2 enabled I get 15-20% of (reported!) interrupt
>> load on a completely idle system. That is still a lot, but it is much
>> less than the 99% I had before :-)
>>
>> Now, without DUMMYNET, I will get new KTR dumps for cx_lowest=C1 and
>> cx_lowest=C2. I hope that this much cleaner experiment will give more
>> insight.
>>
>> But even now we can be sure that our CPU statistics code is not always
>> adequate. This is because IRQ8 is not processed like some
>> super-interrupt that looks at the system from above.
>> It is a regular participant in the system: it competes with IRQ0 and is
>> affected by the scheduler's decisions, especially those made while in an
>> interrupt handler.
>>
>>
>>
>> on 17/02/2008 17:10 Andriy Gapon said the following:
>>> on 14/02/2008 23:31 Andriy Gapon said the following:
>>>> I ran a series of tests, repeating each twice to be sure that I did
>>>> not make any mistakes.
>>>> All tests were performed in single-user mode, so the system was as
>>>> idle as possible; top reported idle as 99.N% - 100%.
>>>> Then I set hw.acpi.cpu.cx_lowest=C2 and ran top again.
>>>> Here are the results:
>>>> GENERIC, SCHED_4BSD, default HZ=1000 ==> C2-interrupt 11-14% (!!)
>>>> GENERIC, SCHED_4BSD, kern.hz="100" ==> C2-interrupt 99-100%
>>>> customized kernel, SCHED_ULE, default HZ=1000 ==> C2-interrupt 99-100%
>>>> customized kernel, SCHED_ULE, kern.hz="100" ==> C2-interrupt 99-100%
>>> I did some testing and here is some additional data.
>>> Running with sched_ule and hz=1000 I measured that the mean time spent
>>> in the C2 state (from starting to go into it until completely returning
>>> from it) is ~860us (end_time - start_time, no overhead adjustments).
>>> Additionally, 98.1% of sleeps lasted 800-900us, 1.8% lasted 700-800us,
>>> and the rest is spread almost uniformly over the remaining intervals of
>>> the 0-1000us range.
>>>
>>> Here's a typical backtrace for statclock execution:
>>> #9  0xc04cce26 in statclock (usermode=0)
>>>     at /system/src/sys/kern/kern_clock.c:447
>>> #10 0xc06b5f0c in rtcintr (frame=0xcfb30c78)
>>>     at /system/src/sys/i386/isa/clock.c:236
>>> #11 0xc06a5275 in intr_execute_handlers (isrc=0xc0736900, frame=0xcfb30c78)
>>>     at /system/src/sys/i386/i386/intr_machdep.c:362
>>> #12 0xc06b558c in atpic_handle_intr (vector=8, frame=0xcfb30c78)
>>>     at /system/src/sys/i386/isa/atpic.c:596
>>> #13 0xc06a0d71 in Xatpic_intr8 () at atpic_vector.s:70
>>> #14 0xc06a70c9 in spinlock_exit () at cpufunc.h:365
>>> #15 0xc04e6993 in ithread_loop (arg=0xc22f5980)
>>>     at /system/src/sys/kern/kern_intr.c:1137
>>> #16 0xc04e32c1 in fork_exit (callout=0xc04e6640 <ithread_loop>,
>>>     arg=0xc22f5980, frame=0xcfb30d38)
>>>     at /system/src/sys/kern/kern_fork.c:754
>>> #17 0xc06a0bc0 in fork_trampoline ()
>>>     at /system/src/sys/i386/i386/exception.s:205
>>>
>>> I.e. the swi4 kernel thread is "caught" by IRQ8 just when it exits from
>>> thread_unlock() in the following snippet:
>>> 1132         if (!ithd->it_need && !(ithd->it_flags & IT_DEAD)) {
>>> 1133                 TD_SET_IWAIT(td);
>>> 1134                 ie->ie_count = 0;
>>> 1135                 mi_switch(SW_VOL, NULL);
>>> 1136         }
>>> 1137         thread_unlock(td);
>>>
>>> This happens almost deterministically.
>>>
>>
>

-- 
Andriy Gapon