From owner-freebsd-acpi@FreeBSD.ORG  Mon Jul 30 08:55:57 2012
Return-Path: <owner-freebsd-acpi@FreeBSD.ORG>
Delivered-To: freebsd-acpi@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 53717106566B;
	Mon, 30 Jul 2012 08:55:57 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au
	[211.29.132.186])
	by mx1.freebsd.org (Postfix) with ESMTP id BE2908FC08;
	Mon, 30 Jul 2012 08:55:56 +0000 (UTC)
Received: from c122-106-171-246.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-246.carlnfd1.nsw.optusnet.com.au [122.106.171.246])
	by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q6U8tkcJ007031
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Mon, 30 Jul 2012 18:55:47 +1000
Date: Mon, 30 Jul 2012 18:55:46 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Alexander Motin <mav@FreeBSD.org>
In-Reply-To: <501628D2.2090507@FreeBSD.org>
Message-ID: <20120730171246.Y1715@besplex.bde.org>
References: <5014DD00.3000307@FreeBSD.org>
	<20120729175031.U2084@besplex.bde.org>
	<50150CF5.4070605@FreeBSD.org> <20120729221526.H2941@besplex.bde.org>
	<50154C58.4060408@FreeBSD.org> <20120730141426.D1219@besplex.bde.org>
	<501628D2.2090507@FreeBSD.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-acpi@FreeBSD.org
Subject: Re: Using bintime() in acpi_cpu_idle()?
X-BeenThere: freebsd-acpi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: ACPI and power management development <freebsd-acpi.freebsd.org>
X-List-Received-Date: Mon, 30 Jul 2012 08:55:57 -0000

On Mon, 30 Jul 2012, Alexander Motin wrote:

> On 30.07.2012 07:33, Bruce Evans wrote:
>> On Sun, 29 Jul 2012, Alexander Motin wrote:
>>> ...
>>> Timecounter already has detection logic to disable TSC in cases where
>>> it is unreliable. I don't want to replicate it here. I need not
>>> precise and not synchronized by reliable and fast time source.
>> 
>> Yes, this logic gives exactly what you don't want (an inefficient
>> timecounter), by preventing use of the TSC for the timecounter, although
>> the TSC is perfectly usable for the ticker and here.
>
> Can you teach me how to use ticker that is not ticking? If TSC was considered 
> unusable for timecounter for reasons unrelated to SMP, how can I use it as 
> ticker.

No :-).  I can't teach you how to use either the ticker or the timecounter
if their clock is not ticking.  I'm just saying that if you can blindly
use a timecounter, then you can blindly use the ticker.  Both work only
if their clock doesn't stop ticking, and in many cases their clock is
the same (the TSC).

The TSC is considered usable for the ticker under weak conditions:
- it exists according to CPUID_TSC
- it is not disabled by the machdep.disable_tsc tunable
- its dynamic probe finds that its frequency is nonzero.  The probe
   has some more cpuid tests and other complications which may prevent
   it being fully dynamic.  There is another tunable,
   machdep.disable_tsc_calibration which prevents the dynamic frequency
   determination.  I think the frequency comes from a table then, and
   is never zero, so this doesn't prevent the TSC being used for the
   ticker.
- the 2 tunables are of course undocumented in /usr/share/man.  There is
   hardly any useful documentation of the TSC there either.  zgrep finds
   "TSC" mainly in timecounters(4) and hwpmc(4).  In timecounters(4),
   the references to the TSC are useless since they are just literal
   output of $(sysctl kern.timecounter).  In hwpmc(4), the RDTSC
   instruction but not much more is mentioned.
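
A minimal sketch of the three ticker conditions above, written as a
predicate.  This is not the kernel's code: the struct and function names
are made up for illustration, and only the three tests come from the
list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of what the probe learned; not a kernel structure. */
struct tsc_probe {
	bool		cpuid_tsc;	/* CPUID_TSC feature bit is set */
	bool		disable_tsc;	/* machdep.disable_tsc is set */
	uint64_t	freq;		/* probed or table frequency, in Hz */
};

/* The TSC can drive the ticker iff all three weak conditions hold. */
static bool
tsc_ok_for_ticker(const struct tsc_probe *p)
{
	assert(p != NULL);
	return (p->cpuid_tsc && !p->disable_tsc && p->freq != 0);
}
```

Nothing here asks whether the TSC keeps ticking in sleep states, which
is why the ticker conditions are so much weaker than the timecounter
ones.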

The TSC is considered usable for a timecounter under the above conditions,
but its default quality is low so it rarely gets used.  Its quality is
changed under the following conditions:
- APM enabled: reduce quality to nearly -infinity
- CPU can deep sleep, and Intel CPU, and TSC not invariant: reduce quality
   to nearly -infinity, because (only) Intel CPUs are known to stop the
   TSC in deep sleeps under these conditions.  This is what you should have
   told me to justify use of binuptime() :-).  Users can still configure
   the TSC as a timecounter, but this would break more than your use of
   binuptime() if the TSC actually stops.
- SMP configured, and > 1 CPU:
   - vm guest: reduce quality significantly, but not to nearly -infinity
   - else do cpuid and dynamic synchronization tests:
     - fail tests: reduce as for vm guest
     - pass tests: increase a little, to just above ACPI-fast IIRC
     - pass synchronization tests, but not invariant: keep default.
- SMP not configured, or only 1 CPU: increase a little iff invariant.
   Invariant means P-state invariant.  I forgot that the invariance flag
   was a tunable.  This tunable, kern.timecounter.invariant_tsc, is
   of course undocumented.  It conditionalizes more than this case.
   Other bugs in it are:
   - it is in a different namespace than the tunables described above.
   - this different namespace is worse, since the flag applies to more
     than the timecounter decision.  It also gives the ticker invariance
     flag, and controls whether there are event handlers for frequency
     changes.
   - you can force the flag on using the tunable, but you can't force
     it off.
- for SMP, there is also the kern.timecounter.smp_tsc tunable.  This has
   much the same bugs as kern.timecounter.invariant_tsc:
   - it is of course undocumented
   - you can force it on, but you can't force it off
   - however, its namespace seems to be not incorrect, since it seems
     to only control timecounter quality (very indirectly now, by
     modifying the dynamic probes.  It used to be a simple flag to
     modify the SMP config option).
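
The quality adjustments above can be sketched as one decision function.
This is only an illustration of the rules as listed, not the kernel's
implementation: the struct, the function name and the enum are
hypothetical, and the numeric values are placeholders chosen only to
preserve the ordering (never < poor < default < good).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder quality levels; only their relative order matters here. */
enum {
	Q_NEVER   = -1000,	/* "nearly -infinity": not chosen automatically */
	Q_POOR    = -100,	/* reduced significantly, but not to Q_NEVER */
	Q_DEFAULT = 800,	/* low default, assumed to be below ACPI-fast */
	Q_GOOD    = 1000	/* assumed to be just above ACPI-fast */
};

/* Hypothetical inputs to the decision; not a kernel structure. */
struct tsc_env {
	bool	apm_enabled;
	bool	deep_sleep;	/* CPU can enter deep sleep states */
	bool	intel;
	bool	invariant;	/* P-state invariant TSC */
	bool	smp;		/* SMP configured and more than 1 CPU */
	bool	vm_guest;
	bool	sync_ok;	/* passed the dynamic synchronization tests */
};

static int
tsc_quality(const struct tsc_env *e)
{
	assert(e != NULL);
	if (e->apm_enabled)
		return (Q_NEVER);
	if (e->deep_sleep && e->intel && !e->invariant)
		return (Q_NEVER);
	/* On SMP, a vm guest or a failed synchronization test is penalized. */
	if (e->smp && (e->vm_guest || !e->sync_ok))
		return (Q_POOR);
	/* Otherwise (UP, or SMP that passed): raise a little iff invariant. */
	return (e->invariant ? Q_GOOD : Q_DEFAULT);
}
```

Users can still select the TSC by hand whatever this returns; the
quality only affects the automatic choice.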

Stopping of the TSC in deep sleeps doesn't prevent its use as the ticker.
This should mostly work for the main use of the ticker, for thread
runtimes, because most threads never idle directly, but switch to the
idle thread for some CPU.  I think deep sleeps break runtime accounting
for idle threads (if the ticker stops).  Has anyone seen this (idle
times near 0 on mostly-idle systems that have spent days idling)?

>>>> I wouldn't trust timecounters for some time after waking up after a
>>>> deep sleep.  If their clock stopped then the times read might only be
>>>> ...
>>> I am not sure what reinitialization are you talking about. IIRC, there
>>> is no any waking up code for TSC. None other time counters have
>>> problems with C-states.
>> 
>> It is the timecounter code that needs reinitializing.  If the TSC stops,
>> or wraps mod 2**32, then its counts become garbage for the purpose of
>> timecounting.  Maybe it is not used for timecounting in either of these
>> cases.  But these cases shouldn't prevent its use for timecounting.
> ...
> At this moment I am not talking about S-states sleeping for hours. I am 
> talking about C-states for milliseconds. It means that TSC may stop and start 
> 10K times each second or even more. Attempt to save and restore its state 
> will consume so much resources, that probably make it useless.

You should have told me the lengths of the sleeps early in this thread :-).
I only know enough about this to ask annoying questions.

> What's about wrap after 2 seconds, I would be happy to make CPU sleep for so 
> long, but now 100ms is all I can hope even on idle system.

Covered by the above, but future-proofing requires supporting arbitrary
sleep lengths.  Use a less efficient timer that works over long sleeps
iff the sleep was long.  The problems are to determine whether the sleep
was long, and to switch timers.
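
A rough sketch of the timer switch, assuming a fast 32-bit counter that
keeps running but wraps, and assuming that some cheap coarse estimate of
the sleep length is already available (getting that estimate is the hard
part and is not shown).  All names and the threshold parameter are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Ticks between two reads of a free-running 32-bit counter.  Modular
 * arithmetic absorbs one wrap, so this is right only if fewer than
 * 2^32 ticks elapsed.
 */
static uint32_t
fast_delta(uint32_t before, uint32_t after)
{
	return (after - before);
}

/*
 * Sleep length in fast-counter ticks.  coarse_ticks is the same interval
 * as seen by a slower timer that does not wrap over long sleeps, already
 * converted to fast-counter ticks.
 */
static uint64_t
sleep_ticks(uint32_t before, uint32_t after, uint64_t coarse_ticks,
    uint64_t long_threshold)
{
	/* The threshold must stay below the wrap period. */
	assert(long_threshold < (UINT64_C(1) << 32));
	if (coarse_ticks >= long_threshold)
		return (coarse_ticks);		/* long sleep: trust the slow timer */
	return (fast_delta(before, after));	/* short sleep: precise and cheap */
}
```

A threshold of half the wrap period leaves margin for the coarseness of
the slow estimate.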

>> At boot time there is a dummy timecounter that returns bogo-times.
>> Apparently sleeping doesn't occur before the timecounter is switched to
>> a real one.  The dummy timecounter isn't switched back to after boot
>> time.  But it probably should be, since the hardware timecounter may
>> have stopped or wrapped.  Sleeping could just set a flag to indicate
>> this state, but then you would have to provide a fake time anyway on
>> finding the flag set.  Boot time just points to the dummy timecounter
>> so as not to check this flag in all early timecounter "hardware" calls.
>
> And how dummy timecounter that counts something, but not time, can help me to 
> measure sleep time?

It helps negatively.  You can't use a dummy timecounter any more than you
can use a stopped or wrapped timecounter if you actually need to know
the time.

In low-level code it is unclear whether timecounters can be used.
Where binuptime() can be used is of course undocumented.  binuptime()
actually has a man page, but it is just a stub.  Timecounters actually
can be used in most low-level code, partly because they need to work
in fast interrupt handlers for PPS timestamps, but I wouldn't want to
make their normal use slower to support this.

Bruce