From nobody Tue Nov  9 15:56:19 2021
X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 2BDB51854FE0
	for <freebsd-current@mlmmj.nyi.freebsd.org>; Tue,  9 Nov 2021 15:56:22 +0000 (UTC)
	(envelope-from avg@freebsd.org)
Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (4096 bits) client-digest SHA256)
	(Client CN "smtp.freebsd.org", Issuer "R3" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4HpXdx6tYbz3LWx;
	Tue,  9 Nov 2021 15:56:21 +0000 (UTC)
	(envelope-from avg@freebsd.org)
Received: from [192.168.0.88] (unknown [195.64.148.76])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(Client did not present a certificate)
	(Authenticated sender: avg/mail)
	by smtp.freebsd.org (Postfix) with ESMTPSA id 78FCB27356;
	Tue,  9 Nov 2021 15:56:21 +0000 (UTC)
	(envelope-from avg@freebsd.org)
From: Andriy Gapon <avg@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: FreeBSD Current <freebsd-current@freebsd.org>
References: <aff7b1e5-c380-9d86-d638-047e618894e6@FreeBSD.org>
 <20191022104434.GM73312@kib.kiev.ua>
 <3a67f9a9-31cf-5814-4a68-8bdd6063b21e@FreeBSD.org>
 <20191022131633.GN73312@kib.kiev.ua>
 <9c131a2a-cc94-4d93-1ba8-595c0151e366@FreeBSD.org>
 <32fe4f76-155b-8b99-5782-4daafd4219d5@FreeBSD.org>
 <YYpg8p7f5q9JNEjE@kib.kiev.ua>
Subject: Re: thread on sleepqueue does not wake up after timeout
Message-ID: <fc651014-4f9d-4bc8-5c3b-07014208104f@FreeBSD.org>
Date: Tue, 9 Nov 2021 17:56:19 +0200
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:78.0) Gecko/20100101
 Firefox/78.0 Thunderbird/78.14.0
List-Id: Discussions about the use of FreeBSD-current <freebsd-current.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-current
List-Help: <mailto:freebsd-current+help@freebsd.org>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Subscribe: <mailto:freebsd-current+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-current+unsubscribe@freebsd.org>
Sender: owner-freebsd-current@freebsd.org
MIME-Version: 1.0
In-Reply-To: <YYpg8p7f5q9JNEjE@kib.kiev.ua>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-ThisMailContainsUnwantedMimeParts: N

On Tue, Nov 09, 2021 at 11:58:30AM +0200, Andriy Gapon wrote:
> Here is an explanation for the numbers reported in the panic message (sorted
> from earliest to latest):
> 190543869603412 - 'now' as seen in sleepq_timeout(), returned by sbinuptime();
> 190543869738008 - td_sleeptimo, also c_time in the callout;
> 190543869798505 - 'now' as captured when the LAPIC timer fired, seen in the
> stack trace and also recorded as c_exec_time in the callout.

Kostik,

thank you very much for the pointers!

I spent some more time staring at the code and at the timehands data from the 
dump (which I neglected to share earlier).  I am starting to think that there is 
a bug in the FreeBSD code (at least in the copy that we use) unless I got 
confused somewhere or made a mistake in calculations.

So, I think that there is a discrepancy between how "large deltas" are handled 
in tc_windup and bintime_off.
As large deltas happen very rarely, especially on good hardware, the bug should 
be very rare as well.

Now to the data.

(kgdb) p *timehands->th_counter
$28 = {tc_get_timecount = 0xffffffff809de380 <hpet_get_timecount>, tc_poll_pps = 
0x0, tc_counter_mask = 4294967295, tc_frequency = 14318180, tc_name = 
0xffffffff80b0ff97 "HPET", tc_quality = 950, tc_flags = 0,
   tc_priv = 0xfffffe0010916000, tc_next = 0xffffffff810f6c30 <tsc_timecounter>, 
tc_fill_vdso_timehands = 0xffffffff809dc6b0 <hpet_vdso_timehands>, 
tc_fill_vdso_timehands32 = 0x0}

(kgdb) p timehands_count
$76 = 2
(kgdb) p timehands
$75 = (struct timehands * volatile) 0xffffffff8109e1a0 <ths+128>
(kgdb) p &ths[0]
$77 = (struct timehands *) 0xffffffff8109e120 <ths>
(kgdb) p &ths[1]
$78 = (struct timehands *) 0xffffffff8109e1a0 <ths+128>

(kgdb) p ths[0]
$79 = {th_counter = 0xfffffe0010916060, th_adjustment = 254493021346896, 
th_scale = 1288420532592, th_large_delta = 14317331, th_offset_count = 
3817197766, th_offset = {sec = 44363, frac = 7084573033620442688}, th_bintime = {
     sec = 1636195324, frac = 14622574300909856022}, th_microtime = {tv_sec = 
1636195324, tv_usec = 792691}, th_nanotime = {tv_sec = 1636195324, tv_nsec = 
792691341}, th_boottime = {sec = 1636150961, frac = 7538001267289413334},
   th_generation = 2204358, th_next = 0xffffffff8109e1a0 <ths+128>}

(kgdb) p ths[1]
$80 = {th_counter = 0xfffffe0010916060, th_adjustment = 254492583661022, 
th_scale = 1288420532460, th_large_delta = 14317331, th_offset_count = 
3832485779, th_offset = {sec = 44364, frac = 8334125784005739824}, th_bintime = {
     sec = 1636195325, frac = 15872127051295153158}, th_microtime = {tv_sec = 
1636195325, tv_usec = 860429}, th_nanotime = {tv_sec = 1636195325, tv_nsec = 
860429731}, th_boottime = {sec = 1636150961, frac = 7538001267289413334},
   th_generation = 2204358, th_next = 0xffffffff8109e120 <ths>}


th_offset_count difference between the hands is 15288013.
It's a bit above tc_frequency of 14318180, so before the latest wind-up there 
hasn't been a wind-up for more than a second, a rare situation indeed.
The difference is also greater than th_large_delta of 14317331.

I redid the th_offset calculations in tc_windup by hand and arrived at exactly 
the same value of ths[1].th_offset as seen in kgdb using ths[0].th_offset and 
the delta as inputs.  So, this is consistent.

Then I did a thought experiment: what would binuptime() return at exactly the 
same moment when tc_windup was called?  That binuptime() would still use ths[0] 
as the timehands because the hands have not been switched yet and it would also 
see exactly the same timecounter delta.

So, starting conditions:
delta = 15288013
th_large_delta = 14317331
th_offset = {sec = 44363, frac = 7084573033620442688}
th_scale = 1288420532592

The calculations in the code (bintime_off) are:
	if (__predict_false(delta >= large_delta)) {
		/* Avoid overflow for scale * delta. */
		x = (scale >> 32) * delta;
		bt->sec += x >> 32;
		bintime_addx(bt, x << 32);
		bintime_addx(bt, (scale & 0xffffffff) * delta);
	} else {


My manual calculations:
x = (1288420532592 >> 32) * 15288013 == 4571115887
bt->sec += 4571115887 >> 32 == 1
   bt = { 44364, 7084573033620442688 }
bt <bintime_addx> 4571115887 << 32 == 1186049167181479936
   bt = { 44364, 8270622200801922624 }
bt <bintime_addx> (scale & 0xffffffff) * delta == 4225311088 * 15288013 == 
64596610842388144
   bt = { 44364, 8335218811644310768 }

So, comparing to ths[1].th_offset the resulting time has larger 'frac' part:
8335218811644310768 - 8334125784005739824 == 1093027638570944

So, IMO, this means that at the moment of the hands switch the binuptime (and 
all other times) would jump backwards.

Converting both times to sbintime_t I got:
190543869814104 is sbinuptime using ths[0] (and delta of 15288013)
190543869559614 is sbinuptime using ths[1] (and delta of 0)
This is a jump backwards by 254490 parts.

If I put these times together with the times found in the crash dump stack 
(quoted at the start), then I get:

190543869559614 - hand switch time calculated using ths[1] (and delta of 0)
190543869603412 - 'now' as seen in sleepq_timeout(), returned by sbinuptime();
190543869738008 - td_sleeptimo, also c_time in the callout;
190543869798505 - 'now' as captured when the LAPIC timer fired
190543869814104 - hand switch time calculated using ths[0] (and delta of 15288013)

So, it seems that the original (greater) 'now' was obtained some time before the 
switch.  And the second (lesser) 'now' was obtained some time after the switch.

So, as I was saying, when the delta is large the calculations in tc_windup and 
bintime_off give slightly different results and that can lead to a discontinuity 
of the time when timehands are switched.

Just in case, the code in tc_windup for comparison:
	th->th_offset_count += delta;
	th->th_offset_count &= th->th_counter->tc_counter_mask;
	while (delta > th->th_counter->tc_frequency) {
		/* Eat complete unadjusted seconds. */
		delta -= th->th_counter->tc_frequency;
		th->th_offset.sec++;
	}
	if ((delta > th->th_counter->tc_frequency / 2) &&
	    (th->th_scale * delta < ((uint64_t)1 << 63))) {
		/* The product th_scale * delta just barely overflows. */
		th->th_offset.sec++;
	}
	bintime_addx(&th->th_offset, th->th_scale * delta);

In my case there was a single iteration of the while loop and the (delta > 
th->th_counter->tc_frequency / 2) branch was not taken.

-- 
Andriy Gapon