From nobody Tue Nov  9 09:58:30 2021
X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 7FD53185060D
	for <freebsd-current@mlmmj.nyi.freebsd.org>; Tue,  9 Nov 2021 09:58:33 +0000 (UTC)
	(envelope-from avg@FreeBSD.org)
Received: from smtp.freebsd.org (smtp.freebsd.org [IPv6:2610:1c1:1:606c::24b:4])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (4096 bits) client-digest SHA256)
	(Client CN "smtp.freebsd.org", Issuer "R3" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4HpNj53FwQz3HTM;
	Tue,  9 Nov 2021 09:58:33 +0000 (UTC)
	(envelope-from avg@FreeBSD.org)
Received: from [192.168.0.88] (unknown [195.64.148.76])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(Client did not present a certificate)
	(Authenticated sender: avg/mail)
	by smtp.freebsd.org (Postfix) with ESMTPSA id DFF4B23AF9;
	Tue,  9 Nov 2021 09:58:32 +0000 (UTC)
	(envelope-from avg@FreeBSD.org)
From: Andriy Gapon <avg@FreeBSD.org>
To: Konstantin Belousov <kostikbel@gmail.com>,
 FreeBSD Current <freebsd-current@FreeBSD.org>
References: <aff7b1e5-c380-9d86-d638-047e618894e6@FreeBSD.org>
 <20191022104434.GM73312@kib.kiev.ua>
 <3a67f9a9-31cf-5814-4a68-8bdd6063b21e@FreeBSD.org>
 <20191022131633.GN73312@kib.kiev.ua>
 <9c131a2a-cc94-4d93-1ba8-595c0151e366@FreeBSD.org>
Subject: Re: thread on sleepqueue does not wake up after timeout
Message-ID: <32fe4f76-155b-8b99-5782-4daafd4219d5@FreeBSD.org>
Date: Tue, 9 Nov 2021 11:58:30 +0200
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:78.0) Gecko/20100101
 Firefox/78.0 Thunderbird/78.14.0
List-Id: Discussions about the use of FreeBSD-current <freebsd-current.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-current
List-Help: <mailto:freebsd-current+help@freebsd.org>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Subscribe: <mailto:freebsd-current+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-current+unsubscribe@freebsd.org>
Sender: owner-freebsd-current@freebsd.org
MIME-Version: 1.0
In-Reply-To: <9c131a2a-cc94-4d93-1ba8-595c0151e366@FreeBSD.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-ThisMailContainsUnwantedMimeParts: N


Two years have passed since my original report, some things have changed, but we 
are still seeing the problem.

To recap, the problem is that sometimes a callout for a sleepq timeout would 
fire to early.  The code in sleepq_timeout would ignore such a wake-up.  And, as 
there would not be another callout, there will never be a wake-up.

This has been observed on VMWare only.  We have seen it on different versions of 
ESXi (different environments, upgrades, etc) starting with 5.0, through 5.5 and 
6.0, and now on 6.5.

We use 12.2 now and are still seeing the problem occasionally.

On advice from Kostik I added some diagnostic code.
The code checks for the too early condition and panics when it happens.
It also provides some additional information.

Here is an example of such a panic.
Unread portion of the kernel message buffer:
panic: too early sleepq timeout, 190543869738008 > 190543869603412 (190543869798505)
cpuid = 1
time = 1636195325

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=153080096) at /usr/src/sys/kern/kern_shutdown.c:447
#2  0xffffffff8038f1cc in db_fncall_generic (addr=<optimized out>, nargs=0, 
args=<optimized out>, rv=<optimized out>) at /usr/src/sys/ddb/db_command.c:622
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized 
out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:670
#4  0xffffffff8038eaf6 in db_command (last_cmdp=<optimized out>, 
cmd_table=<optimized out>, dopager=0) at /usr/src/sys/ddb/db_command.c:494
#5  0xffffffff80393a68 in db_script_exec (scriptname=<optimized out>, 
warnifnotfound=<optimized out>) at /usr/src/sys/ddb/db_script.c:304
#6  0xffffffff80393892 in db_script_kdbenter (eventname=<optimized out>) at 
/usr/src/sys/ddb/db_script.c:326
#7  0xffffffff80391ac3 in db_trap (type=<optimized out>, code=<optimized out>) 
at /usr/src/sys/ddb/db_main.c:251
#8  0xffffffff807821e2 in kdb_trap (type=3, code=0, tf=0xfffffe01091fd410) at 
/usr/src/sys/kern/subr_kdb.c:700
#9  0xffffffff809d870e in trap (frame=0xfffffe01091fd410) at 
/usr/src/sys/amd64/amd64/trap.c:583
#10 <signal handler called>
#11 kdb_enter (why=0xffffffff80b17af9 "panic", msg=0xffffffff80b17af9 "panic") 
at /usr/src/sys/kern/subr_kdb.c:486
#12 0xffffffff8073ddce in vpanic (fmt=<optimized out>, ap=<optimized out>) at 
/usr/src/sys/kern/kern_shutdown.c:975
#13 0xffffffff8073dc23 in panic (fmt=0xffffffff81178120 <cnputs_mtx+24> "") at 
/usr/src/sys/kern/kern_shutdown.c:909
#14 0xffffffff8078f09c in sleepq_timeout (arg=0xfffff8005b6a1000) at 
/usr/src/sys/kern/subr_sleepqueue.c:1042
#15 0xffffffff8075594a in softclock_call_cc (c=0xfffff8005b6a1470, 
cc=0xffffffff811a0580 <cc_cpu+320>, now=190543869798505, direct=1) at 
/usr/src/sys/kern/kern_timeout.c:768
#16 0xffffffff8075567c in callout_process (now=190543869798505) at 
/usr/src/sys/kern/kern_timeout.c:511
#17 0xffffffff80a3d455 in handleevents (now=190543869798505, fake=0) at 
/usr/src/sys/kern/kern_clocksource.c:213
#18 0xffffffff80a3da46 in timercb (et=0xffffffff81251238 <lapic_et>, 
arg=<optimized out>) at /usr/src/sys/kern/kern_clocksource.c:357
#19 0xffffffff80a680cb in lapic_handle_timer (frame=0xfffffe01091fd7d0) at 
/usr/src/sys/x86/x86/local_apic.c:1339
#20 <signal handler called>
#21 0xffffffff81d90654 in buf_hash_find
...

(kgdb) fr 18
#18 0xffffffff80a3da46 in timercb (et=0xffffffff81251238 <lapic_et>, 
arg=<optimized out>) at /usr/src/sys/kern/kern_clocksource.c:357
357     /usr/src/sys/kern/kern_clocksource.c: No such file or directory.
(kgdb) i loc
state = <optimized out>
next = 0xfffffe00015b7b80
now = 190543869798505
bcast = <error reading variable bcast (Cannot access memory at address 0x0)>
cpu = <optimized out>

(kgdb) fr 15
#15 0xffffffff8075594a in softclock_call_cc (c=0xfffff8005b6a1470, 
cc=0xffffffff811a0580 <cc_cpu+320>, now=190543869798505, direct=1) at 
/usr/src/sys/kern/kern_timeout.c:768
768     /usr/src/sys/kern/kern_timeout.c: No such file or directory.
(kgdb) p *c
$1 = {c_links = {le = {le_next = 0x0, le_prev = 0xfffffe000029f398}, sle = 
{sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0xfffffe000029f398}}, c_time 
= 190543869738008, c_precision = 2684354, c_arg = 0xfffff8005b6a1000,
   c_func = 0xffffffff8078f000 <sleepq_timeout>, c_lock = 0x0, c_flags = 2, 
c_iflags = 272, c_cpu = 1, c_exec_time = 190543869798505, c_lines = {u128 = 
5514281696343066435261632100963648521, u16 = {1033, 755, 1225, 628, 409,
       651, 788, 1062}}}

Here is an explanation for the numbers reproted in the panic message (sorted 
from earliest to latest):
190543869603412 - 'now' as seen in sleepq_timeout(), returned by sbinuptime();
190543869738008 - td_sleeptimo, also c_time in the callout;
190543869798505 - 'now' as captured when the LAPIC timer fired, seen in the 
stack trace and also recorded as c_exec_time in the callout.

Just in case, here is the code that printed the panic message:
                         panic("too early sleepq timeout, %jd > %jd (%jd)", 
 

                             (intmax_t)td->td_sleeptimo, 
 

                             (intmax_t)sbinuptime(), 
 

                             (intmax_t)td->td_slpcallout.c_exec_time);

As can be seen, sbinuptime() in sleepq_timeout() returned a value smaller than 
what sbinuptime() returned in timercb().

It seems that the code in callout_process() was right to fire the callout 
because now > td_sleeptimo there.
But sleepq_timeout() thought that it was premature as sbinuptime() < 
td_sleeptimo there.

In the above case the eventtimer is obviously LAPIC, the timecounter is HPET.
But we also saw the same issue when we changed the timecounter to ACPI-fast.
I think that we haven't tried any other choices because:
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.TSC-low.quality: -100

I see three possibilities of why sbinuptime() could go backwards:
- a bug in FreeBSD time keeping code that gets triggered only under very 
specific conditions that don't normally happen
- a bug in VMWare, e.g. in HPET and ACPI timer emulation
- something exotic, like correct C code miscompiled to incorrect machine code or 
weird memory model violation or etc...

I'd appreciate any suggestions on additional diagnostics to narrow down / rule 
out the possibilities.

P.S.
As a workaround I could modify sleepq_timeout() to get "current time" from 
c_exec_time (added by us) instead of sbinuptime().  c_exec_time is the value of 
'now' in callout_process() when it decides that the callout should fire.
But I'd like to get to the bottom of the issue.

-- 
Andriy Gapon