Date:      Tue, 22 Oct 2019 16:16:33 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Andriy Gapon <avg@FreeBSD.org>
Cc:        FreeBSD Current <freebsd-current@FreeBSD.org>
Subject:   Re: thread on sleepqueue does not wake up after timeout
Message-ID:  <20191022131633.GN73312@kib.kiev.ua>
In-Reply-To: <3a67f9a9-31cf-5814-4a68-8bdd6063b21e@FreeBSD.org>
References:  <aff7b1e5-c380-9d86-d638-047e618894e6@FreeBSD.org> <20191022104434.GM73312@kib.kiev.ua> <3a67f9a9-31cf-5814-4a68-8bdd6063b21e@FreeBSD.org>

On Tue, Oct 22, 2019 at 02:48:56PM +0300, Andriy Gapon wrote:
> On 22/10/2019 13:44, Konstantin Belousov wrote:
> > On Tue, Oct 22, 2019 at 01:08:59PM +0300, Andriy Gapon wrote:
> >>
> >> We observe a problem that happens very rarely (about once a month across many
> >> test machines).  The problem is that a thread remains in sleepq_timedwait() even
> >> after its timeout expires.  The thread's td_slpcallout looks like the callout
> >> has fired.  But the thread's state looks like it was never notified.
> >> E.g.:
> >> (kgdb) p td->td_slpcallout
> >> $1 = {c_links = {le = {le_next = 0xfffff800108e6470, le_prev =
> >> 0xfffffe0000be6ea8}, sle = {sle_next = 0xfffff800108e6470}, tqe = {tqe_next =
> >> 0xfffff800108e6470, tqe_prev = 0xfffffe0000be6ea8}}, c_time = 160957479343159,
> >>   c_precision = 268435450, c_arg = 0xfffff80184602000, c_func =
> >> 0xffffffff807481d0 <sleepq_timeout>, c_lock = 0x0, c_flags = 2, c_iflags = 272,
> >> c_cpu = 6, c_exec_time = 160957506517070} [*]
> >> (kgdb) p/x td->td_flags
> >> $5 = 0x80000004
> > What is bit 31 in your flags?  FreeBSD does not use that bit.
> 
> It's TDF_NOSWAP, a local addition.
> We use it to prohibit full process swapout (I guess that means kernel stacks).
> 
> >> (kgdb) p td->td_sqqueue
> >> $8 = 0
> >> (kgdb) p td->td_sleepqueue
> >> $9 = (struct sleepqueue *) 0x0
> >> (kgdb) p td->td_wchan
> >> $10 = (void *) 0xfffff802b990df38
> >>
> >>
> >> Has anyone seen anything like this problem?
> > Yes, but it was a very long time ago.  See r303426.
> 
> Yeah, we are based off r329000 plus a bunch of merges for various fixes.
> One thing I forgot to mention is that it seems to happen only on VMware guests,
> but maybe it's only because we have many more virtual test boxes than we have
> physical ones.
> One thing I suspected was that binuptime() could somehow jump backwards...
Do you use suspend or migration?

Perhaps record sbinuptime() in struct thread in sleepq_timeout() and
keep the original value of td_sleeptimo around to see what happened.
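
A minimal user-space sketch of that instrumentation, not a kernel patch.
It assumes a trimmed-down stand-in for struct thread; the td_dbg_* fields
and the model_* functions are hypothetical names invented for illustration,
and the "deadline not yet reached" check is a simplified assumption about
what the handler does.  Only td_sleeptimo, sleepq_timeout() and
sbinuptime() come from the discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's sbintime_t (a signed 64-bit time value). */
typedef int64_t sbintime_t;

/* Hypothetical, trimmed-down model of struct thread. */
struct thread_model {
	sbintime_t td_sleeptimo;          /* deadline; 0 means "no timeout wanted" */
	sbintime_t td_dbg_sleeptimo_orig; /* hypothetical: deadline as originally armed */
	sbintime_t td_dbg_timeout_now;    /* hypothetical: time seen by the handler */
	bool       td_dbg_timeout_seen;   /* hypothetical: handler ran at least once */
};

enum timeout_verdict {
	TIMEOUT_NOT_SEEN,             /* handler never ran for this sleep */
	TIMEOUT_BEFORE_DEADLINE,      /* handler saw a time earlier than the deadline */
	TIMEOUT_AT_OR_AFTER_DEADLINE  /* handler saw a time at or past the deadline */
};

/* Arm a timeout and keep a copy of the deadline that nothing else clears. */
void
model_set_timeout(struct thread_model *td, sbintime_t deadline)
{
	td->td_sleeptimo = deadline;
	td->td_dbg_sleeptimo_orig = deadline;
	td->td_dbg_timeout_now = 0;
	td->td_dbg_timeout_seen = false;
}

/*
 * Model of the timeout handler.  'now' stands in for sbinuptime().
 * Returns true if the thread would be woken.  Assumption: the handler
 * ignores the callout when the deadline is unset or still in the future.
 */
bool
model_sleepq_timeout(struct thread_model *td, sbintime_t now)
{
	/* Record what the handler observed before any decision is made. */
	td->td_dbg_timeout_now = now;
	td->td_dbg_timeout_seen = true;

	if (td->td_sleeptimo == 0 || td->td_sleeptimo > now)
		return (false);
	td->td_sleeptimo = 0;
	return (true);
}

/* Post-mortem classification from the recorded values alone. */
enum timeout_verdict
model_classify(const struct thread_model *td)
{
	if (!td->td_dbg_timeout_seen)
		return (TIMEOUT_NOT_SEEN);
	if (td->td_dbg_timeout_now < td->td_dbg_sleeptimo_orig)
		return (TIMEOUT_BEFORE_DEADLINE);
	return (TIMEOUT_AT_OR_AFTER_DEADLINE);
}
```

With fields like these in a crash dump, comparing the recorded handler time
against the saved deadline would show whether the handler observed a time
earlier than the deadline (for example, if the clock appeared to step
backwards) or whether the wakeup was lost after a valid timeout.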

> 
> >> Any advice on how to diagnose it?
> >>
> >> Thanks!
> >>
> >> P.S.
> >> c_exec_time is our addition, we set this field right before firing a callback
> >> and we reset it to zero when a callout is (re-)scheduled.
> 
> 
> -- 
> Andriy Gapon


