Date:      Tue, 8 Feb 2005 14:40:49 -0500
From:      John Baldwin <jhb@FreeBSD.org>
To:        Peter Holm <peter@holm.cc>
Cc:        freebsd-current@FreeBSD.org
Subject:   Re: Assertion td->td_sleepqueue != NULL failed at kern/subr_sleepqueue.c:270
Message-ID:  <200502081440.49979.jhb@FreeBSD.org>
In-Reply-To: <20050106214554.GA45533@peter.osted.lan>
References:  <20050105122636.GA31684@peter.osted.lan> <200501061617.49967.jhb@FreeBSD.org> <20050106214554.GA45533@peter.osted.lan>

On Thursday 06 January 2005 04:45 pm, Peter Holm wrote:
> On Thu, Jan 06, 2005 at 04:17:49PM -0500, John Baldwin wrote:
> > On Wednesday 05 January 2005 07:26 am, Peter Holm wrote:
> > > With GENERIC HEAD from Dec 31 09:28 UTC + bmilekic@'s uma_core
> > > patch + alc's patch I got the following strange assert:
> > >
> > > panic(c0827c46,c082dd18,c082dc8d,10e,c08f4660) at panic+0x190
> > > sleepq_add(c08eec90,c08ee6e8,c082a9bf,1,c08ee6e8,0,c0827ca9,7d)
> > >    at sleepq_add+0x156
> > > cv_wait(c08eec90,c08ee6e8,c151de30,0,ffffffff) at cv_wait+0x100
> > > _sx_xlock(c08eec60,c0828867,247,0,c151ddc8) at _sx_xlock+0x59
> > > kern_wait(c151e450,ffffffff,cbc67c90,0,0) at kern_wait+0x4b
> > > wait4(c151e450,cbc67d14,4,3f8,282) at wait4+0x29
> > > syscall(2f,2f,bfbf002f,2,0) at syscall+0x128
> > > Xint0x80_syscall() at Xint0x80_syscall+0x1f
> > > --- syscall (7, FreeBSD ELF32, wait4), eip = 0x805170b, esp =
> > > 0xbfbfedbc, ebp = 0xbfbfedd8 ---
> > >
> > > Looks like td->td_sleepqueue is NULL!
> > >
> > > Details at http://www.holm.cc/stress/log/cons100.html
> >
> > This is a truly odd panic.  The basic theory of operation with sleep
> > queues is that every thread that is not already queued on a sleep queue
> > carries a sleep queue structure around that it donates to a wait channel
> > when it blocks on it.  Once it is resumed, it reclaims a sleep queue
> > from the wait channel.  This resuming bit happens in sleepq_remove_thread()
> > in subr_sleepqueue.c.  As you can see, in addition to assigning a
> > sleepqueue to the thread being removed from a queue, it also clears
> > td_wchan and td_wmesg.  The thread in question has both fields set (as if
> > it were asleep on "proctree", which is what it is trying to go back to sleep
> > on now).  However, it is not on a sleep queue (td_slpq.tqe_next is NULL).
> > So it appears that a thread was removed from the sleep queue and resumed
> > (made runnable), but sleepq_remove_thread() wasn't called.  Do you have
> > any local patches that might affect this, btw?  I notice you get a lot of
> > trap 9's in your dmesg, which is somewhat unsettling.
>
> These are the modifications:
> http://www.holm.cc/stress/log/mods.html
>
> The trap 9s are not uncommon for the test suite.

I'm still thinking about this, FYI, as I've seen it at least once or twice,
but I still don't understand how it is happening.  In the other case I've
looked at, it is as if the thread has been awakened by someone outside of the 
sleep queue code because td_wchan and td_wmesg are still set 
(sleepq_remove_thread() clears them) and the associated wait channel 
(proctree, which is another common theme) has a sleep queue with no waiters 
attached to it.  That is, the sleep queue that curthread should have is still 
sitting on a sleep queue chain, which is consistent with the thread being 
made runnable without going through sleepq_remove_thread().  Are you able to 
reproduce this at all?  If so, can you do it with KTR enabled and KTR_PROC 
tracing turned on perhaps?  Thanks.

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org


