Date:      Fri, 29 Aug 2008 00:26:01 +0930
From:      Benjamin Close <Benjamin.Close@clearchain.com>
To:        Attilio Rao <attilio@freebsd.org>
Cc:        kevinxlinuz <kevinxlinuz@163.com>, freebsd-current@freebsd.org
Subject:   Re: [BUG] I think sleepqueue need to be protected in sleepq_broadcast
Message-ID:  <48B6BC81.5060300@clearchain.com>
In-Reply-To: <3bbf2fe10808230233u195f3530wf4e3b6e007b638d9@mail.gmail.com>
References:  <11617822.2511219426408994.JavaMail.coremail@bj163app64.163.com>	<200808230003.44081.jhb@freebsd.org> <3bbf2fe10808230233u195f3530wf4e3b6e007b638d9@mail.gmail.com>

Attilio Rao wrote:
> 2008/8/23, John Baldwin <jhb@freebsd.org>:
>   
>> On Friday 22 August 2008 01:33:28 pm kevinxlinuz wrote:
>>     
>>> Hi,
>>>   I'm looking into the problem (amd64/124200: kernel panic on mutex sleepq
>>> chain). It has troubled me for a long time. I added a KASSERT in
>>> sleepq_broadcast() to check the sleepqueue's wait channel. It turns out that
>>> the sleepqueue's wait channel was changed before sleepq_resume_thread(). In
>>> sleepq_lookup(), we can easily find sq->sq_wchan == wchan. But after a short
>>> time, sq->sq_wchan is no longer equal to wchan, so I think it was changed
>>> by other threads.
>>>       
>> The sleepq chain lock is already held for all of sleepq_broadcast() by the
>> caller (see wakeup() and cv_broadcastpri()).  That said, I don't have any
>> other good ideas for the panic you are seeing.  Do you have a crash dump?  It
>> might be interesting to see what other thread is using that sleep queue.
>>     
>
> Ben Close and I investigated this bug extensively and still haven't
> found the source.
> Factors we have so far:
> 1) The lock, when inspected with DDB, is actually held by another
> thread even though it should be held by curthread. It is as if the
> mutex cookie gets overwritten by the other thread as though the lock
> were free. An extra drop (and subsequent acquire) is not very likely
> because of (2).
> 2) KTR traces don't show anything wrong. Accesses to the sleepqueue
> chain lock are paired (both via the mtx_* interface and via
> thread_lock). This is very strange because it rules out incorrect
> locking semantics.
> 3) The problem is reproducible even on 4BSD, without PREEMPTION and
> even with the smp sysctl disabled (it just takes more time).
> 4) The bug seems to be triggered by sx + waitchannel when used in
> sx_sleep() and the like.
>
> I'm thinking this could be some nasty, but sort of deterministic, race
> between accesses to the sx sleepqueue and the waitchannel sleepqueue.
> I still have to think it through, but right now I'm pretty busy,
> so if you have good ideas please let me know.
>   
The other common factor, though not 100% verified, is that everyone
experiencing the race is running amd64.

Cheers,
    Benjamin



