Date: Fri, 29 Aug 2008 00:26:01 +0930
From: Benjamin Close <Benjamin.Close@clearchain.com>
To: Attilio Rao <attilio@freebsd.org>
Cc: kevinxlinuz <kevinxlinuz@163.com>, freebsd-current@freebsd.org
Subject: Re: [BUG] I think sleepqueue need to be protected in sleepq_broadcast
Message-ID: <48B6BC81.5060300@clearchain.com>
In-Reply-To: <3bbf2fe10808230233u195f3530wf4e3b6e007b638d9@mail.gmail.com>
References: <11617822.2511219426408994.JavaMail.coremail@bj163app64.163.com> <200808230003.44081.jhb@freebsd.org> <3bbf2fe10808230233u195f3530wf4e3b6e007b638d9@mail.gmail.com>
Attilio Rao wrote:
> 2008/8/23, John Baldwin <jhb@freebsd.org>:
>
>> On Friday 22 August 2008 01:33:28 pm kevinxlinuz wrote:
>>
>>> Hi,
>>> I'm looking into the problem (amd64/124200: kernel panic on mutex sleepq
>>> chain). It has troubled me for a long time. I added a KASSERT in
>>> sleepq_broadcast() to check the sleepqueue's wait channel. In the end it
>>> turned out that the sleepqueue's wait channel was changed before
>>> sleepq_resume_thread(). In sleepq_lookup(), we can easily find
>>> sq->sq_wchan == wchan. But after a short time, sq->sq_wchan is no longer
>>> equal to wchan, so I think it was changed by another thread.
>>>
>> The sleepq chain lock is already held for all of sleepq_broadcast() by the
>> caller (see wakeup() and cv_broadcastpri()). That said, I don't have any
>> other good ideas for the panic you are seeing. Do you have a crash dump? It
>> might be interesting to see what other thread is using that sleep queue.
>>
>
> Ben Close and I investigated this bug extensively and still didn't
> find the source.
> Factors we have now:
> 1) The lock, when inspected with DDB, is actually held by another
> thread even though it should be held by curthread. It is as if the
> mutex cookie gets overwritten by the other thread as though the lock
> were free. An extra drop (and subsequent acquire) is not very likely
> because of (2).
> 2) KTR traces don't show anything wrong. Accesses to the sleepqueue
> chain lock are paired (both via the mtx_* interface and via
> thread_lock). This is very strange because it rules out wrong
> locking semantics.
> 3) The problem is reproducible even on 4BSD, without PREEMPTION and
> even with the smp sysctl disabled (it just takes more time).
> 4) The bug seems to be triggered by sx + waitchannel when used in
> sx_sleep() and such.
>
> I'm thinking this can be some nasty, but sort of deterministic, race
> between sleepqueue accesses on the sx sleepqueue and the
> waitchannel sleepqueue.
> I still have to think more about it, but I'm pretty busy right now,
> so if you have good ideas please let me know.
>

The other common factor, though not 100% verified, is that everyone
experiencing the race is running amd64.

Cheers,
Benjamin
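
For anyone following the thread, the two invariants discussed above (the wait channel still matches, and the chain lock is still owned by the waking thread) can be sketched as a small user-space model. This is only an illustration under stated assumptions: struct sleepq_model, struct chain_model, model_check() and model_broadcast() are hypothetical names invented here, the "lock" is a plain owner cookie rather than a real mutex, and none of it is the kernel's actual sleepqueue code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for illustration only; these are not
 * the kernel's struct sleepqueue or struct sleepqueue_chain. */
struct sleepq_model {
    const void *sq_wchan;     /* wait channel the queue is bound to */
    int         sq_sleepers;  /* threads still asleep on the queue */
};

struct chain_model {
    int                  sc_owner;  /* lock cookie: owner thread id, 0 = free */
    struct sleepq_model *sc_queue;  /* a single queue per chain, for brevity */
};

enum check_result { CHECK_OK = 0, CHECK_WCHAN_CHANGED, CHECK_LOCK_STOLEN };

/* The kind of check a KASSERT could make before resuming each thread:
 * the chain lock is still ours and the queue is still bound to wchan. */
static enum check_result
model_check(const struct chain_model *sc, const struct sleepq_model *sq,
    const void *wchan, int self)
{
    if (sc->sc_owner != self)
        return CHECK_LOCK_STOLEN;
    if (sq->sq_wchan != wchan)
        return CHECK_WCHAN_CHANGED;
    return CHECK_OK;
}

/* Wake every sleeper on wchan, re-checking both invariants before each
 * wakeup. *woken receives the number of sleepers removed from the queue. */
static enum check_result
model_broadcast(struct chain_model *sc, const void *wchan, int self,
    int *woken)
{
    struct sleepq_model *sq;
    enum check_result res = CHECK_OK;

    *woken = 0;
    sc->sc_owner = self;            /* stands in for taking the chain lock */
    sq = sc->sc_queue;
    if (sq != NULL && sq->sq_wchan == wchan) {
        while (sq->sq_sleepers > 0) {
            res = model_check(sc, sq, wchan, self);
            if (res != CHECK_OK)
                break;              /* an invariant broke mid-broadcast */
            sq->sq_sleepers--;
            (*woken)++;
        }
    }
    if (sc->sc_owner == self)
        sc->sc_owner = 0;           /* stands in for dropping the lock */
    return res;
}
```

In a single-threaded run model_broadcast() always returns CHECK_OK; the two failure values only show up when the state is corrupted from outside, which matches the symptoms described above: the owner cookie or the wait channel changing while the waker believes it holds the lock.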