From owner-freebsd-current@FreeBSD.ORG  Thu Aug 28 14:56:10 2008
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Message-ID: <48B6BC81.5060300@clearchain.com>
Date: Fri, 29 Aug 2008 00:26:01 +0930
From: Benjamin Close <Benjamin.Close@clearchain.com>
User-Agent: Thunderbird 2.0.0.16 (Windows/20080708)
MIME-Version: 1.0
To: Attilio Rao <attilio@freebsd.org>
References: <11617822.2511219426408994.JavaMail.coremail@bj163app64.163.com>	<200808230003.44081.jhb@freebsd.org>
	<3bbf2fe10808230233u195f3530wf4e3b6e007b638d9@mail.gmail.com>
In-Reply-To: <3bbf2fe10808230233u195f3530wf4e3b6e007b638d9@mail.gmail.com>
X-Enigmail-Version: 0.95.6
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kevinxlinuz <kevinxlinuz@163.com>, freebsd-current@freebsd.org
Subject: Re: [BUG] I think sleepqueue need to be protected in
	sleepq_broadcast
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>

Attilio Rao wrote:
> 2008/8/23, John Baldwin <jhb@freebsd.org>:
>   
>> On Friday 22 August 2008 01:33:28 pm kevinxlinuz wrote:
>>     
>>> Hi,
>>>   I'm looking into the problem (amd64/124200: kernel panic on mutex sleepq
>>> chain). It has troubled me for a long time. I added a KASSERT in
>>> sleepq_broadcast() to check the sleepqueue's wait channel. It turned out
>>> that the sleepqueue's wait channel was changed before
>>> sleepq_resume_thread(). In sleepq_lookup(), we can easily see that
>>> sq->sq_wchan == wchan, but a short time later sq->sq_wchan is no longer
>>> equal to wchan, so I think it was changed by another thread.
>>>       
>> The sleepq chain lock is already held for all of sleepq_broadcast() by the
>> caller (see wakeup() and cv_broadcastpri()).  That said, I don't have any
>> other good ideas for the panic you are seeing.  Do you have a crash dump?  It
>> might be interesting to see what other thread is using that sleep queue.
>>     
>
> Ben Close and I investigated this bug extensively and still haven't
> found the source.
> The facts we have so far:
> 1) When inspected with DDB, the lock is shown as held by another
> thread even though it should be held by curthread. It is as if the
> mutex cookie were overwritten by the other thread as though the lock
> were free. An extra drop (and subsequent acquire) is unlikely because
> of (2).
> 2) KTR traces don't show anything wrong. Accesses to the sleepqueue
> chain lock are paired (both via the mtx_* interface and via
> thread_lock). This is very strange because it rules out wrong locking
> semantics.
> 3) The problem is reproducible even with the 4BSD scheduler, without
> PREEMPTION, and even with SMP disabled via sysctl (it just takes more
> time).
> 4) The bug seems to be triggered by sx + wait channel when used in
> sx_sleep() and the like.
>
> I'm thinking this could be some nasty, but sort of deterministic, race
> between accesses to the sx sleepqueue and the wait channel sleepqueue.
> I still have to think more about it, but I'm pretty busy at the moment,
> so if you have good ideas please let me know.
>   
The other common factor, though not 100% verified, is that everyone 
experiencing the race is running amd64.
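
For anyone wanting to see the shape of the check kevinxlinuz describes,
here is a rough userspace sketch of the invariant, not the kernel code.
Only sq_wchan is taken from the report; the table, the sq_blocked
counter and every model_* function are hypothetical stand-ins, and no
locking is modelled at all.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: a stripped-down stand-in for a sleep queue. */
struct sleepqueue {
	const void *sq_wchan;	/* wait channel the queue is bound to */
	int sq_blocked;		/* number of sleepers (simplified) */
};

#define MODEL_TABLESIZE 4	/* hypothetical, tiny table for the sketch */

static struct sleepqueue table[MODEL_TABLESIZE];

/* Find the active queue bound to wchan, or NULL if there is none. */
static struct sleepqueue *
model_lookup(const void *wchan)
{
	for (size_t i = 0; i < MODEL_TABLESIZE; i++)
		if (table[i].sq_blocked > 0 && table[i].sq_wchan == wchan)
			return (&table[i]);
	return (NULL);
}

/* The invariant under test: 1 if sq is still bound to wchan, else 0. */
static int
model_wchan_intact(const struct sleepqueue *sq, const void *wchan)
{
	return (sq != NULL && sq->sq_wchan == wchan);
}

/*
 * Wake every sleeper on wchan, re-checking the invariant before each
 * wakeup. Returns the number woken, or -1 if the wait channel changed
 * underneath us (the situation the KASSERT in the report caught).
 */
static int
model_broadcast(const void *wchan)
{
	struct sleepqueue *sq = model_lookup(wchan);
	int woken = 0;

	if (sq == NULL)
		return (0);
	while (sq->sq_blocked > 0) {
		if (!model_wchan_intact(sq, wchan))
			return (-1);
		sq->sq_blocked--;
		woken++;
	}
	sq->sq_wchan = NULL;	/* queue is idle again */
	return (woken);
}
```

In this single-threaded model the check can only fail if something
rewrites sq_wchan between the lookup and the wakeup, which is the
symptom being reported.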

Cheers,
    Benjamin