Message-ID: <48B6BC81.5060300@clearchain.com>
Date: Fri, 29 Aug 2008 00:26:01 +0930
From: Benjamin Close
To: Attilio Rao
References: <11617822.2511219426408994.JavaMail.coremail@bj163app64.163.com> <200808230003.44081.jhb@freebsd.org> <3bbf2fe10808230233u195f3530wf4e3b6e007b638d9@mail.gmail.com>
In-Reply-To: <3bbf2fe10808230233u195f3530wf4e3b6e007b638d9@mail.gmail.com>
Cc: kevinxlinuz, freebsd-current@freebsd.org
Subject: Re: [BUG] I think sleepqueue need to be protected in sleepq_broadcast
List-Id: Discussions about the use of FreeBSD-current

Attilio Rao wrote:
> 2008/8/23, John Baldwin:
>> On Friday 22 August 2008 01:33:28 pm kevinxlinuz wrote:
>>> Hi,
>>> I'm looking into the problem (amd64/124200: kernel panic on mutex sleepq
>>> chain). It has troubled me for a long time. I added a KASSERT in
>>> sleepq_broadcast() to check the sleepqueue's wait channel. It turned out
>>> that the sleepqueue's wait channel was changed before
>>> sleepq_resume_thread(). In sleepq_lookup() we can easily see that
>>> sq->sq_wchan == wchan, but a short time later sq->sq_wchan is no longer
>>> equal to wchan, so I think it was changed by another thread.
>>>
>> The sleepq chain lock is already held for all of sleepq_broadcast() by the
>> caller (see wakeup() and cv_broadcastpri()). That said, I don't have any
>> other good ideas for the panic you are seeing. Do you have a crash dump? It
>> might be interesting to see what other thread is using that sleep queue.
>>
>
> Ben Close and I investigated this bug extensively and still haven't
> found the source.
> Factors we have now:
> 1) The lock, when inspected with DDB, is in fact locked by another
> thread even though it should be held by curthread. It is as if the
> mutex cookie gets overwritten by the other thread as though the lock
> were free. An extra drop (and subsequent acquire) is not very likely
> because of (2).
> 2) KTR traces don't show anything wrong. Accesses to the sleepqueue
> chain lock are paired (both via the mtx_* interface and thread_lock
> respectively). This is very strange because it rules out wrong locking
> semantics.
> 3) The problem is reproducible even on 4BSD, without PREEMPTION and
> even with the smp sysctl disabled (it just takes more time).
> 4) The bug seems to be triggered by sx + waitchannel when used in
> sx_sleep() and such.
>
> I'm thinking this can be some nasty, but sort of deterministic, race
> between sleepqueue accesses by the sx sleepqueue and the
> waitchannel sleepqueue.
> I still have to think more about it, but right now I'm pretty busy,
> so if you have good ideas please let me know.
>

The other common factor, though not 100% verified, is that everyone
experiencing the race is running amd64.

Cheers,
Benjamin
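
For anyone following the thread, here is a rough userland sketch of the
invariant being discussed: the caller takes the chain lock, looks up the
queue for a wait channel, and the queue's channel is expected to stay equal
to that wait channel for the whole broadcast. This is not the kernel code.
Everything prefixed model_ is a hypothetical stand-in, the spin flag only
imitates the chain mutex, and SC_TABLESIZE and the hash are arbitrary
assumptions. The assert inside the loop plays the role of the KASSERT
described in the quoted report.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SC_TABLESIZE 8                  /* assumed chain count, arbitrary */

struct sleepqueue {
    const void *sq_wchan;               /* wait channel this queue serves */
    int sq_blocked;                     /* sleepers queued (model only) */
    struct sleepqueue *sq_next;         /* next queue on the same chain */
};

struct sleepqueue_chain {
    atomic_flag sc_lock;                /* stand-in for the chain mutex */
    int sc_held;                        /* crude "lock is owned" marker */
    struct sleepqueue *sc_queues;
};

static struct sleepqueue_chain chains[SC_TABLESIZE];

static void model_init(void)
{
    for (size_t i = 0; i < SC_TABLESIZE; i++) {
        atomic_flag_clear(&chains[i].sc_lock);
        chains[i].sc_held = 0;
        chains[i].sc_queues = NULL;
    }
}

static struct sleepqueue_chain *sc_lookup(const void *wchan)
{
    return &chains[((uintptr_t)wchan >> 4) % SC_TABLESIZE];
}

static void model_sleepq_lock(const void *wchan)
{
    struct sleepqueue_chain *sc = sc_lookup(wchan);
    while (atomic_flag_test_and_set(&sc->sc_lock))
        ;                               /* spin until the chain is ours */
    sc->sc_held = 1;
}

static void model_sleepq_release(const void *wchan)
{
    struct sleepqueue_chain *sc = sc_lookup(wchan);
    sc->sc_held = 0;
    atomic_flag_clear(&sc->sc_lock);
}

/* Caller must hold the chain lock. Returns NULL if nobody sleeps on wchan. */
static struct sleepqueue *model_sleepq_lookup(const void *wchan)
{
    struct sleepqueue_chain *sc = sc_lookup(wchan);
    assert(sc->sc_held);
    for (struct sleepqueue *sq = sc->sc_queues; sq != NULL; sq = sq->sq_next)
        if (sq->sq_wchan == wchan)
            return sq;
    return NULL;
}

/* Caller must hold the chain lock. 'spare' is linked only if no queue exists. */
static void model_sleepq_add(struct sleepqueue *spare, const void *wchan)
{
    struct sleepqueue_chain *sc = sc_lookup(wchan);
    struct sleepqueue *sq = model_sleepq_lookup(wchan);
    if (sq == NULL) {
        sq = spare;
        sq->sq_wchan = wchan;
        sq->sq_blocked = 0;
        sq->sq_next = sc->sc_queues;
        sc->sc_queues = sq;
    }
    sq->sq_blocked++;
}

/* Caller must hold the chain lock. Returns the number of sleepers woken. */
static int model_sleepq_broadcast(const void *wchan)
{
    struct sleepqueue_chain *sc = sc_lookup(wchan);
    struct sleepqueue *sq = model_sleepq_lookup(wchan);
    int woken = 0;
    if (sq == NULL)
        return 0;
    while (sq->sq_blocked > 0) {
        assert(sq->sq_wchan == wchan);  /* channel must not change under us */
        sq->sq_blocked--;
        woken++;
    }
    struct sleepqueue **pp = &sc->sc_queues;
    while (*pp != sq)                   /* unlink the now-empty queue */
        pp = &(*pp)->sq_next;
    *pp = sq->sq_next;
    sq->sq_wchan = NULL;
    return woken;
}

/* Lock is held across the whole broadcast, as the quoted reply describes. */
static int model_wakeup(const void *wchan)
{
    model_sleepq_lock(wchan);
    int woken = model_sleepq_broadcast(wchan);
    model_sleepq_release(wchan);
    return woken;
}
```

In this model the assertion can only fire if something rewrites sq_wchan
while the chain is marked as held, which is the situation the DDB
observation in factor (1) seems to point at. It says nothing about how that
happens in the real kernel.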