From owner-freebsd-stable@FreeBSD.ORG Wed Jun 18 16:03:17 2008
From: John Baldwin
To: Aristedes Maniatis
Cc: bzeeb+freebsd+lor@zabbadoz.net, jeff@freebsd.org, Jurgen Weber,
    freebsd-stable@freebsd.org, davidxu@freebsd.org
Date: Wed, 18 Jun 2008 11:16:31 -0400
Subject: Re: LOR on sleepqueue chain locks, Was: LOR sleepq/scrlock
Message-Id: <200806181116.32450.jhb@freebsd.org>
In-Reply-To: <0DB3A235-DF87-4413-90ED-E38BC44CA2B3@ish.com.au>
References: <77E81AD6-FBCC-4D30-A5CB-A9B918D4793F@ish.com.au>
    <200804221334.35001.jhb@freebsd.org>
    <0DB3A235-DF87-4413-90ED-E38BC44CA2B3@ish.com.au>
List-Id: Production branch of FreeBSD source code

On Friday 09 May 2008 10:53:15 pm Aristedes Maniatis wrote:
>
> On 23/04/2008, at 3:34 AM, John Baldwin wrote:
>
> >>> The real problem at the bottom of the screen though is a real
> >>> issue. It's a LOR of two different sleepqueue chain locks. The
> >>> problem is that when setrunnable() encounters a swapped out thread
> >>> it tries to wakeup proc0, but if proc0 is asleep (which is typical)
> >>> then its thread lock is a sleep queue chain lock, so waking up a
> >>> swapped out thread from wakeup() will usually trigger this LOR.
> >>>
> >>> I think the best fix is to not have setrunnable() kick proc0
> >>> directly. Perhaps setrunnable() should return an int and return
> >>> true if proc0 needs to be awakened and false otherwise. Then the
> >>> sleepq code (b/c only sleeping threads can be swapped out anyway)
> >>> can return that value from sleepq_resume_thread() and can call
> >>> kick_proc0() directly once it has dropped all of its own locks.
> >>>
> >>> --
> >>> John Baldwin
> >>
> >> The way you describe it, it almost sounds like this LOR should be
> >> happening for everyone, all the time. To try and eliminate the
> >> factors which trigger it for us, we tried the following: removed PAE
> >> from kernel, disabled PF. Neither of these things made any
> >> difference and the error is fairly quickly reproducible (within a
> >> couple of hours running various things to load the machine). The one
> >> thing we did not test yet is removing ZFS from the picture. Note
> >> also that this box ran for years and years on FreeBSD 4.x without a
> >> hiccup (non PAE, ipfw instead of pf and no ZFS of course).
> >
> > There are two things. 1) Most people who run witness (that I know
> > of) don't run it on spinlocks because of the overhead, so LORs of
> > spin locks are less well-reported than LORs of other locks (mutexes,
> > rwlocks, etc.). 2) You have to have enough load on the box to swap
> > out active processes to get into this situation. Between those I
> > think that is why this is not more widely reported.
>
>
> Hi John,
>
> Thanks for your efforts so far to track this LOR down. I've been
> keeping an eye on cvs logs, but haven't seen anything which looks like
> a patch for this.
>
> * is this still outstanding?
> * or will it be addressed soon?
> * if not, should I create a PR so that it doesn't get forgotten?
> * in our case, although we can trigger it quickly with some load, the
>   problem occurs (and causes a complete machine lock) even under < 10%
>   load. Not sure if the combination of PAE/ZFS/SCHED ULE exacerbates
>   that in any way compared to a 'standard' build.

Try http://www.FreeBSD.org/~jhb/patches/sleepq.patch

-- 
John Baldwin
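
For readers following the thread, the fix proposed in the quoted message (have setrunnable() report that proc0 needs waking, and call kick_proc0() only after the sleepqueue code has dropped its own locks) can be sketched in userland C. This is only an illustration under stated assumptions, not the contents of sleepq.patch: every toy_* name, the counters, and the boolean "lock" are hypothetical stand-ins, whereas the real kernel uses spin mutexes and struct thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sleepqueue chain lock (not struct mtx). */
struct toy_lock { bool held; };

int locks_held;      /* number of toy locks held right now */
int max_locks_held;  /* deepest lock nesting seen so far */
int proc0_kicks;     /* how many times the toy swapper was kicked */

struct toy_lock proc0_chain;  /* chain lock that the sleeping proc0 uses */

void toy_lock_acquire(struct toy_lock *l)
{
    assert(!l->held);
    l->held = true;
    if (++locks_held > max_locks_held)
        max_locks_held = locks_held;
}

void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
    locks_held--;
}

/* Hypothetical, heavily simplified thread state. */
struct toy_thread { bool swapped_out; bool runnable; };

/* Waking proc0 takes proc0's own chain lock, which is why doing it
 * while another chain lock is held creates a chain-vs-chain order. */
void toy_kick_proc0(void)
{
    toy_lock_acquire(&proc0_chain);
    proc0_kicks++;
    toy_lock_release(&proc0_chain);
}

/* Returns nonzero when the caller must kick proc0 later, instead of
 * kicking it here while the caller's chain lock is still held.
 * Simplification: a swapped-out thread is just left not runnable. */
int toy_setrunnable(struct toy_thread *td)
{
    if (td->swapped_out)
        return 1;
    td->runnable = true;
    return 0;
}

/* Wakeup path: remember the answer, drop our lock, then kick. */
void toy_wakeup(struct toy_lock *chain, struct toy_thread *td)
{
    int kick;

    toy_lock_acquire(chain);
    kick = toy_setrunnable(td);
    toy_lock_release(chain);

    if (kick)
        toy_kick_proc0();  /* no chain lock is held at this point */
}
```

With this shape, max_locks_held never exceeds 1 on the wakeup path, so two chain locks are never held at once and there is no order between them to reverse.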