From owner-freebsd-stable@FreeBSD.ORG  Wed Jun 18 16:03:17 2008
From: John Baldwin <jhb@freebsd.org>
To: Aristedes Maniatis <ari@ish.com.au>
Date: Wed, 18 Jun 2008 11:16:31 -0400
User-Agent: KMail/1.9.7
References: <77E81AD6-FBCC-4D30-A5CB-A9B918D4793F@ish.com.au>
	<200804221334.35001.jhb@freebsd.org>
	<0DB3A235-DF87-4413-90ED-E38BC44CA2B3@ish.com.au>
In-Reply-To: <0DB3A235-DF87-4413-90ED-E38BC44CA2B3@ish.com.au>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200806181116.32450.jhb@freebsd.org>
Cc: bzeeb+freebsd+lor@zabbadoz.net, jeff@freebsd.org,
	Jurgen Weber <jurgen@ish.com.au>, freebsd-stable@freebsd.org,
	davidxu@freebsd.org
Subject: Re: LOR on sleepqueue chain locks, Was: LOR sleepq/scrlock
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>

On Friday 09 May 2008 10:53:15 pm Aristedes Maniatis wrote:
> 
> On 23/04/2008, at 3:34 AM, John Baldwin wrote:
> 
> >>> The real problem at the bottom of the screen though is a real issue.
> >>> It's a LOR of two different sleepqueue chain locks.  The problem is
> >>> that when setrunnable() encounters a swapped out thread it tries to
> >>> wake up proc0, but if proc0 is asleep (which is typical) then its
> >>> thread lock is a sleep queue chain lock, so waking up a swapped out
> >>> thread from wakeup() will usually trigger this LOR.
> >>>
> >>> I think the best fix is to not have setrunnable() kick proc0 directly.
> >>> Perhaps setrunnable() should return an int and return true if proc0
> >>> needs to be awakened and false otherwise.  Then the sleepq code (b/c
> >>> only sleeping threads can be swapped out anyway) can return that value
> >>> from sleepq_resume_thread() and can call kick_proc0() directly once it
> >>> has dropped all of its own locks.
> >>>
> >>> -- 
> >>> John Baldwin
> >>
> >> The way you describe it, it almost sounds like this LOR should be
> >> happening for everyone, all the time. To try and eliminate the factors
> >> which trigger it for us, we tried the following: removed PAE from
> >> kernel, disabled PF. Neither of these things made any difference and
> >> the error is fairly quickly reproducible (within a couple of hours
> >> running various things to load the machine). The one thing we did not
> >> test yet is removing ZFS from the picture. Note also that this box ran
> >> for years and years on FreeBSD 4.x without a hiccup (non PAE, ipfw
> >> instead of pf and no ZFS of course).
> >
> > There are two things.  1) Most people who run witness (that I know of)
> > don't run it on spinlocks because of the overhead, so LORs of spin locks
> > are less well-reported than LORs of other locks (mutexes, rwlocks,
> > etc.).  2) You have to have enough load on the box to swap out active
> > processes to get into this situation.  Between those I think that is
> > why this is not more widely reported.
> 
> 
> Hi John,
> 
> Thanks for your efforts so far to track this LOR down. I've been
> keeping an eye on cvs logs, but haven't seen anything which looks like
> a patch for this.
> 
> * is this still outstanding?
> * or will it be addressed soon?
> * if not, should I create a PR so that it doesn't get forgotten?
> * in our case, although we can trigger it quickly with some load, the
>   problem occurs (and causes a complete machine lock) even under < 10%
>   load. Not sure if the combination of PAE/ZFS/SCHED ULE exacerbates
>   that in any way compared to a 'standard' build.

Try http://www.FreeBSD.org/~jhb/patches/sleepq.patch
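
For anyone following the thread, the idea from the quoted April message
(setrunnable() reports whether proc0 needs a kick, and the sleepq code
does the kick only after it has dropped its own locks) can be modeled
roughly as below.  This is only a single-threaded userland sketch, not
kernel code and not necessarily what sleepq.patch contains.  Everything
with a _model suffix, the thread_model struct, and the two counters are
hypothetical names made up for illustration; the chain lock is modeled
as a plain counter rather than a real spin lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a thread; not the kernel's struct thread. */
struct thread_model {
	bool swapped_out;	/* the thread's process is swapped out */
	bool runnable;		/* set once the thread has been resumed */
};

static int chain_locks_held;	/* models "a sleepqueue chain lock is held" */
static int proc0_kicks;		/* how many times proc0 was kicked */

/*
 * Model of the proposed setrunnable(): it no longer wakes proc0 itself.
 * It returns nonzero if the caller must wake proc0 later.
 */
static int
setrunnable_model(struct thread_model *td)
{
	td->runnable = true;
	return (td->swapped_out ? 1 : 0);
}

/* Model of sleepq_resume_thread(): runs with the chain lock held. */
static int
sleepq_resume_thread_model(struct thread_model *td)
{
	assert(chain_locks_held > 0);
	return (setrunnable_model(td));
}

/*
 * Model of kick_proc0(): waking proc0 may take another sleepqueue chain
 * lock, so it must never run while one is already held.
 */
static void
kick_proc0_model(void)
{
	assert(chain_locks_held == 0);
	proc0_kicks++;
}

/* Model of wakeup(): resume under the lock, kick proc0 after unlocking. */
static int
wakeup_model(struct thread_model *td)
{
	int wakeup_swapper;

	chain_locks_held++;			/* "lock" the chain */
	wakeup_swapper = sleepq_resume_thread_model(td);
	chain_locks_held--;			/* "unlock" the chain */

	if (wakeup_swapper)
		kick_proc0_model();		/* safe: no chain lock held */
	return (wakeup_swapper);
}
```

Calling wakeup_model() on a thread_model with swapped_out set kicks
proc0 exactly once, and only after the chain lock count is back to
zero; for a resident thread no kick happens at all.  The point of the
ordering is that two chain locks are never held at the same time, which
is what witness was complaining about.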

-- 
John Baldwin