From owner-freebsd-hackers@FreeBSD.ORG  Tue Jun 24 21:53:51 2008
From: John Baldwin <jhb@freebsd.org>
To: James Gritton <jamie@gritton.org>
Date: Tue, 24 Jun 2008 15:55:48 -0400
User-Agent: KMail/1.9.7
References: <20080615112318.146C1F18512@mx.npubs.com>
	<200806231451.52340.jhb@freebsd.org> <485FF698.103@gritton.org>
In-Reply-To: <485FF698.103@gritton.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-15"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200806241555.48280.jhb@freebsd.org>
Cc: freebsd-hackers@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output

On Monday 23 June 2008 03:16:40 pm James Gritton wrote:
> John Baldwin wrote:
> > On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
> >> John Baldwin wrote:
> >>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
> >>>> I've been trying to track down a deadlock on some newish production
> >>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
> >>>> specific (although mundane) hardware configuration, and each of several
> >>>> servers running this hardware deadlock about once per week.
> >>>>
> >>>> From a (naive) perusal of the attached stack traces, I suspect that
> >>>> this is not hardware related.
> >>>>
> >>>> Forgive me if my interpretation of this is all wrong, but I'm pretty
> >>>> desperate for help. So here's my basic understanding of the deadlock:
> >>>>
> >>>> These processes seem to be waiting on the page queue mutex:
> >>>>  sendmail (in vm_mmap > vm_map_find > vm_map_insert > vm_map_pmap_enter)
> >>>>  bsnmpd (in malloc, uma_large_malloc > page_alloc > kmem_malloc)
> >>>>  httpd (in trap > trap_pfault > vm_fault)
> >>>>  [g_up] (in g_vfs_done > bufdone)
> >>>>
> >>>> The page queue mutex is held by rsync process:
> >>>>  rsync (in trap > trap_pfault > vm_fault > pmap_enter)
> >>>>
> >>>> Rsync kernel process (in pmap_enter) was interrupted while holding the
> >>>> page queue lock?
> >>>>
> >>>>
> >>>> Giant is enabled in loader.conf due to the needs of the pf firewall when
> >>>> dealing with user credentials lookups. I do not believe that Giant plays
> >>>> into this deadlock. Kernel config attached.
> >>>>
> >>>> Any and all help or info is welcome. Thanks in advance.
> >>>>     
> >>>>         
> >>> Try this change:
> >>>
> >>> jhb         2007-10-27 22:07:40 UTC
> >>>
> >>>   FreeBSD src repository
> >>>
> >>>   Modified files:
> >>>     sys/kern             sched_4bsd.c
> >>>   Log:
> >>>   Change the roundrobin implementation in the 4BSD scheduler to trigger a
> >>>   userland preemption directly from hardclock() via sched_clock() when a
> >>>   thread uses up a full quantum instead of using a periodic timeout to cause
> >>>   a userland preemption every so often.  This fixes a potential deadlock
> >>>   when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
> >>>   by a thread pinned or bound to another CPU.  The current thread on that
> >>>   CPU will never be preempted while softclock is blocked.
> >>>
> >>>   Note that ULE already drives its round-robin userland preemption from
> >>>   sched_clock() as well and always enables IPI_PREEMPT.
> >>>
> >>>   MFC after:      1 week
> >>>
> >>>   Revision  Changes    Path
> >>>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
> >>>
> >>> We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD
> >>> when softclock() (swi4: clock) blocks on a lock like Giant.
> >>>   
> >>>       
> >> I've been seeing similar troubles on 6.2 and I'll have to give this a 
> >> try as we upgrade to 6.3.  I notice "MFC after: 1 week" in the log; it's 
> >> been a week - any chance of seeing this fix rolled into 6.x?
> >>     
> >
> > If people confirm it fixes issues I will MFC it.  There was some pushback when
> > I first committed it so I waited on the MFC.
> 
> I can confirm that on 6.3 I can recreate the deadlock without the patch, 
> and can't recreate it with the patch.

Ok, I've merged it to RELENG_[67].
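
For anyone following the thread, here is a toy model of the idea in the
commit log quoted above.  It is only a sketch, not the sched_4bsd.c code:
QUANTUM_TICKS, count_preemptions() and its parameters are made-up names,
and the model assumes a single CPU where the old periodic timeout can only
fire while softclock is able to run.

```python
# Toy model only: names and numbers here are hypothetical, not kernel code.
QUANTUM_TICKS = 10  # assumed length of one scheduling quantum, in clock ticks


def count_preemptions(ticks, tick_driven, softclock_blocked_from=None):
    """Count userland preemption requests made over `ticks` clock ticks.

    tick_driven=True  models charging the running thread on every clock
                      tick and requesting a preemption when its quantum
                      is used up.
    tick_driven=False models a periodic timeout that requests the
                      preemption, but only while softclock can run.
    softclock_blocked_from is the tick at which softclock blocks on a
    lock and stops servicing timeouts (None means it never blocks).
    """
    preemptions = 0
    used = 0  # ticks consumed by the currently running thread
    for t in range(ticks):
        blocked = (softclock_blocked_from is not None
                   and t >= softclock_blocked_from)
        if tick_driven:
            # The clock interrupt always runs, blocked softclock or not.
            used += 1
            if used >= QUANTUM_TICKS:
                preemptions += 1
                used = 0
        else:
            # The timeout is due every QUANTUM_TICKS ticks, but it is
            # never serviced once softclock is blocked.
            if (t + 1) % QUANTUM_TICKS == 0 and not blocked:
                preemptions += 1
    return preemptions
```

In this model, once softclock blocks, the timeout-driven variant stops
requesting preemptions altogether, while the tick-driven variant keeps
rotating threads, so the lock holder eventually gets to run again.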

-- 
John Baldwin