From owner-freebsd-hackers@FreeBSD.ORG  Mon Jun 23 19:16:47 2008
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8DC73106567F;
	Mon, 23 Jun 2008 19:16:47 +0000 (UTC)
	(envelope-from jamie@gritton.org)
Received: from gritton.org (gritton.org [161.58.222.4])
	by mx1.freebsd.org (Postfix) with ESMTP id 433F38FC18;
	Mon, 23 Jun 2008 19:16:47 +0000 (UTC)
	(envelope-from jamie@gritton.org)
Received: from guppy.corp.verio.net (fw.oremut02.us.wh.verio.net
	[198.65.168.24]) (authenticated bits=0)
	by gritton.org (8.13.6.20060614/8.13.6) with ESMTP id m5NJGj0t056021;
	Mon, 23 Jun 2008 13:16:46 -0600 (MDT)
Message-ID: <485FF698.103@gritton.org>
Date: Mon, 23 Jun 2008 13:16:40 -0600
From: James Gritton <jamie@gritton.org>
User-Agent: Thunderbird 2.0.0.9 (X11/20080228)
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>
References: <20080615112318.146C1F18512@mx.npubs.com>
	<200806180917.05689.jhb@freebsd.org> <485A81FF.1000000@gritton.org>
	<200806231451.52340.jhb@freebsd.org>
In-Reply-To: <200806231451.52340.jhb@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: ClamAV version 0.93, clamav-milter version 0.93 on gritton.org
X-Virus-Status: Clean
Cc: freebsd-hackers@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output

John Baldwin wrote:
> On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
>   
>> John Baldwin wrote:
>>     
>>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>>>   
>>>       
>>>> I've been trying to track down a deadlock on some newish production
>>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>>>> specific (although mundane) hardware configuration, and each of several
>>>> servers running this hardware deadlock about once per week.
>>>>
>>>> Although I suspect that this is not hardware related, from a (naive)
>>>> perusal of the attached stack traces.
>>>>
>>>> Forgive me if my interpretation of this is all wrong, but I'm pretty
>>>> desperate for help. So here's my basic understanding of the deadlock:
>>>>
>>>> These processes seem to be waiting on the page queue mutex:
>>>>  sendmail (in vm_mmap > vm_map_find > vm_map_insert > vm_map_pmap_enter)
>>>>  bsnmpd (in malloc, uma_large_malloc > page_alloc > kmem_malloc)
>>>>  httpd (in trap > trap_pfault > vm_fault)
>>>>  [g_up] (in g_vfs_done > bufdone)
>>>>
>>>> The page queue mutex is held by rsync process:
>>>>  rsync (in trap > trap_pfault > vm_fault > pmap_enter)
>>>>
>>>> Rsync kernel process (in pmap_enter) was interrupted while holding the
>>>> page queue lock?
>>>>
>>>>
>>>> Giant is enabled in loader.conf due to the needs of the pf firewall when
>>>> dealing with user credentials lookups. I do not believe that Giant plays
>>>> into this deadlock. Kernel config attached.
>>>>
>>>> Any and all help or info is welcome. Thanks in advance.
>>>>     
>>>>         
>>> Try this change:
>>>
>>> jhb         2007-10-27 22:07:40 UTC
>>>
>>>   FreeBSD src repository
>>>
>>>   Modified files:
>>>     sys/kern             sched_4bsd.c
>>>   Log:
>>>   Change the roundrobin implementation in the 4BSD scheduler to trigger a
>>>   userland preemption directly from hardclock() via sched_clock() when a
>>>   thread uses up a full quantum instead of using a periodic timeout to cause
>>>   a userland preemption every so often.  This fixes a potential deadlock
>>>   when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
>>>   by a thread pinned or bound to another CPU.  The current thread on that
>>>   CPU will never be preempted while softclock is blocked.
>>>
>>>   Note that ULE already drives its round-robin userland preemption from
>>>   sched_clock() as well and always enables IPI_PREEMPT.
>>>
>>>   MFC after:      1 week
>>>
>>>   Revision  Changes    Path
>>>   1.108     +8 -29     src/sys/kern/sched_4bsd.c
>>>
>>> We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD 
>>> when softclock() (swi4: clock) blocks on a lock like Giant.
>>>   
>>>       
>> I've been seeing similar troubles on 6.2 and I'll have to give this a 
>> try as we upgrade to 6.3.  I notice "MFC after: 1 week" in the log; it's 
>> been a week - any chance of seeing this fix rolled into 6.x?
>>     
>
> If people confirm it fixes issues I will MFC it.  There was some pushback when 
> I first committed it so I waited on the MFC.

I can confirm that on 6.3 I can reproduce the deadlock without the patch,
and cannot reproduce it with the patch applied.
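
For anyone skimming the thread, here is a minimal userland sketch of the
idea the commit log describes: the quantum is counted in the per-tick
scheduler hook itself, rather than by a periodic timeout that depends on
softclock running. This is not the sched_4bsd.c code. Every name below
(sim_thread, sim_switch_in, sim_sched_clock, QUANTUM_TICKS) is made up
for illustration, and the quantum length is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; this only simulates the idea. */
#define QUANTUM_TICKS 10	/* assumed time slice length, in clock ticks */

struct sim_thread {
	int  ticks_left;	/* ticks remaining in the current quantum */
	bool need_resched;	/* set when a userland preemption is wanted */
};

/* Give the thread a fresh quantum when it is put on a CPU. */
static void
sim_switch_in(struct sim_thread *td)
{
	td->ticks_left = QUANTUM_TICKS;
	td->need_resched = false;
}

/*
 * Called once per clock tick for the running thread, directly from the
 * simulated hardclock path.  No timeout is scheduled, so nothing here
 * waits on a softclock thread that might itself be blocked on a lock.
 */
static void
sim_sched_clock(struct sim_thread *td)
{
	assert(td->ticks_left > 0);
	if (--td->ticks_left == 0) {
		td->need_resched = true;	/* request a preemption */
		td->ticks_left = QUANTUM_TICKS;	/* begin the next quantum */
	}
}
```

Calling sim_sched_clock() QUANTUM_TICKS times after sim_switch_in() sets
need_resched on the last call and not before. The point of the sketch is
only that the round-robin decision no longer depends on a separate
timeout being serviced.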