Date: Mon, 23 Jun 2008 13:16:40 -0600
From: James Gritton
To: John Baldwin
Cc: freebsd-hackers@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output
Message-ID: <485FF698.103@gritton.org>
In-Reply-To: <200806231451.52340.jhb@freebsd.org>

John Baldwin wrote:
> On Thursday 19 June 2008 11:57:51 am James Gritton wrote:
>> John Baldwin wrote:
>>> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>>>> I've been trying to track down a deadlock on some newish production
>>>> servers running FreeBSD 6.3-RELEASE-p2. The deadlock occurs on a
>>>> specific (although mundane) hardware configuration, and each of several
>>>> servers running this hardware deadlock about once per week.
>>>>
>>>> Although I suspect that this is not hardware related, from a (naive)
>>>> perusal of the attached stack traces.
>>>>
>>>> Forgive me if my interpretation of this is all wrong, but I'm pretty
>>>> desperate for help. So here's my basic understanding of the deadlock:
>>>>
>>>> These processes seem to be waiting on the page queue mutex:
>>>> sendmail (in vm_mmap > vm_map_find > vm_map_insert > vm_map_pmap_enter)
>>>> bsnmpd (in malloc, uma_large_malloc > page_alloc > kmem_malloc)
>>>> httpd (in trap > trap_pfault > vm_fault)
>>>> [g_up] (in g_vfs_done > bufdone)
>>>>
>>>> The page queue mutex is held by rsync process:
>>>> rsync (in trap > trap_pfault > vm_fault > pmap_enter)
>>>>
>>>> Rsync kernel process (in pmap_enter) was interrupted while holding the
>>>> page queue lock?
>>>>
>>>> Giant is enabled in loader.conf due to the needs of the pf firewall when
>>>> dealing with user credentials lookups. I do not believe that Giant plays
>>>> into this deadlock. Kernel config attached.
>>>>
>>>> Any and all help or info is welcome. Thanks in advance.
>>>
>>> Try this change:
>>>
>>> jhb 2007-10-27 22:07:40 UTC
>>>
>>> FreeBSD src repository
>>>
>>> Modified files:
>>>   sys/kern sched_4bsd.c
>>> Log:
>>> Change the roundrobin implementation in the 4BSD scheduler to trigger a
>>> userland preemption directly from hardclock() via sched_clock() when a
>>> thread uses up a full quantum instead of using a periodic timeout to cause
>>> a userland preemption every so often. This fixes a potential deadlock
>>> when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
>>> by a thread pinned or bound to another CPU. The current thread on that
>>> CPU will never be preempted while softclock is blocked.
>>>
>>> Note that ULE already drives its round-robin userland preemption from
>>> sched_clock() as well and always enables IPI_PREEMPT.
>>>
>>> MFC after: 1 week
>>>
>>> Revision  Changes  Path
>>> 1.108     +8 -29   src/sys/kern/sched_4bsd.c
>>>
>>> We use it at work on 6.x. W/o this fix, round-robin stops working on 4BSD
>>> when softclock() (swi4: clock) blocks on a lock like Giant.
>>
>> I've been seeing similar troubles on 6.2 and I'll have to give this a
>> try as we upgrade to 6.3. I notice "MFC after: 1 week" in the log; it's
>> been a week - any chance of seeing this fix rolled into 6.x?
>
> If people confirm it fixes issues I will MFC it. There was some pushback when
> I first committed it so I waited on the MFC.

I can confirm that on 6.3 I can recreate the deadlock without the patch,
and can't recreate it with the patch.
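
For readers following the thread, the difference described in the quoted commit log can be illustrated with a toy model. This is only a sketch under stated assumptions, not the sched_4bsd.c code: the names Cpu, QUANTUM_TICKS, tick_timeout_based, tick_hardclock_based and preempted_within are all hypothetical, and the quantum length is arbitrary. It models one CPU whose current thread is preempted only when a need_resched flag is set, and compares a timeout that depends on softclock being able to run with a counter driven directly by the clock tick.

```python
from dataclasses import dataclass

QUANTUM_TICKS = 10  # hypothetical quantum length, in clock ticks


@dataclass
class Cpu:
    ticks_used: int = 0         # ticks consumed by the current thread
    need_resched: bool = False  # set when a userland preemption is requested


def tick_timeout_based(cpu, tick, softclock_blocked):
    """Model of the old scheme: a periodic timeout requests the preemption.

    The timeout handler runs from the softclock thread, so it never fires
    while softclock is blocked on a lock.
    """
    if not softclock_blocked and tick % QUANTUM_TICKS == 0:
        cpu.need_resched = True


def tick_hardclock_based(cpu, tick, softclock_blocked):
    """Model of the new scheme: the clock tick itself counts the quantum.

    This path does not depend on softclock, so softclock_blocked is ignored.
    """
    cpu.ticks_used += 1
    if cpu.ticks_used >= QUANTUM_TICKS:
        cpu.ticks_used = 0
        cpu.need_resched = True


def preempted_within(scheme, ticks, softclock_blocked):
    """Return True if the scheme requests a preemption within `ticks` ticks."""
    cpu = Cpu()
    for tick in range(1, ticks + 1):
        scheme(cpu, tick, softclock_blocked)
        if cpu.need_resched:
            return True
    return False
```

In this model, preempted_within(tick_timeout_based, 100, softclock_blocked=True) never sees a preemption request, which corresponds to the lock holder never getting a chance to run, while the tick-driven variant still requests one after a full quantum.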