From owner-freebsd-stable@FreeBSD.ORG Mon Jun 30 23:16:14 2008
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D74F31065673;
	Mon, 30 Jun 2008 23:16:14 +0000 (UTC)
	(envelope-from stef@memberwebs.com)
Received: from mx.npubs.com (mail.zoneseven.net [209.66.100.224])
	by mx1.freebsd.org (Postfix) with ESMTP id AFA338FC1B;
	Mon, 30 Jun 2008 23:16:14 +0000 (UTC)
	(envelope-from stef@memberwebs.com)
Received: from mx.npubs.com (avhost [209.66.100.194])
	by mx.npubs.com (Postfix) with ESMTP id C64D7F181CD;
	Mon, 30 Jun 2008 23:16:13 +0000 (UTC)
Received: from northstar-srv2 (unknown [172.27.2.11])
	by mx.npubs.com (Postfix) with ESMTP id F0245F1817C;
	Mon, 30 Jun 2008 23:16:11 +0000 (UTC)
From: Stef
User-Agent: Thunderbird 2.0.0.14 (X11/20080505)
MIME-Version: 1.0
To: John Baldwin
References: <20080615112318.146C1F18512@mx.npubs.com>
	<200806180917.05689.jhb@freebsd.org>
X-Enigmail-Version: 0.95.0
Content-Type: multipart/mixed;
	boundary="------------040406060900030904060601"
Message-Id: <20080630231611.F0245F1817C@mx.npubs.com>
X-Virus-Scanned: ClamAV using ClamSMTP
Date: Mon, 30 Jun 2008 23:16:13 +0000 (UTC)
Cc: freebsd-hackers@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: FreeBSD 6.3 deadlock (vm_map?) with DDB output
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
X-List-Received-Date: Mon, 30 Jun 2008 23:16:15 -0000

This is a multi-part message in MIME format.
--------------040406060900030904060601
Content-Type: text/plain; charset=iso-8859-15
Content-Transfer-Encoding: 7bit

John Baldwin wrote:
> On Sunday 15 June 2008 07:23:19 am Stef Walter wrote:
>> I've been trying to track down a deadlock on some newish production
>> servers running FreeBSD 6.3-RELEASE-p2.
>> The deadlock occurs on a
>> specific (although mundane) hardware configuration, and each of several
>> servers running this hardware deadlock about once per week.
>
> Try this change:
>
> We use it at work on 6.x.  W/o this fix, round-robin stops working on 4BSD
> when softclock() (swi4: clock) blocks on a lock like Giant.

Just wanted to confirm: That patch did the trick. All the SMP machines
that had this problem have been stable for 11 days now, longer than any
of them were up previously.

I changed the patch slightly to work with FreeBSD 6.3-RELEASE. That's
attached, in case anyone needs this later.

Cheers,
Stef

--------------040406060900030904060601
Content-Type: text/x-patch;
 name="kern_sched4bsd_deadlock.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="kern_sched4bsd_deadlock.patch"

--- sys/kern/sched_4bsd.c.orig	2006-06-16 22:11:55.000000000 +0000
+++ sys/kern/sched_4bsd.c	2008-06-18 17:04:34.000000000 +0000
@@ -157,13 +157,10 @@
 static int	sched_quantum;	/* Roundrobin scheduling quantum in ticks. */
 #define	SCHED_QUANTUM	(hz / 10)	/* Default sched quantum */
 
-static struct callout	roundrobin_callout;
-
 static void	slot_fill(struct ksegrp *kg);
 static struct kse *sched_choose(void);		/* XXX Should be thread * */
 
 static void	setup_runqs(void);
-static void	roundrobin(void *arg);
 static void	schedcpu(void);
 static void	schedcpu_thread(void);
 static void	sched_priority(struct thread *td, u_char prio);
@@ -316,27 +313,6 @@
 }
 
 /*
- * Force switch among equal priority processes every 100ms.
- * We don't actually need to force a context switch of the current process.
- * The act of firing the event triggers a context switch to softclock() and
- * then switching back out again which is equivalent to a preemption, thus
- * no further work is needed on the local CPU.
- */
-/* ARGSUSED */
-static void
-roundrobin(void *arg)
-{
-
-#ifdef SMP
-	mtx_lock_spin(&sched_lock);
-	forward_roundrobin();
-	mtx_unlock_spin(&sched_lock);
-#endif
-
-	callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);
-}
-
-/*
  * Constants for digital decay and forget:
  *	90% of (kg_estcpu) usage in 5 * loadav time
  *	95% of (ke_pctcpu) usage in 60 seconds (load insensitive)
@@ -618,11 +594,6 @@
 	sched_quantum = SCHED_QUANTUM;
 	hogticks = 2 * sched_quantum;
 
-	callout_init(&roundrobin_callout, CALLOUT_MPSAFE);
-
-	/* Kick off timeout driven events by calling first time. */
-	roundrobin(NULL);
-
 	/* Account for thread0. */
 	sched_load_add();
 }
@@ -697,6 +668,14 @@
 		resetpriority(kg);
 		resetpriority_thread(td, kg);
 	}
+
+	/*
+	 * Force a context switch if the current thread has used up a full
+	 * quantum (default quantum is 100ms).
+	 */
+	if (!((td)->td_flags & TDF_IDLETD) &&
+	    ticks - PCPU_GET(switchticks) >= sched_quantum)
+		td->td_flags |= TDF_NEEDRESCHED;
 }
 
 /*
--------------040406060900030904060601--
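
For anyone reading the last hunk later: the check it adds to the scheduler
clock path can be tried out in userland. The sketch below is only an
illustration, not kernel code. The names sim_thread, SIM_IDLETD,
SIM_NEEDRESCHED and sim_sched_clock are made up here, and the tick counters
are passed in as plain arguments instead of coming from the global ticks
and PCPU_GET(switchticks). It assumes the counters do not wrap.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's TDF_IDLETD / TDF_NEEDRESCHED. */
#define SIM_IDLETD	0x1
#define SIM_NEEDRESCHED	0x2

/* Hypothetical, minimal stand-in for struct thread. */
struct sim_thread {
	int	flags;
};

/* True once the thread has run for at least one full quantum. */
static bool
quantum_expired(int ticks, int switchticks, int quantum)
{
	return (ticks - switchticks >= quantum);
}

/*
 * Mirrors the condition added by the patch: a non-idle thread that has
 * used up its quantum is marked as needing a reschedule. The check runs
 * on the clock tick itself, so it does not depend on a callout firing.
 */
static void
sim_sched_clock(struct sim_thread *td, int ticks, int switchticks,
    int quantum)
{
	if (!(td->flags & SIM_IDLETD) &&
	    quantum_expired(ticks, switchticks, quantum))
		td->flags |= SIM_NEEDRESCHED;
}
```

With a quantum of 10 ticks, a thread switched in at tick 100 gets
SIM_NEEDRESCHED set when sim_sched_clock() is called at tick 110, and not
at tick 109. A thread with SIM_IDLETD set is never marked.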