From: Andrew Gallatin
Date: Fri, 17 Sep 2004 07:47:29 -0400 (EDT)
To: Julian Elischer
cc: John Baldwin
cc: freebsd-threads@freebsd.org
Subject: Re: Unkillable KSE threaded proc

Julian Elischer writes:
> John Baldwin wrote:
> > On Thursday 16 September 2004 09:42 am, Andrew Gallatin wrote:
> >
> >>Julian Elischer writes:
> >> > Andrew, please try -current on its own now..
> >> > I have checked in some fixes that have helped others.
> >>
> >>OK, preemption off...  Still a system lockup, but a little different.
> >>
> >>The interesting thing here is that continuing and breaking into the
> >>debugger repeatedly seems to show that thread 0xc1646af0 is looping in
> >>exit.  I've seen him in thread_single, thread_suspend_check, and in
> >>exit itself at kern_exit.c:163, etc.  A breakpoint in
> >>thread_suspend_one never triggers, so I guess he's holding the proc
> >>lock and just looping forever.  A breakpoint in _mtx_assert() shows
> >>him asserting the proc lock in thread_suspend_check at kern_thread.c:898.
> >>Over and over.
> >
> >
> > There is definitely some sort of infinite loop here.  Stripping out the
> > comments in exit1() for that section of code reveals basically:
> >
> >         PROC_LOCK(p);
> >         if (p->p_flag & P_HADTHREADS) {
> > retry:
> >                 thread_suspend_check(0);
> >                 if (thread_single(SINGLE_EXIT))
> >                         goto retry;
> >         }
> >         p->p_flag |= P_WEXIT;
> >         PROC_UNLOCK(p);
> >
> > So it's easy to see how it can get stuck in a loop I think.  If
> > thread_single() never drops the lock then other threads that are
> > waiting to die can't actually wait because they can never get the
> > proc lock so that they can die.
> >
>
> hmm interesting..
> but this code hasn't changed in ages...
>
> in thread_single we see:
>
>         thread_suspend_one(td);
>         PROC_UNLOCK(p);
>         mi_switch(SW_VOL, NULL);
>         mtx_unlock_spin(&sched_lock);
>         PROC_LOCK(p);
>         mtx_lock_spin(&sched_lock);
>
> so when it sleeps it releases the proc lock.

But that's the problem.  As I said above, a breakpoint in
thread_suspend_one never triggers, so this code is never called.  It
must be bailing out of thread_suspend_one() before this happens.

Did somebody fix ddb?  If yes, I can try stepping through it if you
like.

Maybe a quick fix would be to drop the proc lock and tsleep for a
clock tick at the bottom of the infinite loop...

Drew
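
To make the "drop the proc lock and sleep for a tick" idea concrete, here is a rough userland analogy. It is only a sketch, not kernel code and not a proposed patch: pthreads stand in for the process's threads, a pthread mutex stands in for the proc lock, and nanosleep() stands in for a one-tick tsleep(). The names exit_with_backoff, worker, threads_left and MAX_WORKERS are made up for illustration, and the 10 ms tick is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

#define MAX_WORKERS 16          /* hypothetical cap, for this sketch only */

static pthread_mutex_t proc_lock = PTHREAD_MUTEX_INITIALIZER;
static int threads_left;        /* protected by proc_lock */

/* A thread "waiting to die": it cannot finish without the proc lock. */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&proc_lock);
    threads_left--;
    pthread_mutex_unlock(&proc_lock);
    return NULL;
}

/*
 * The exiting thread.  It starts out holding the lock, as exit1() does.
 * Instead of retrying with the lock held, it drops the lock and sleeps
 * for one "tick" at the bottom of each retry, so the workers can run.
 * Returns the number of threads still alive (0 on success), or -1 if
 * nworkers is out of range.
 */
int exit_with_backoff(int nworkers)
{
    pthread_t tids[MAX_WORKERS];
    struct timespec tick = { 0, 10 * 1000 * 1000 };   /* assumed 10 ms */
    int i, created = 0, left;

    if (nworkers < 0 || nworkers > MAX_WORKERS)
        return -1;

    pthread_mutex_lock(&proc_lock);
    for (i = 0; i < nworkers; i++) {
        /* Workers block on proc_lock right away; we still hold it. */
        if (pthread_create(&tids[created], NULL, worker, NULL) == 0)
            created++;
    }
    threads_left = created;

    while (threads_left > 0) {              /* the "retry" loop */
        pthread_mutex_unlock(&proc_lock);   /* drop the proc lock */
        nanosleep(&tick, NULL);             /* sleep for a tick */
        pthread_mutex_lock(&proc_lock);     /* retake it and recheck */
    }
    left = threads_left;
    pthread_mutex_unlock(&proc_lock);

    for (i = 0; i < created; i++)
        pthread_join(tids[i], NULL);
    return left;
}
```

Calling exit_with_backoff(4) from a small main() and linking with the pthread library should return 0 once all four workers have taken the lock and gone away. If the unlock/sleep/lock triple is removed from the loop, the workers can never get the lock and the loop spins forever, which is the behaviour described above.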