Date: Mon, 24 Dec 2007 07:26:01 +1100 (EST)
From: Bruce Evans
To: Robert Watson
Cc: amd64@freebsd.org
Subject: Re: Can't panic from debugger

On Sun, 23 Dec 2007, Robert Watson wrote:

> On Sun, 23 Dec 2007, Erwin Lansing wrote:
>
>> The amd64 nodes in the pointyhat cluster are starting to behave quite
>> interestingly. They stop to respond to ssh, but are still answering ping.
>> More worrying is that I cannot get a useful dump out of it, as a panic from
>> the debugger just hangs there, and all I am left with is to pull the plug.
>> This even happens on a normal working system after entering the debugger,
>> of which there is a typescript below.
>> ...
>
> I discovered yesterday that I was seeing the same problem on a dual-cpu,
> dual-core box in the netperf cluster:

This is as expected. Debugger context is special, and no non-debugger
functions can be called from it without going through the (unimplemented)
trampoline needed to temporarily leave it. Some non-debugger functions
may work accidentally or appear to work when called directly. panic() is
not one of these, since it tends to trip over a lock.

panic() called from an arbitrary context has the same problem, since the
calling context may hold a lock that is used by panic(). Debugger context
always holds the pseudo-spinlock of masked CPU interrupts and stopped
other CPUs.

panic() (actually boot()) normally begins with a normal sync() call that
is not aware that it may be called in either panic or debugger context
and depends on a large amount of system code working normally. It cannot
legitimately sync anything when called in debugger context, since syncing
requires I/O and I/O normally requires interrupts.

> I *can* get a coredump if I directly "call doadump" and then "reset", but I
> can't get one if I just do "panic".

Dumps have some chance of working since they are required to try harder
than sync() to work in any context. In particular, they are or were aware
that they are not permitted to use interrupts. Reset has a better chance
of working since it is simpler and reset is a legitimate debugger command.

I was asleep when the panic debugger command was added. Interrupt-driven
I/O in sync for panics back then tended to work bogusly by blowing away
both spl*() masks and the hard CPU interrupt disable mask.

Bruce