From: Robert Watson
Date: Fri, 25 Jan 2008 20:10:47 +0000 (GMT)
To: Scott Long
Cc: current@freebsd.org
Subject: Re: minidumps are unsafe on amd64

On Fri, 25 Jan 2008, Scott Long wrote:

> Is this a case where you are manually triggering a dump on a system that is
> otherwise running fine?  I thought that crashes already disabled interrupts
> and made an attempt to stop other CPUs.  That's why there is dump-specific
> code in every storage driver in the first place; it implements polled i/o so
> that crashdump i/o can take place with interrupts disabled.  If it's a case
> where interrupts aren't actually getting disabled, then that's one thing.
> If it's a case where you're trying to fix something that isn't broken, then
> I'm very cautious about the added complexity that you're proposing.

Unfortunately, we don't really do this today -- we do stop the other CPUs when
we enter the debugger, but we restart them when we leave, and the dump code
runs outside of the debugger context.  I ran into this problem when working on
textdumps, as common storage drivers attempt to acquire locks in their dump
path.  Instead of writing out DDB output incrementally block-at-a-time, I have
to buffer it all and then generate it at the normal dump point after leaving
the debugger.

In terms of generally improving robustness of the debugging environment, I've
been pondering the following:

- Dump routines run from the KDB context, so that they get the protections
  associated with running in the debugger.  In particular, they need a more
  reliable assumption that the rest of the kernel is halted.  I'm a bit
  surprised we haven't been bitten by this more in the past...

- A more SMP-safe passage into the debugger, especially from panic().  We
  should disable interrupts immediately on panic() to prevent preemption on
  the panicking CPU by an interrupt.  We should write any state to pass into
  the debugger into a per-CPU buffer to be picked up after kdb_trap() has
  popped us into the debugger.  The panic message should be printed by KDB,
  and not using printf(), which is prone to preemption especially on serial
  consoles.

- Dump routines pass through a bounds checking block write call.  Right now
  they directly invoke di->dumper(), and the caller is responsible for not
  asking for blocks outside the swap partition.  A wrapper on the order of
  dump_blockwrite() should do the bounds checking to add robustness
  (obviously, callers should also place their blocks correctly).
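To make the last item concrete, here is a rough userland sketch of what a bounds-checking wrapper could look like. It is only an illustration under stated assumptions: struct mock_dumperinfo, mock_dumper_t, and mock_count_writes() are hypothetical stand-ins invented for this example and are not the kernel's real definitions, and the error codes chosen are likewise a guess. The one idea it tries to show is checking the request against the partition extent without ever computing offset + length, so the check itself cannot overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-ins for the kernel's dumper state.
 * The names and fields are assumptions for this sketch, not the
 * actual definitions in the FreeBSD tree.
 */
typedef int mock_dumper_t(void *priv, const void *buf, int64_t offset,
    size_t length);

struct mock_dumperinfo {
	mock_dumper_t	*dumper;	/* driver's polled write routine */
	void		*priv;		/* driver-private cookie */
	int64_t		 mediaoffset;	/* first byte of the dump partition */
	int64_t		 mediasize;	/* partition size in bytes */
};

/*
 * Validate that [offset, offset + length) lies inside the dump
 * partition before handing the write to the driver.
 */
int
dump_blockwrite(struct mock_dumperinfo *di, const void *buf, int64_t offset,
    size_t length)
{
	int64_t rel;

	if (di == NULL || di->dumper == NULL || buf == NULL)
		return (EINVAL);
	if (di->mediaoffset < 0 || di->mediasize < 0)
		return (EINVAL);
	if (length == 0)
		return (0);
	/* Reject writes that begin before the partition. */
	if (offset < di->mediaoffset)
		return (ENOSPC);
	/* Safe: offset >= mediaoffset >= 0, so this cannot overflow. */
	rel = offset - di->mediaoffset;
	if (rel > di->mediasize)
		return (ENOSPC);
	/* Compare against remaining space to avoid offset + length overflow. */
	if ((uint64_t)length > (uint64_t)(di->mediasize - rel))
		return (ENOSPC);
	return (di->dumper(di->priv, buf, offset, length));
}

/* Stub driver routine for experimentation: counts accepted writes. */
int
mock_count_writes(void *priv, const void *buf, int64_t offset, size_t length)
{
	int *count = priv;

	assert(buf != NULL);
	(void)offset;
	(void)length;
	(*count)++;
	return (0);
}
```

With a wrapper of this shape, a caller that miscomputes a block position gets an error back instead of writing outside the partition; the stub only exists so the checks can be exercised without a real driver.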

I'm almost certainly not the right person to look at making dumper routines
work in KDB, but I can look at improving the reliability of getting into KDB,
as well as passing data into it more reliably.  I'm happy to let someone else
pick this up and run with it, though, as it will be a ways down on my TODO
list for a bit.

Robert N M Watson
Computer Laboratory
University of Cambridge