From owner-freebsd-stable@FreeBSD.ORG  Tue Sep  2 07:55:33 2008
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 375BA1065677;
	Tue,  2 Sep 2008 07:55:33 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 051958FC1D;
	Tue,  2 Sep 2008 07:55:32 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 00E4146C1C;
	Tue,  2 Sep 2008 03:55:31 -0400 (EDT)
Date: Tue, 2 Sep 2008 08:55:31 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Jeremy Chadwick <koitsu@FreeBSD.org>
In-Reply-To: <20080902064740.GA17890@icarus.home.lan>
Message-ID: <alpine.BSF.1.10.0809020841340.1150@fledge.watson.org>
References: <20080901235144.4B53B4501A@ptavv.es.net>
	<E1KaPXs-0008LG-O7@cs1.cs.huji.ac.il>
	<20080902064740.GA17890@icarus.home.lan>
User-Agent: Alpine 1.10 (BSF 962 2008-03-14)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: =?ISO-8859-15?Q?Derek_Kuli=C5=3Fski?= <takeda@takeda.tk>,
	Michael <freebsdports@bindone.de>, freebsd-stable@freebsd.org
Subject: Re: bin/121684: : dump(8) frequently hangs
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 02 Sep 2008 07:55:33 -0000


On Mon, 1 Sep 2008, Jeremy Chadwick wrote:

> On Tue, Sep 02, 2008 at 09:39:20AM +0300, Danny Braniss wrote:
>> take a look at:
>> 	http://www.freebsd.org/cgi/query-pr.cgi?pr=117603 danny
>
> That PR may or may not be relevant, depending upon what FreeBSD version 
> users are using, and what kernel build date.
>
> The bug mentioned in that PR got addressed in HEAD on March 13th, 2008, and 
> the fix MFC'd to RELENG_7 on April 19th, 2008.  It was never MFC'd to 
> RELENG_6.
>
> If there are users on RELENG_7 with kernels built with sources after April 
> 19th 2008 who are experiencing the problem, then the PR is probably not 
> relevant.

Part of the "problem" here is that a whole class of possible bugs has 
near-identical symptoms.  While each of the following means quite different 
things to a kernel developer working on the problem, and may reflect quite 
different types of bugs, they are all often described as "dump hangs" or "dump 
wedges":

- the system deadlocks
- dump fails to complete and/or exit, but cannot be killed
- dump fails to complete and/or exit, but can be killed

When snapshots were first introduced, the problems tended to be at the top end 
of that list, corresponding to VFS locking and resource starvation deadlocks. 
As the snapshot code has matured and parallelism has increased, new problems 
have arisen in both the dump code and the kernel scheduler, reflecting a 
combination of old bugs in the dump code and new bugs in the kernel scheduler. 
Unfortunately, these bugs don't tend to get discovered much during testing in 
-CURRENT -- perhaps people don't back up their -CURRENT boxes much :-).

I think we need to rigorously do the following:

- For each bug report, determine whether it is reporting one or more of the
   above types of "hangs".  If multiple types are reported, track them with
   different bug reports.

- Establish as early as possible whether a fix resolves the problems in each
   report.  Because we're dealing with many bugs over time, it's possible to
   end up with accidentally "omnibus" reports that are never closed, even
   though committed and released fixes may correct the problems experienced
   by the reporter.  It is almost impossible, btw, to go back years later and
   determine whether any particular fix would have corrected any particular
   report, because the original submitter will have moved on.
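
As a rough illustration of the first step, here is a minimal Python sketch of
classifying the incidents described in a report and grouping them by hang
type, so that each type can be tracked in its own report.  The names
(HangType, Observation, classify, split_report) and the two yes/no questions
are invented for illustration; a real triage would rely on whatever detail
the submitter can actually provide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class HangType(Enum):
    """The three situations commonly reported as a 'dump hang'."""
    SYSTEM_DEADLOCK = "the system deadlocks"
    DUMP_UNKILLABLE = "dump fails to complete and/or exit, but cannot be killed"
    DUMP_KILLABLE = "dump fails to complete and/or exit, but can be killed"


@dataclass(frozen=True)
class Observation:
    """One incident from a bug report (hypothetical structure)."""
    system_responsive: bool  # could the reporter still run commands?
    dump_killable: bool      # did dump exit when killed? ignored if system hung


def classify(obs: Observation) -> HangType:
    """Map one incident onto one of the three hang types."""
    if not obs.system_responsive:
        return HangType.SYSTEM_DEADLOCK
    if not obs.dump_killable:
        return HangType.DUMP_UNKILLABLE
    return HangType.DUMP_KILLABLE


def split_report(observations: List[Observation]) -> Dict[HangType, List[Observation]]:
    """Group incidents by hang type; each group would get its own report."""
    groups: Dict[HangType, List[Observation]] = {}
    for obs in observations:
        groups.setdefault(classify(obs), []).append(obs)
    return groups


if __name__ == "__main__":
    # A made-up report mixing two different kinds of hang.
    report = [
        Observation(system_responsive=True, dump_killable=False),
        Observation(system_responsive=False, dump_killable=False),
    ]
    for hang_type, incidents in split_report(report).items():
        print(f"{len(incidents)} incident(s): {hang_type.value}")
```

A report whose incidents fall into more than one group is a candidate for
being split before anyone tries to match it against a fix.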

Dump happens to be particularly sensitive to bugs of these sorts because it 
uses snapshots and it uses multiple workers that signal each other, so it's a 
good litmus test of the stability of both of those features.  However, it's 
easy to conclude that dump is much less stable than it proves in practice, 
because we have a lot of stale and confusing bug reports.  What we do need is 
a dump bug report owner who can keep track of the outstanding set and 
aggressively close the reports that are fixed.  Among other things, that will 
allow us to better track regressions vs bugs from inception.

Robert N M Watson
Computer Laboratory
University of Cambridge