From owner-freebsd-current@FreeBSD.ORG  Tue Dec 27 22:29:29 2011
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1DD751065670
	for <freebsd-current@freebsd.org>; Tue, 27 Dec 2011 22:29:29 +0000 (UTC)
	(envelope-from delphij@gmail.com)
Received: from mail-tul01m020-f182.google.com (mail-tul01m020-f182.google.com
	[209.85.214.182])
	by mx1.freebsd.org (Postfix) with ESMTP id DA6818FC1A
	for <freebsd-current@freebsd.org>; Tue, 27 Dec 2011 22:29:28 +0000 (UTC)
Received: by obbwd18 with SMTP id wd18so11739104obb.13
	for <freebsd-current@freebsd.org>; Tue, 27 Dec 2011 14:29:28 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=oU3xYWNvNf3PzC1Q6oJ6+gI1N8p86jzEE/73vFPJuZI=;
	b=gHy0QqMEZY++L5d2RyxWOZFeA8ImRV/DFw+nL2Kt9Gk163pEnT5mSIEVlVghs3LtXN
	IzMe5xX03SyEeMoXJSk5qgJal4/JKms6lpTSAfk3TZfdCkCcxppveMaQbGrCXvezpJkQ
	Mb4AHw21f8uOcXXBF8fK+Ii/qpSYrhFIrZLj0=
MIME-Version: 1.0
Received: by 10.182.45.102 with SMTP id l6mr26769391obm.0.1325024967240; Tue,
	27 Dec 2011 14:29:27 -0800 (PST)
Received: by 10.182.67.163 with HTTP; Tue, 27 Dec 2011 14:29:27 -0800 (PST)
In-Reply-To: <20111227215330.GI45484@redundancy.redundancy.org>
References: <20111227215330.GI45484@redundancy.redundancy.org>
Date: Tue, 27 Dec 2011 14:29:27 -0800
Message-ID: <CAGMYy3t3Rv006qvBCHr4kdbM86andkr5mRkvaGYw5CETO1XHkg@mail.gmail.com>
From: Xin LI <delphij@gmail.com>
To: David Thiel <lx@redundancy.redundancy.org>
Content-Type: text/plain; charset=UTF-8
Cc: freebsd-current@freebsd.org
Subject: Re: SU+J systems do not fsck themselves
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Dec 2011 22:29:29 -0000

On Tue, Dec 27, 2011 at 1:53 PM, David Thiel
<lx@redundancy.redundancy.org> wrote:
> I've had multiple machines now (9.0-RC3, amd64, i386 and earlier
> 9-CURRENT on ppc) running SU+J that have had unexplained panics and
> crashes start happening relating to disk I/O. When I end up running a
> full fsck, it keeps turning out that the disk is dirty and corrupted,
> but no mechanism is in place with SU+J to detect and fix this. A bgfsck
> never happens, but a manual fsck in single-user does indeed fix the
> crashing and weird behavior. Others have tested their SU+J volumes and
> found them to have errors as well. This makes me super nervous.
>
> Basically, the way SU+J seems to operate is this:
>
> http://redundancy.redundancy.org/fscklog2
>
> "Oh hey, I see you shut down uncleanly, let's check everything looks
> good, off you go, whee"
>
> Until I actually go and fsck, when I get:
>
> http://redundancy.redundancy.org/fscklog1
>
> So, I understand that journalling doesn't replace the need for a
> potential fsck (though I never had this problem with gjournal), but
> without a way for the system to detect that a fsck is necessary, this
> seems pretty much a guaranteed recipe for data corruption, and seems to
> offer little to no benefit over plain SU+fsck, or even just mounting
> async.
>
> So: is everyone else seeing this? Am I misunderstanding how SU+J should
> be used? How should the error resolution process really happen?

I'm not sure your experiment is valid here: the second log shows fsck
running read-only, which is most likely because it was run on a live
(mounted) file system.  What I would suggest doing is:

 - Reset the system while it's running;
 - Boot into single-user mode;
 - Use 'dd' to copy the disk to an image file;
 - Boot the system normally and:
    - attach a copy of the image with mdconfig -a -t vnode -f;
    - run the journalled fsck on it;
    - run a normal (full) fsck to check whether the journalled fsck
      did the right thing.
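
The steps above can be sketched roughly as the script below.  This is
only an illustration, not a tested procedure: the device name and image
paths in the usage note are hypothetical, and the assumption that
"fsck_ffs -p" performs the journalled recovery while "fsck_ffs -f -n"
does a full read-only check should be verified against fsck_ffs(8) on
your release.  By default it only prints the commands (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch of the suggested test. With DRY_RUN=1 (the default) the
# commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Single-user mode, right after the unclean reset:
# copy the raw, unmounted partition to an image file.
snapshot_disk() {   # usage: snapshot_disk <device> <image>
    run dd if="$1" of="$2" bs=1m
}

# After a normal boot: check a copy of the image, so the original
# image stays untouched for later comparison.
check_image() {     # usage: check_image <image>
    copy="$1.copy"
    run cp "$1" "$copy"
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ mdconfig -a -t vnode -f $copy"
        md=md0      # placeholder unit name for the dry run
    else
        md=$(mdconfig -a -t vnode -f "$copy")   # prints e.g. "md0"
    fi
    # Assumption: preen mode replays the SU+J journal.
    run fsck_ffs -p "/dev/$md"
    # Assumption: -f -n forces a full check without modifying anything.
    run fsck_ffs -f -n "/dev/$md"
    # Detach the memory disk again.
    run mdconfig -d -u "${md#md}"
}
```

For example, "snapshot_disk /dev/ada0p2 /mnt/spare/disk.img" in
single-user mode, then "check_image /mnt/spare/disk.img" after the
normal boot.  If the full check still reports errors after the
journalled one, that would point at the journal recovery itself.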

This would rule out changes introduced after the file system was
mounted, etc.  I personally did not hit any problems a few months ago,
but I haven't re-tested recently.

Cheers,
-- 
Xin LI <delphij@delphij.net> https://www.delphij.net/
FreeBSD - The Power to Serve! Live free or die