Date:      Tue, 27 Dec 2011 14:29:27 -0800
From:      Xin LI <delphij@gmail.com>
To:        David Thiel <lx@redundancy.redundancy.org>
Cc:        freebsd-current@freebsd.org
Subject:   Re: SU+J systems do not fsck themselves
Message-ID:  <CAGMYy3t3Rv006qvBCHr4kdbM86andkr5mRkvaGYw5CETO1XHkg@mail.gmail.com>
In-Reply-To: <20111227215330.GI45484@redundancy.redundancy.org>
References:  <20111227215330.GI45484@redundancy.redundancy.org>

On Tue, Dec 27, 2011 at 1:53 PM, David Thiel
<lx@redundancy.redundancy.org> wrote:
> I've had multiple machines now (9.0-RC3, amd64, i386 and earlier
> 9-CURRENT on ppc) running SU+J that have had unexplained panics and
> crashes start happening relating to disk I/O. When I end up running a
> full fsck, it keeps turning out that the disk is dirty and corrupted,
> but no mechanism is in place with SU+J to detect and fix this. A bgfsck
> never happens, but a manual fsck in single-user does indeed fix the
> crashing and weird behavior. Others have tested their SU+J volumes and
> found them to have errors as well. This makes me super nervous.
>
> Basically, the way SU+J seems to operate is this:
>
> http://redundancy.redundancy.org/fscklog2
>
> "Oh hey, I see you shut down uncleanly, let's check everything looks
> good, off you go, whee"
>
> Until I actually go and fsck, when I get:
>
> http://redundancy.redundancy.org/fscklog1
>
> So, I understand that journalling doesn't replace the need for a
> potential fsck (though I never had this problem with gjournal), but
> without a way for the system to detect that a fsck is necessary, this
> seems pretty much a guaranteed recipe for data corruption, and seems to
> offer little to no benefit over plain SU+fsck, or even just mounting
> async.
>
> So: is everyone else seeing this? Am I misunderstanding how SU+J should
> be used? How should the error resolution process really happen?

I'm not sure your experiment is valid here: the second log shows fsck
running read-only, most likely because it was run on a live (mounted)
file system.  What I would suggest is:

 - Reset the system while it's running;
 - Boot into single-user mode;
 - 'dd' the disk to an image file;
 - Boot the system normally and:
    - use mdconfig -a -t vnode -f on a copy of the image;
    - run the journalled fsck;
    - run a normal fsck to check whether the journalled fsck did the
      right thing.
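
The steps above could be sketched roughly as the script below. This is
only an illustration under assumptions: the device name, the image path,
and the helper names (run, capture_image, check_image) are placeholders
that do not come from this thread. It assumes fsck_ffs uses the SU+J
journal in preen mode (-p) and bypasses it when a full check is forced
with -f. By default it only prints the commands; nothing is executed
unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch of the suggested test procedure. DEV and IMG are hypothetical
# placeholders: point DEV at the real SU+J partition and IMG at a path on
# a different, writable file system with enough free space.
DEV="${DEV:-/dev/ada0p2}"
IMG="${IMG:-/var/tmp/suj-test.img}"

# Print each command; execute it only when RUN=1 (dry run by default).
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Step 1 (single-user mode, right after the reset): copy the unmounted,
# still-dirty file system to an image file.
capture_image() {
    run dd if="$DEV" of="$IMG" bs=1m
}

# Step 2 (after a normal boot): check a copy of the image so the
# original snapshot stays untouched and the test can be repeated.
check_image() {
    run cp "$IMG" "$IMG.work"
    if [ "${RUN:-0}" = "1" ]; then
        # mdconfig prints the name of the new device, e.g. md0
        md=$(mdconfig -a -t vnode -f "$IMG.work")
    else
        echo "+ mdconfig -a -t vnode -f $IMG.work"
        md="mdN"
    fi
    # Preen mode: assumed to replay the SU+J journal.
    run fsck_ffs -p "/dev/$md"
    # Forced full check, answering "no" to all repairs: anything it
    # reports is something the journal replay did not fix.
    run fsck_ffs -f -n "/dev/$md"
    # Detach the memory disk again.
    run mdconfig -d -u "$md"
}
```

To try it, source the file (the name is up to you) and call
capture_image from single-user mode, then check_image after a normal
boot; review the printed commands first, then set RUN=1 to execute them.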

This would rule out changes introduced after mounting, etc.  I
personally did not hit any problems a few months ago, but I haven't
re-tested recently.

Cheers,
-- 
Xin LI <delphij@delphij.net> https://www.delphij.net/
FreeBSD - The Power to Serve! Live free or die


