From: David Thiel
To: freebsd-current@freebsd.org
Date: Tue, 27 Dec 2011 13:53:55 -0800
Message-ID: <20111227215330.GI45484@redundancy.redundancy.org>

I've now had multiple machines (9.0-RC3, amd64, i386 and earlier 9-CURRENT on ppc) running SU+J that have started having unexplained panics and crashes related to disk I/O. Whenever I end up running a full fsck, it turns out that the disk is dirty and corrupted, but SU+J has no mechanism in place to detect and fix this. A bgfsck never happens, but a manual fsck in single-user mode does indeed fix the crashing and weird behavior. Others have tested their SU+J volumes and found them to have errors as well. This makes me super nervous.

Basically, the way SU+J seems to operate is this:

http://redundancy.redundancy.org/fscklog2

"Oh hey, I see you shut down uncleanly, let's check everything looks good, off you go, whee"

Until I actually go and fsck, when I get:

http://redundancy.redundancy.org/fscklog1

So, I understand that journalling doesn't replace the need for a potential fsck (though I never had this problem with gjournal), but without a way for the system to detect that a fsck is necessary, this seems pretty much a guaranteed recipe for data corruption, and seems to offer little to no benefit over plain SU+fsck, or even just mounting async.

So: is everyone else seeing this? Am I misunderstanding how SU+J should be used? How should the error resolution process really happen?

Thanks,
David