From: Xin LI
To: David Thiel
Cc: freebsd-current@freebsd.org
Date: Tue, 27 Dec 2011 14:29:27 -0800
Subject: Re: SU+J systems do not fsck themselves
In-Reply-To: <20111227215330.GI45484@redundancy.redundancy.org>

On Tue, Dec 27, 2011 at 1:53 PM, David Thiel wrote:
> I've had multiple machines now (9.0-RC3, amd64, i386 and earlier
> 9-CURRENT on ppc) running SU+J that have had unexplained panics and
> crashes start happening relating to disk I/O. When I end up running a
> full fsck, it keeps turning out that the disk is dirty and corrupted,
> but no mechanism is in place with SU+J to detect and fix this. A bgfsck
> never happens, but a manual fsck in single-user does indeed fix the
> crashing and weird behavior. Others have tested their SU+J volumes and
> found them to have errors as well. This makes me super nervous.
>
> Basically, the way SU+J seems to operate is this:
>
> http://redundancy.redundancy.org/fscklog2
>
> "Oh hey, I see you shut down uncleanly, let's check everything looks
> good, off you go, whee"
>
> Until I actually go and fsck, when I get:
>
> http://redundancy.redundancy.org/fscklog1
>
> So, I understand that journalling doesn't replace the need for a
> potential fsck (though I never had this problem with gjournal), but
> without a way for the system to detect that a fsck is necessary, this
> seems pretty much a guaranteed recipe for data corruption, and seems to
> offer little to no benefit over plain SU+fsck, or even just mounting
> async.
>
> So: is everyone else seeing this? Am I misunderstanding how SU+J should
> be used? How should the error resolution process really happen?

I'm not sure your experiment is valid here: the second log shows you
were running fsck read-only, which is likely because it was run on a
live file system. What I would suggest is:

 - Reset the system while it's running;
 - Boot into single user mode;
 - 'dd' the disk to an image file;
 - Boot the system normally and:
   - use mdconfig -a -t vnode -f on a copy of the image;
   - use journalled fsck;
   - use normal fsck to check if the journalled fsck did the right thing.

This would rule out possible changes introduced after mounting, etc. I
personally did not hit any problems a few months ago, but I haven't
re-tested recently.

Cheers,
-- 
Xin LI https://www.delphij.net/
FreeBSD - The Power to Serve! Live free or die
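
The test procedure suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a verified recipe: the device name, image path, and md unit number are hypothetical placeholders, and the fsck_ffs flags reflect my understanding that -y accepts the offer to use the journal while -f forces a full check that bypasses it. The function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# print_plan DEVICE IMAGE UNIT
# Prints (does not execute) the commands for each phase of the test.
# DEVICE, IMAGE and UNIT are hypothetical; adjust them for the real system.
# IMAGE must live on a different, writable file system than DEVICE.
print_plan() {
    dev=$1
    img=$2
    unit=$3
    cat <<EOF
# Phase 1: single-user mode, right after resetting the running system
dd if=$dev of=$img bs=1m

# Phase 2: after a normal boot, work only on a copy of the image
cp $img $img.work
mdconfig -a -t vnode -f $img.work -u $unit
# Journalled check: assumed to replay the SU+J journal when answered "yes"
fsck_ffs -y /dev/md$unit
# Forced full check, read-only: should find nothing if the journal did its job
fsck_ffs -f -n /dev/md$unit
mdconfig -d -u $unit
EOF
}

print_plan /dev/ada0p2 /var/tmp/suj.img 9
```

If the forced full pass still reports inconsistencies on the copy, that would point at the journalled recovery itself rather than at changes made after the file system was mounted. Keeping the original image untouched means the comparison can be repeated.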