From: Scott Long
To: David Thiel
Cc: freebsd-current@freebsd.org, d@delphij.net
Subject: Re: SU+J systems do not fsck themselves
Date: Wed, 28 Dec 2011 00:57:55 -0700
Message-Id: <9DAD04BE-D330-4DC8-9307-597834EEA2CA@samsco.org>
In-Reply-To: <20111228073442.GM45484@redundancy.redundancy.org>
List-Id: Discussions about the use of FreeBSD-current

On Dec 28, 2011, at 12:34 AM, David Thiel wrote:

> On Tue, Dec
> 27, 2011 at 11:54:20PM -0700, Scott Long wrote:
>> The first run of fsck, using the journal, gives results that I would
>> expect. The second run seems to imply that the fixes made on the
>> first run didn't actually get written to disk. This is definitely an
>> oddity. I see that you're using geli, maybe there's some strange
>> side-effect there. No idea. Report as a bug, this is definitely
>> undesired behavior.
>
> Not impossible, but I was seeing similar issues on two non-geli systems
> as well, i.e. tons of errors fixed when doing a single-user
> non-journalled fsck, but journalled fsck not fixing stuff. I'll try to
> replicate on a test machine, as I already lost data on the last
> (non-geli) machine this happened to.
>
>> For the love that is all good and holy, don't ever run fsck on a live
>> filesystem. It's going to report these kinds of problems! It's
>> normal; filesystem metadata updates stay cached in memory, and fsck
>> bypasses that cache.
>
> Ok. I expected fsck would be softupdate-aware in that way, but I
> understand it not doing so.
>
>>> - SU+J and fsck do not work correctly together to fix corruption on
>>> boot, i.e. bgfsck isn't getting run when it should
>>
>> The point of SUJ is to eliminate the need for bgfsck. Effectively,
>> they are exclusive ideas.
>
> This is surprising to me. It is my impression that under Linux at least,
> ext3fs is checked against the journal, and gets a full e2fsck if it
> finds it's still dirty. Additionally, there's a periodic fsck after 180
> days continuous runtime or x number of mounts (see tune2fs -i and -c).
> Is SU+J somehow implemented in such a way that this is unnecessary? What
> does it do that the ext3fs people have missed?
>

SUJ isn't like ext3 journaling, it doesn't do 100% metadata logging.
Instead, it's an extension of softupdates.
Softupdates (SU) is still responsible for ordering dependent writes to
the disk to maintain consistency. What SU can't handle is the Unix/POSIX
idiom of unlinking a file from the namespace but keeping its inode
active through refcounts. When you have an unclean shutdown, you wind up
with stale blocks allocated to orphaned inodes. The point of bgfsck was
to scan the filesystem for these allocations and free them, just like
fsck does, but to do it in the background so that the boot could
continue. SUJ is basically just an intent log for this case; it tells
fsck where to find these allocations so that fsck doesn't have to do the
lengthy scan. FWIW, this problem is present in most any journaling
implementation and is usually solved via the use of intent records in a
journal, not unlike SUJ.

So, there's an assumption with SUJ+fsck that SU is keeping the
filesystem consistent. Maybe that's a bad assumption, and I'm not trying
to discredit your report. But the intention with SUJ is to eliminate the
need for anything more than a cursory check of the superblocks and a
processing of the SUJ intent log. If either of these fails then fsck
reverts to a traditional scan. In the same vein, ext3 and most other
traditional journaling filesystems assume that the journal is correct
and is preserving consistency, and don't do anything more than a cursory
data structure scan and journal replay as well, but then revert to a
full scan if that fails (zfs seems to be an exception here, with there
being no actual fsck available for it).

As for the 180 day forced scan on ext3, I have no public comment. SU has
matured nicely over the last 10+ years, and I'm happy with the progress
that SUJ has made in the last 2-3 years. If there are bugs, they need to
be exposed and addressed ASAP.

Scott
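
For readers unfamiliar with the unlink-while-open idiom mentioned above,
here is a minimal user-space sketch in Python. It is only an
illustration, not code from FreeBSD or from the original message; the
function name unlink_while_open is made up for the example. It creates a
temporary file, removes its name, and shows that the data stays readable
through the open descriptor. If the machine crashed before the final
close, the blocks belonging to this nameless inode are the kind of stale
allocation that bgfsck had to scan for and that the SUJ intent log
points fsck at.

```python
import os
import tempfile


def unlink_while_open():
    """Unlink a file while holding it open; return what is observable.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                       # remove the name from the namespace
        name_gone = not os.path.exists(path)  # no directory entry is left
        links = os.fstat(fd).st_nlink         # link count of the inode (typically 0 now)
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)                # blocks are still allocated and readable
    finally:
        os.close(fd)                          # last reference dropped; space can be freed
    return name_gone, links, data


if __name__ == "__main__":
    name_gone, links, data = unlink_while_open()
    print("name gone:", name_gone)
    print("link count:", links)
    print("data:", data)
```

On a clean shutdown the kernel releases the blocks at the final close,
so nothing is left for fsck to clean up; the orphaned case only arises
when the system goes down while such a descriptor is still open.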