From owner-freebsd-fs@FreeBSD.ORG  Sun Jul 10 04:38:35 2011
Date: Sun, 10 Jul 2011 14:38:25 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Peter Jeremy <peterjeremy@acm.org>
In-Reply-To: <20110710011549.GA88534@server.vk2pj.dyndns.org>
Message-ID: <20110710133025.V1039@besplex.bde.org>
References: <CAAN28g39WW6vBZ6q7AQ=08++-SJ31WikeFtXy2zDF1=XtdKAzg@mail.gmail.com>
	<20110710011549.GA88534@server.vk2pj.dyndns.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@FreeBSD.org
Subject: Re: fsck_ufs a 2TB partition with 256MB RAM stalls

On Sun, 10 Jul 2011, Peter Jeremy wrote:

> On 2011-Jul-07 15:57:52 +0200, Rick van der Zwet <info@rickvanderzwet.nl> wrote:
>> I want to build a file server with limited power usage, so I got
>> myself an ALIX alix2d13 which has 256MB DDR RAM. I connected a 2TB
>> USB2.0 disk to the alix2d13 to be used as storage.
>>
>> The file system gets corrupted due to power failure, which is likely
>> going to happen when running Solar Power in The Netherlands, I cannot
>> fix it anymore cause the fsck_ufs never to complete. This actually
>> makes sense as the recommendation [1] says ``1TB storage needs 1GB of
>> RAM for fsck_ufs''.
>
> The problem is that fsck allocates both per-CG and per-allocated-inode
> space (and possibly other space) so running fsck on a large UFS needs
> lots of RAM.  I suspect you'll need to find an amd64 box to run the
> fsck on but you might be able to get the fsck to complete (very
> slowly) by adding plenty of swap (on another disk) and increasing
> kern.maxdsiz in loader.conf.

This (and the inherent slowness of fscking an enormous number of CGs)
might be fixable by using a per-CG dirty flag.  Each CG would become
more like an independent file system, with most or all of the file
systems effectively mounted read-only most of the time, so that most
of them don't have to be looked at by fsck after a crash.  Upgrades
to read-write would be automatic and instant, while downgrades to
read-only would be either automatic (in the kernel, after a timeout),
or managed by an application.  The dirty flags should be stored together
and not in individual CGs (except as backups) so that examining them
to determine what to fsck doesn't require reading all CGs.
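
A minimal sketch of the central dirty-flag table described above, in C.
Nothing like this exists in ffs; the structure and function names
(cg_dirty_map, cg_mark_dirty, cg_dirty_list and so on) are hypothetical,
and a real version would also have to write the table to disk before the
first modification of a newly dirtied CG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical central table: one dirty bit per cylinder group (CG). */
struct cg_dirty_map {
	uint32_t ncg;		/* number of cylinder groups */
	uint8_t *bits;		/* ncg bits, stored contiguously */
};

/* Allocate the table with every CG clean.  Returns 0 on success. */
static int
cg_dirty_map_init(struct cg_dirty_map *m, uint32_t ncg)
{
	m->ncg = ncg;
	m->bits = calloc((ncg + 7) / 8, 1);
	return (m->bits == NULL ? -1 : 0);
}

static bool
cg_is_dirty(const struct cg_dirty_map *m, uint32_t cg)
{
	assert(cg < m->ncg);
	return ((m->bits[cg / 8] & (1u << (cg % 8))) != 0);
}

/*
 * "Upgrade" a CG to read-write.  Returns true if the bit changed, in
 * which case the caller would have to flush the table before touching
 * the CG itself.
 */
static bool
cg_mark_dirty(struct cg_dirty_map *m, uint32_t cg)
{
	if (cg_is_dirty(m, cg))
		return (false);
	m->bits[cg / 8] |= (uint8_t)(1u << (cg % 8));
	return (true);
}

/* "Downgrade" a CG to read-only once all its writes are on disk. */
static void
cg_mark_clean(struct cg_dirty_map *m, uint32_t cg)
{
	assert(cg < m->ncg);
	m->bits[cg / 8] &= (uint8_t)~(1u << (cg % 8));
}

/*
 * What fsck would do after a crash: list the dirty CGs without reading
 * any CG.  Stores at most max entries in out and returns the total
 * number of dirty CGs.
 */
static uint32_t
cg_dirty_list(const struct cg_dirty_map *m, uint32_t *out, uint32_t max)
{
	uint32_t n = 0;

	for (uint32_t cg = 0; cg < m->ncg; cg++) {
		if (!cg_is_dirty(m, cg))
			continue;
		if (n < max)
			out[n] = cg;
		n++;
	}
	return (n);
}
```

With one bit per CG the whole table is tiny even for a file system with
tens of thousands of CGs, so fsck could read it in a single I/O before
deciding which CGs to examine.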

This should work well, since on a multi-TB disk, it is physically
impossible to have more than a tiny proportion of the disk active at
any one time.  The proportion might be scattered over the whole disk
and thus require too many "mounted" CGs, but that is bad for performance
in other ways so should be avoided, and implementing this avoidance
is relatively easy (just add a mild preference to use "mounted" CGs
to existing preferences for the same and nearby CGs).  The complications
for independent sub-filesystems in CGs are similar but much smaller
than ones for growing a filesystem by turning separate filesystems
into sub-filesystems.
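
The "mild preference" could look roughly like the following sketch.  It
is not how the ffs allocator actually chooses a CG; pick_cg, struct
cg_state and the weight CLEAN_CG_PENALTY are all made up for
illustration.  The idea is only that a clean ("unmounted") CG costs a
little extra, so a nearby dirty CG wins over the preferred clean one, but
a distant dirty CG does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical weight: how many CGs of distance a clean CG "costs". */
#define CLEAN_CG_PENALTY	2

/* Hypothetical per-CG summary used only by this sketch. */
struct cg_state {
	bool dirty;		/* CG is currently "mounted" read-write */
	uint32_t nfree;		/* free blocks in the CG */
};

/*
 * Pick a CG with free space, preferring CGs near pref and mildly
 * preferring CGs that are already dirty.  Returns the CG number, or -1
 * if no CG has free space.
 */
static int64_t
pick_cg(const struct cg_state *cgs, uint32_t ncg, uint32_t pref)
{
	int64_t best = -1;
	uint32_t best_cost = UINT32_MAX;

	assert(pref < ncg);
	for (uint32_t cg = 0; cg < ncg; cg++) {
		uint32_t dist, cost;

		if (cgs[cg].nfree == 0)
			continue;
		dist = cg > pref ? cg - pref : pref - cg;
		cost = dist + (cgs[cg].dirty ? 0 : CLEAN_CG_PENALTY);
		if (cost < best_cost) {
			best_cost = cost;
			best = cg;
		}
	}
	return (best);
}
```

Because the penalty is a small constant, locality still dominates: the
allocator would only be steered towards an already-dirty CG when one is
within a couple of CGs of the preferred one.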

I have little need for large file systems so I haven't tried implementing
any of this.  I just use a poor man's version with too many separate
file systems so that each can be mounted read-only and backed up to
small media independently.  The automatic upgrade and downgrade would
be useful even for this setup, since most of my small file systems are
also rarely written to, but I have to mount half of them read-write all the
time since it is too much work to manually upgrade and downgrade them.

msdosfs's single dirty flag is a bit closer to working right than ffs's.
msdosfs doesn't scribble timestamps on the super block for read-write
mounts that never write any data.  But non-scribbling on the super-block
was broken when the dirty flag was implemented (very late, via bad bits
from Apple) for msdosfs.  Read-write mounts of msdosfs now scribble the
dirty flag itself on the superblock, so after a crash a read-write
mounted msdosfs filesystem is now considered dirty and has to be fscked,
although for lightly used ones the only dirt on it is the flag that
marks it as dirty.  And this dirt even has bugs in it: msdosfs's
superblock is actually its FAT; the dirty bit is in magic bytes at
the start of the FAT; but msdosfs file systems normally have 2 FATs,
and the dirty bit is not maintained properly in both of them, except
accidentally if a real FAT entry near the start is changed -- then
the second FAT is written to properly back up the real FAT entry,
and this accidentally backs up the dirty bit entry so that fsck
doesn't find the FATs to be inconsistent.  (Perhaps fsck should only
look at the dirty bit in the first copy.  Nothing really cares about
this, and only the simple comparison used in fsck notices the difference.
I don't know if OtherOS's fscks notice this difference.)
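
To illustrate the comparison problem, here is a sketch for FAT16 only.
It assumes the usual on-disk layout, in which FAT entry 1 is a
little-endian 16-bit value whose top bit (0x8000) is set when the volume
was shut down cleanly.  The function names are hypothetical, and this is
not the code in fsck_msdosfs; it only shows how a comparison could
ignore that one bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FAT16 entry 1 occupies bytes 2-3; its bit 15 is bit 7 of byte 3. */
#define FAT16_CLEAN_BYTE	3
#define FAT16_CLEAN_MASK	0x80

/* True if this FAT copy says the volume was cleanly unmounted. */
static bool
fat16_is_clean(const uint8_t *fat, size_t len)
{
	assert(len >= 4);
	return ((fat[FAT16_CLEAN_BYTE] & FAT16_CLEAN_MASK) != 0);
}

/*
 * Compare two FAT copies.  With ignore_clean_bit false this is the
 * simple byte-for-byte comparison; with it true, a difference in only
 * the clean-shutdown bit is not reported as an inconsistency.
 */
static bool
fat16_copies_match(const uint8_t *a, const uint8_t *b, size_t len,
    bool ignore_clean_bit)
{
	assert(len >= 4);
	if (!ignore_clean_bit)
		return (memcmp(a, b, len) == 0);
	if (memcmp(a, b, FAT16_CLEAN_BYTE) != 0)
		return (false);
	if ((a[FAT16_CLEAN_BYTE] | FAT16_CLEAN_MASK) !=
	    (b[FAT16_CLEAN_BYTE] | FAT16_CLEAN_MASK))
		return (false);
	return (memcmp(a + 4, b + 4, len - 4) == 0);
}
```

With the first copy marked dirty and the second still marked clean, the
simple comparison reports a mismatch while the masked one does not; any
difference in a real FAT entry is still reported either way.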

msdosfs also has a dynamic dirty flag (pm_fmod) which tracks changes to
FAT metadata, but this is not really used and the addition of the dirty
flag turned it into nonsense.  It is only used to panic when an
assertion fails.  Its intended use is only indicated in a comment.  But
that use never worked, since the flag is never downgraded to 0
(after making the FAT undirty by writing it), and setting the dirty
flag made it further from working since pm_fmod is upgraded to 1 on
every read-write mount.  pm_fmod thus tracks !MNT_RDONLY and is useless.

The corresponding flag in ffs (fs_fmod) which tracks changes to the
superblock is useful and is used correctly.  It is more needed since
ffs scribbles timestamps and other metadata to the superblock and
depends on delayed updates to write these to the disk.  Clearing the
flag on every superblock update prevents doing writes of null changes
on every sync().  Maintaining a dynamic per-filesystem dirty flag
is only slightly more complicated than maintaining this superblock
dirty flag.  It just has to track dirtiness for all data as well as
superblock metadata.
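
The set-on-modify, clear-on-write cycle can be modelled in a few lines.
This is a simplified user-space model, not the ffs code: struct sb_model
and its functions are hypothetical, and only the role played by the fmod
field corresponds to fs_fmod.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an in-core superblock. */
struct sb_model {
	int fmod;		/* 1 if changed since last written */
	long time;		/* last-modified timestamp */
	unsigned writes;	/* number of superblock writes issued */
};

/* Any change to superblock metadata sets the flag. */
static void
sb_modify(struct sb_model *sb, long now)
{
	assert(sb != NULL);
	sb->time = now;
	sb->fmod = 1;
}

/*
 * Called on every sync().  Writes the superblock only if it changed and
 * clears the flag when it does, so a null change costs no I/O.  Returns
 * true if a write was issued.
 */
static bool
sb_sync(struct sb_model *sb)
{
	assert(sb != NULL);
	if (sb->fmod == 0)
		return (false);
	sb->fmod = 0;
	sb->writes++;		/* stands in for the real disk write */
	return (true);
}
```

The downgrade to 0 in sb_sync() is the step that pm_fmod lacks.  A
per-filesystem (or per-CG) dirty flag would follow the same cycle, except
that it would have to be set for data writes too and cleared only once
all of them have reached the disk.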

Bruce