From: Bruce Evans
To: Peter Jeremy
Cc: freebsd-fs@FreeBSD.org
Date: Sun, 10 Jul 2011 14:38:25 +1000 (EST)
Subject: Re: fsck_ufs a 2TB partition with 256MB RAM stalls
Message-ID: <20110710133025.V1039@besplex.bde.org>
In-Reply-To: <20110710011549.GA88534@server.vk2pj.dyndns.org>

On Sun, 10 Jul 2011, Peter Jeremy wrote:

> On 2011-Jul-07 15:57:52 +0200, Rick van der Zwet wrote:
>> I want to build a file server with limited power usage, so I got
>> myself an ALIX alix2d13 which has 256MB DDR RAM. I connected a 2TB
>> USB2.0 disk to the alix2d13 to be used as storage.
>>
>> The file system gets corrupted due to power failure, which is likely
>> going to happen when running Solar Power in The Netherlands, I cannot
>> fix it anymore cause the fsck_ufs never to complete.
>> This actually makes sense as the recommendation [1] says ``1TB
>> storage needs 1GB of RAM for fsck_ufs''.
>
> The problem is that fsck allocates both per-CG and per-allocated-inode
> space (and possibly other space) so running fsck on a large UFS needs
> lots of RAM. I suspect you'll need to find an amd64 box to run the
> fsck on but you might be able to get the fsck to complete (very
> slowly) by adding plenty of swap (on another disk) and increasing
> kern.maxdsiz in loader.conf.

This (and the inherent slowness of fscking an enormous number of CGs)
might be fixable by using a per-CG dirty flag. Each CG would become
more like an independent file system, with most or all of the file
systems effectively mounted read-only most of the time, so that most of
them don't have to be looked at by fsck after a crash. Upgrades to
read-write would be automatic and instant, while downgrades to read-only
would be either automatic (in the kernel, after a timeout), or managed
by an application. The dirty flags should be stored together and not in
individual CGs (except as backups) so that examining them to determine
what to fsck doesn't require reading all CGs.

This should work well, since on a multi-TB disk, it is physically
impossible to have more than a tiny proportion of the disk active at any
one time. The proportion might be scattered over the whole disk and
thus require too many "mounted" CGs, but that is bad for performance in
other ways so should be avoided, and implementing this avoidance is
relatively easy (just add a mild preference to use "mounted" CGs to
existing preferences for the same and nearby CGs).

The complications for independent sub-filesystems in CGs are similar
but much smaller than ones for growing a filesystem by turning separate
filesystems into sub-filesystems.

I have little need for large file systems so I haven't tried
implementing any of this.
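
For illustration only, the per-CG dirty flag scheme above can be
sketched in userland C. This is not ffs code and nothing like it
exists in the tree as far as this message says: struct cgdirty_map and
every cgdirty_*() function are hypothetical names, the map is kept in
memory only, and the step of writing the map to disk before the first
write to a newly "mounted" CG is left as a comment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical map: one bit per cylinder group, all bits stored
 * together so that fsck can read them without touching every CG.
 */
struct cgdirty_map {
	uint32_t ncg;		/* number of cylinder groups */
	uint8_t *bits;		/* bit set => CG is "mounted" read-write */
};

static int
cgdirty_init(struct cgdirty_map *m, uint32_t ncg)
{
	m->ncg = ncg;
	m->bits = calloc((ncg + 7) / 8, 1);
	return (m->bits == NULL ? -1 : 0);
}

static int
cgdirty_isset(const struct cgdirty_map *m, uint32_t cg)
{
	return ((m->bits[cg / 8] >> (cg % 8)) & 1);
}

/*
 * Upgrade a CG to read-write.  Returns 1 if the map changed; a real
 * implementation would have to get the map onto the disk before the
 * first write to the CG (assumption, not shown here).
 */
static int
cgdirty_upgrade(struct cgdirty_map *m, uint32_t cg)
{
	if (cgdirty_isset(m, cg))
		return (0);
	m->bits[cg / 8] |= (uint8_t)(1u << (cg % 8));
	return (1);
}

/* Downgrade a CG to read-only once all of its writes are on disk. */
static void
cgdirty_downgrade(struct cgdirty_map *m, uint32_t cg)
{
	m->bits[cg / 8] &= (uint8_t)~(1u << (cg % 8));
}

/*
 * After a crash: list the CGs that fsck has to look at.  Returns the
 * total number of dirty CGs; at most max of them are stored in out[].
 */
static uint32_t
cgdirty_to_check(const struct cgdirty_map *m, uint32_t *out, uint32_t max)
{
	uint32_t cg, n = 0;

	for (cg = 0; cg < m->ncg; cg++) {
		if (cgdirty_isset(m, cg)) {
			if (n < max)
				out[n] = cg;
			n++;
		}
	}
	return (n);
}

/*
 * Mild allocation preference: use an already-"mounted" CG within
 * window CGs of the preferred one (pref < ncg), else the preferred
 * one itself.
 */
static uint32_t
cgdirty_pick(const struct cgdirty_map *m, uint32_t pref, uint32_t window)
{
	uint32_t d;

	for (d = 0; d <= window; d++) {
		if (pref + d < m->ncg && cgdirty_isset(m, pref + d))
			return (pref + d);
		if (pref >= d && cgdirty_isset(m, pref - d))
			return (pref - d);
	}
	return (pref);
}
```

With 20 CGs and only CGs 9 and 17 upgraded, cgdirty_to_check() reports
just those two, and cgdirty_pick() with a preference of 7 and a window
of 3 steers a new allocation to CG 9 instead of dirtying a third CG.
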
I just use a poor man's version with too many separate file
systems so that each can be mounted read-only and backed up to small
media independently. The automatic upgrade and downgrade would be
useful even for this setup, since most of my small file systems are also
rarely written to, but I have to mount half of them read-write all the
time since it is too much work to manually upgrade and downgrade them.

msdosfs's single dirty flag is a bit closer to working right than ffs's.
msdosfs doesn't scribble timestamps on the super block for read-write
mounts that never write any data. But non-scribbling on the super-block
was broken when the dirty flag was implemented (very late, via bad bits
from Apple) for msdosfs. Read-write mounts of msdosfs now scribble the
dirty flag itself on the superblock, so after a crash a read-write
mounted msdosfs filesystem is now considered dirty and has to be
fscked, although for lightly used ones the only dirt on it is the flag
that marks it as dirty. And this dirt even has bugs in it: msdosfs's
superblock is actually its FAT; the dirty bit is in magic bytes at the
start of the FAT; but msdosfs file systems normally have 2 FATs, and the
dirty bit is not maintained properly in both of them, except
accidentally if a real FAT entry near the start is changed -- then the
second FAT is written to properly back up the real FAT entry, and this
accidentally backs up the dirty bit entry so that fsck doesn't find the
FATs to be inconsistent. (Perhaps fsck should only look at the dirty
bit in the first copy. Nothing really cares about this, and only the
simple comparison used in fsck notices the difference. I don't know if
OtherOS's fscks notice this difference.)

msdosfs also has a dynamic dirty flag (pm_fmod) which tracks changes to
FAT metadata, but this is not really used and the addition of the dirty
flag turned it into nonsense. It is only used to panic when an
assertion fails. Its useful use is only indicated in a comment.
But the useful use never worked, since the flag is never downgraded to 0
(after making the FAT undirty by writing it), and setting the dirty flag
made it further from working since pm_fmod is upgraded to 1 on every
read-write mount. pm_fmod thus tracks !MNT_RDONLY and is useless.

The corresponding flag in ffs (fs_fmod) which tracks changes to the
superblock is useful and is used correctly. It is more needed since
ffs scribbles timestamps and other metadata to the superblock and
depends on delayed updates to write these to the disk. Clearing the
flag on every superblock update prevents doing writes of null changes on
every sync().

Maintaining a dynamic per-filesystem dirty flag is only slightly more
complicated than maintaining this superblock dirty flag. It just has to
track dirtiness for all data as well as superblock metadata.

Bruce
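
The fs_fmod pattern described above (set the flag on a change, clear it
when the superblock is written, skip the write when it is clear) can be
sketched as a toy model in C. This is not ffs source: struct toyfs,
its fields and the toyfs_*() functions are hypothetical, and the
sbwrites counter only stands in for a real disk write.

```c
#include <assert.h>

/* Hypothetical stand-in for a mounted file system; not struct fs. */
struct toyfs {
	int fmod;	/* in-core superblock changed since last write */
	int sbwrites;	/* superblock writes issued (illustration only) */
};

/* Any change to superblock metadata marks it as modified. */
static void
toyfs_touch_sb(struct toyfs *fs)
{
	fs->fmod = 1;
}

/*
 * Periodic sync: write the superblock only if it changed.  The flag
 * is cleared before the write so that a change made afterwards
 * re-dirties it and is picked up by the next sync.
 */
static void
toyfs_sync(struct toyfs *fs)
{
	if (fs->fmod == 0)
		return;		/* null change: no write */
	fs->fmod = 0;
	fs->sbwrites++;		/* stands in for the disk write */
}
```

A sync with no intervening change costs nothing here, while a flag that
is set at mount time and never cleared would carry no information for
this test.
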