Date: Sun, 11 Jan 2004 09:55:09 +1100
From: Peter Jeremy
To: Tom Arnold
Cc: freebsd-hackers@freebsd.org
Subject: Re: Large Filesystem Woes
Message-ID: <20040110225509.GA60996@server.vk2pj.dyndns.org>
In-Reply-To: <20040109193551.GD39751@moo.sysabend.org>

On Fri, Jan 09, 2004 at 11:35:51AM -0800, Tom Arnold wrote:
>Building a box thats going to house many billions of small files. Think
>innd circa 1998 or someone trying to house AOLs mail system on cyrus or
>something.

This is probably going to stress any filesystem.  You might like to
consider an alternative approach to storing the files (e.g. some sort
of database).

>To this end I've hung a 3.3TB hardware raid off a BSD box
>broken into 4 partitions. 3 1TB and 1 300GB.
>Originally this was on a 4.9 box. da0s1 and da0s2 were formatted "stock"
>( -f 2048 -b 16384 -i 8192 ) da1s1 and s2 were both formatted -f 512 -b 4096
>-i 512.

I ran '-f 512 -b 4096' on a news server for a while but I found that
'-f 1024 -b 8192' significantly improved performance (at the cost of
a significant increase in disk space usage).

>Switched to 5.2. Newfs'd the RAID for UFS2. First issue, if the machine
>came up dirty, bgfsck seemed to do its thing and the machine was online and
>usable after about 20 minutes however after a few hours I get this error :
>
>fsck: /dev/da1s1e: CANNOT CREATE SNAPSHOT /export/database/.snap/fsck_snapshot: File too large
>fsck: /dev/da1s1e: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

I can't explain this.  The message means that mount(2) returned EFBIG,
which isn't a documented error for it.  I had a quick look through the
sources and can't see why EFBIG would get returned.

>And the second thing I've noticed is I have lost a lot of space.
>Under 4.9 with UFS da1s1e was approx 870gigs and s2e was around 180, now
>I see :
>Filesystem    Size   Used  Avail Capacity iused      ifree %iused  Mounted on
>/dev/da0s1e   992G   4.0K   912G     0%       2  134411260    0%   /export/logs1
>/dev/da0s2e   992G   4.0K   912G     0%       2  134411260    0%   /export/logs2
>/dev/da1s1e   510G   1.0K   469G     0%       2 2148661228    0%   /export/database
>/dev/da1s2e    94G   1.0K    86G     0%       2  395214332    0%   /export/spare

The size of a UFS1 inode is 128 bytes and a UFS2 inode is 256 bytes.
'-i 512' asks for one inode per 512 bytes of data space, so UFS2
allocates about half of your disk space to inodes.  (And you have a
further overhead of 8 bytes + name for each directory entry).

>I'm not certain if I've run into some kind of weird limit here or a bug or
>what and am looking for ideas to persue before I'm stuck going to an OS with
>something journaled.

Inode numbers are supposed to be u_int32_t but it's possible that they
are being (incorrectly) treated as signed somewhere (and you have more
than 2^31 inodes on da1s1e).

Moving to a journalled filesystem won't necessarily help.  I use
DEC/Compaq/HP AdvFS at work - each file needs at least 282 bytes of
metadata (under some circumstances, it can require multiple 282 byte
metadata blocks) and from memory it is limited to 2^31 (or maybe 2^32)
files.  Our main fileserver has a filesystem with 2.7e6 files and we
are continually running into undocumented "features" (aka bugs) as a
result of the large number of files.  (OTOH, I have no problems with
1.9e6 files in a UFS1 partition on a FreeBSD box).

Peter
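
The two numeric points made above (the share of disk space taken by inodes, and the inode count on da1s1e exceeding 2^31) can be checked with a short Python sketch. It is only a rough illustration: the function names are hypothetical, and the overhead is approximated as inode size divided by the '-i' density, ignoring cylinder-group rounding and any other metadata, so real newfs figures will differ somewhat.

```python
# Rough check of the inode arithmetic discussed in this message.
# Assumption: inode overhead ~= inode size / bytes-per-inode ('-i' value).

UFS1_INODE_BYTES = 128  # on-disk inode size for UFS1 (from the message)
UFS2_INODE_BYTES = 256  # on-disk inode size for UFS2 (from the message)


def inode_fraction(inode_bytes, bytes_per_inode):
    """Approximate fraction of data space consumed by the inode tables."""
    return inode_bytes / bytes_per_inode


def as_signed_32(value):
    """Reinterpret an unsigned 32-bit quantity as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


# 'ifree' for /dev/da1s1e, taken from the df output quoted above.
ifree_da1s1e = 2148661228

print(inode_fraction(UFS1_INODE_BYTES, 512))   # UFS1 with -i 512
print(inode_fraction(UFS2_INODE_BYTES, 512))   # UFS2 with -i 512
print(inode_fraction(UFS2_INODE_BYTES, 8192))  # UFS2 with the "stock" -i 8192
print(ifree_da1s1e > 2**31)                    # does the count exceed 2^31?
print(as_signed_32(ifree_da1s1e))              # value if mistaken for signed
```

Under these assumptions the overhead for '-i 512' doubles from a quarter to a half when going from UFS1 to UFS2, while the stock '-i 8192' density costs only about 3%. The inode count on da1s1e turns negative if any code path stores it in a signed 32-bit variable, which is the kind of bug suspected above.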