From owner-freebsd-current@FreeBSD.ORG Thu Apr 22 13:31:57 2004
Date: Thu, 22 Apr 2004 16:31:39 -0400 (EDT)
From: Robert Watson
To: Eric Anderson
cc: freebsd-current@freebsd.org
Subject: Re: Directories with 2million files
In-Reply-To: <40867A5D.9010600@centtech.com>

On Wed, 21 Apr 2004, Eric Anderson wrote:

> First, let me say that I am impressed (but not shocked) - FreeBSD
> quietly handled my building of a directory with 2055476 files in it.
> I'm not sure if there is a limit to this number, but at least we know it
> works to 2million.  I'm running 5.2.1-RELEASE.
>
> However, several tools seem to choke on that many files - mainly ls and
> du.  Find works just fine.  Here's what my directory looks like (from
> the parent):

Directories with millions of entries turn up surprisingly frequently, actually.
While FreeBSD handles them quite well, they're something that's not
frequently optimized for in applications:

  cyrus# /usr/bin/time \ls -f | wc
          1.86 real         1.20 user         0.34 sys
    338806  338806 2599362
  cyrus# /usr/bin/time \ls | wc
          6.48 real         4.39 user         0.28 sys
    338807  338807 2599370

> I'd work on some patches, but I'm not worth much when it comes to C/C++.

Unfortunately, a lot of this has to do with the desire to have programs
behave nicely in ways that scale well only to a limited extent.  I.e.,
sorting and sizing of output.  If you have algorithms that require all
elements in a large array be in memory, such as sorting algorithms, it's
inevitably going to hurt.  And with text applications designed to run in
command pipelines, to POSIX specs, etc, there isn't a whole lot of room to
generate warnings like:

  cyrus# ls
  ls: Holy cow, you have a lot of files.  You might want to disable sorting.
  ...

> If someone has some patches, or code to try, let me know - I'd be more
> than willing to test, possibly even give out an account on the machine.

Efficiency improvements will generally always be welcome, as long as
they're correct and don't overly complicate the implementation.

For what it's worth, I've noticed a lot of tools are getting better about
handling large numbers of (whatevers).  For example, when I pointed
Mozilla at an IMAP mail folder with 100,000 messages in it, it would
reread the mailbox index every 60 seconds if there was a mailbox change.
If you add one message to the mailbox a minute, it will never stop
rereading the index if it takes over 59 seconds to read the index, which
over a WAN it would.  Recent versions are *much* smarter, and appear in
many cases to scale to millions of messages, which is what I keep in my
large directories :-).

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research
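
The message attributes much of the cost on huge directories to sorting and
sizing of output, both of which need every entry in memory before anything
is printed. A minimal Python sketch of the two approaches follows. It
assumes nothing about how ls(1) itself is implemented, and the function
names (iter_names, sorted_names, count_entries) are hypothetical, chosen
only for illustration.

```python
import os
import tempfile


def iter_names(path):
    """Yield entry names in directory order, one at a time.

    Roughly the spirit of `ls -f`: no sorting, so memory use stays flat.
    Note that os.scandir() omits the "." and ".." entries.
    """
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name


def sorted_names(path):
    """Return every entry name, sorted.

    Roughly the spirit of plain `ls`: the whole listing must be held in
    memory before the first name can be emitted.
    """
    return sorted(iter_names(path))


def count_entries(path):
    """Count entries without ever building a list of them."""
    return sum(1 for _ in iter_names(path))


if __name__ == "__main__":
    # Small demonstration directory; a real test would use far more files.
    with tempfile.TemporaryDirectory() as d:
        for i in range(1000):
            open(os.path.join(d, f"f{i:04d}"), "w").close()
        print(count_entries(d))
        print(sorted_names(d)[:3])
```

The streaming version can start producing output immediately and suits
pipelines such as counting with wc, while the sorted version pays in both
time and memory as the directory grows.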