From owner-freebsd-hackers  Sat May 26 15:42:32 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from peter3.wemm.org (c1315225-a.plstn1.sfba.home.com [65.0.135.147])
	by hub.freebsd.org (Postfix) with ESMTP id 032DC37B422
	for <hackers@FreeBSD.ORG>; Sat, 26 May 2001 15:42:30 -0700 (PDT)
	(envelope-from peter@wemm.org)
Received: from overcee.netplex.com.au (overcee.wemm.org [10.0.0.3])
	by peter3.wemm.org (8.11.0/8.11.0) with ESMTP id f4QMgTM05232
	for <hackers@FreeBSD.ORG>; Sat, 26 May 2001 15:42:29 -0700 (PDT)
	(envelope-from peter@wemm.org)
Received: from wemm.org (localhost [127.0.0.1])
	by overcee.netplex.com.au (Postfix) with ESMTP
	id B4B97380E; Sat, 26 May 2001 15:42:29 -0700 (PDT)
	(envelope-from peter@wemm.org)
X-Mailer: exmh version 2.3.1 01/18/2001 with nmh-1.0.4
To: areilly@bigpond.net.au (Andrew Reilly), hackers@FreeBSD.ORG
Subject: Re: technical comparison 
In-Reply-To: <200105262214.CAA21056@aaz.links.ru> 
Date: Sat, 26 May 2001 15:42:29 -0700
From: Peter Wemm <peter@wemm.org>
Message-Id: <20010526224229.B4B97380E@overcee.netplex.com.au>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-hackers.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo?subject=subscribe%20freebsd-hackers>
List-Unsubscribe: <mailto:majordomo?subject=unsubscribe%20freebsd-hackers>
X-Loop: FreeBSD.ORG

.@babolo.ru wrote:
> Andrew Reilly writes:
> ....
> > /usr/ports/distfiles on any of the mirrors probably contains
> upwards of 5000 files too, and there is a strong likelihood that
> > these will be accessed out-of-order by ports-makefile-driven
> > fetch requests.
> Oh!
> You point out a good example!
> 0cicuta~(13)>/bin/ls /usr/ports/distfiles/ | wc
>     9672    9672  198244

.. All of which ends up almost entirely in the name cache, which is hashed.
Once you scan the directory for the first time, the entries are pre-inserted
into the hash.  This cache is very long lived and is quite effective at
dealing with this sort of thing, especially if you have plenty of memory
and have set vfs.vmiodirenable=1.  While it may not scale too well to
directories with millions of files, it certainly deals well with tens of
thousands of files.  We have recently made improvements to the hashing
algorithms to get better dispersion on short, sequential filenames, e.g.
00, 01, 02 ... FF.
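
To illustrate why dispersion matters for names like these, here is a minimal
sketch. Both hash functions are assumptions picked for illustration (a naive
byte sum and 32-bit FNV-1a); neither is claimed to be what the kernel name
cache used before or after the change, and the 1024-bucket table size is
likewise arbitrary.

```python
def additive_hash(name: bytes) -> int:
    """Naive hash (illustrative assumption): sum of the name's bytes."""
    return sum(name)


def fnv1a_32(name: bytes) -> int:
    """32-bit FNV-1a (illustrative assumption, not necessarily the kernel's hash)."""
    h = 0x811C9DC5                          # FNV-1a 32-bit offset basis
    for byte in name:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF   # multiply by the FNV prime, keep 32 bits
    return h


def buckets_used(hash_fn, names, nbuckets: int) -> int:
    """Count how many distinct buckets the names land in."""
    return len({hash_fn(n) % nbuckets for n in names})


# The short, sequential names from the text: 00, 01, 02 ... FF
NAMES = [b"%02X" % i for i in range(256)]

if __name__ == "__main__":
    for fn in (additive_hash, fnv1a_32):
        print(fn.__name__, buckets_used(fn, NAMES, 1024))
```

Under these assumptions the byte-sum hash piles the 256 names into a few
dozen buckets, because many two-character names share the same sum, while
FNV-1a spreads them over considerably more.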

It is not perfect, but it is a hell of a lot better than the false
assumption that the linear search method is the usual case.
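
A toy model of that point, assuming a directory stored as an unsorted list
(standing in for the simple on-disk format) with a dictionary in front of it
(standing in for the hashed name cache). The class and method names are
hypothetical, and it ignores cache eviction and negative entries; it only
shows that the linear search is paid on a cold miss, not on every lookup.

```python
class Directory:
    """Toy model: linear 'on-disk' entries plus an in-memory hashed cache."""

    def __init__(self, names):
        self.entries = list(names)   # stands in for the linear on-disk format
        self.cache = {}              # stands in for the hashed name cache
        self.comparisons = 0         # name comparisons done by linear scans

    def scan(self):
        """Read the whole directory once (as ls would), pre-inserting entries."""
        for index, name in enumerate(self.entries):
            self.cache[name] = index
        return len(self.entries)

    def lookup(self, name):
        """Return the entry's index, or None if the name is absent."""
        if name in self.cache:
            return self.cache[name]          # cache hit: no linear search
        for index, entry in enumerate(self.entries):
            self.comparisons += 1            # cache miss: walk the entries
            if entry == name:
                self.cache[name] = index     # remember it for next time
                return index
        return None


if __name__ == "__main__":
    # 9672 matches the distfiles count quoted above; the names are made up.
    d = Directory("distfile-%d.tar.gz" % i for i in range(9672))
    d.scan()
    d.lookup("distfile-9671.tar.gz")
    print("linear comparisons after a warm lookup:", d.comparisons)
```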

Which is more expensive?  Maintaining an on-disk hashed (or B+tree)
directory format for *everything*, or maintaining a simple low-cost format
on disk with in-memory hashing for fast lookups?  For the small directory
case I suspect the FFS+namecache way is more cost effective.  For the
medium to large directory case (10,000 to 100,000 entries), I suspect the
FFS+namecache method isn't too shabby, provided you are not starved for
memory.  For the insanely large cases - I don't want to think about it :-).

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5

