From: Peter Wemm
To: areilly@bigpond.net.au (Andrew Reilly), hackers@FreeBSD.ORG
Subject: Re: technical comparison
In-Reply-To: <200105262214.CAA21056@aaz.links.ru>
Date: Sat, 26 May 2001 15:42:29 -0700
Message-Id: <20010526224229.B4B97380E@overcee.netplex.com.au>

.@babolo.ru wrote:
> Andrew Reilly writes:
> ....
> > /usr/ports/distfiles on any of the mirrors probably contains
> > upwards of 5000 files too, and there is a strong likelyhood that
> > these will be accessed out-of-order by ports-makefile-driven
> > fetch requests.
> Oh!
> You point a good example!
> 0cicuta~(13)>/bin/ls /usr/ports/distfiles/ | wc
>     9672    9672  198244
..

Which is almost entirely stored in the name cache, which is hashed. Once
you scan the directory for the first time, the entries are pre-inserted
into the hash. This cache is very long lived and is quite effective at
dealing with this sort of thing, especially if you have plenty of memory
and have vfs.vmiodirenable=1 turned on.

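To make that concrete, here is a rough toy model of the idea: a simple linear directory list fronted by an in-memory hash that gets filled on the first scan. This is only a sketch under stated assumptions, not the FreeBSD namecache code. ToyDirectory, scan, lookup and entries_examined are hypothetical names made up for illustration, and the model ignores cache eviction, memory pressure and vfs.vmiodirenable entirely.

```python
# Toy model only. Every name here is hypothetical; none of this is a FreeBSD API.

class ToyDirectory:
    def __init__(self, names):
        # "On-disk" format: a plain unsorted list of (name, inode) entries.
        self.entries = [(name, ino) for ino, name in enumerate(names, start=2)]
        self.cache = {}            # in-memory name cache (a hash table)
        self.entries_examined = 0  # how many on-disk entries we had to walk

    def scan(self):
        """Read the whole directory once, pre-inserting every entry into the hash."""
        for name, ino in self.entries:
            self.entries_examined += 1
            self.cache[name] = ino

    def lookup(self, name):
        """Return the inode for name, or None. Falls back to a linear search on a miss."""
        if name in self.cache:
            return self.cache[name]
        for entry_name, ino in self.entries:
            self.entries_examined += 1
            if entry_name == name:
                self.cache[name] = ino  # remember it for next time
                return ino
        return None


# Same entry count as the "ls | wc" output quoted above; the file names are invented.
names = [f"distfile-{i:04d}.tar.gz" for i in range(9672)]
last = names[-1]

cold = ToyDirectory(names)
cold.lookup(last)            # no prior scan: walks the whole list

warm = ToyDirectory(names)
warm.scan()                  # e.g. the effect of an earlier "ls"
scan_cost = warm.entries_examined
warm.lookup(last)            # answered from the hash, no list walk

print("cold lookup examined:", cold.entries_examined)
print("warm lookup examined:", warm.entries_examined - scan_cost)
```

The on-disk side stays as cheap as it can be (an append-only list), and the cost of the linear walk is paid once per directory rather than once per lookup, as long as the cache survives.
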
While it may not scale too well to directories with millions of files, it
certainly deals well with tens of thousands of files. We have recently
made improvements to the hashing algorithms to get better dispersion on
small and iterative filenames, e.g. 00, 01, 02 -> FF. It is not perfect,
but it is a hell of a lot better than the false assumption that the linear
search method is the usual case.

Which is more expensive? Maintaining an on-disk hashed (or b+tree)
directory format for *everything*, or maintaining a simple low-cost format
on disk with in-memory hashing for fast lookups?

For the small directory case, I suspect the FFS+namecache way is more cost
effective. For the medium to large directory case (10,000 to 100,000
entries), I suspect the FFS+namecache method isn't too shabby, provided
you are not starved for memory. For the insanely large cases, I don't want
to think about it :-).

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5