From owner-freebsd-fs  Thu Apr 24 07:04:26 1997
Return-Path: <owner-fs>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.5/8.8.5) id HAA07677
          for fs-outgoing; Thu, 24 Apr 1997 07:04:26 -0700 (PDT)
Received: from Sisyphos.MI.Uni-Koeln.DE (Sisyphos.MI.Uni-Koeln.DE [134.95.212.10])
          by hub.freebsd.org (8.8.5/8.8.5) with SMTP id HAA07667
          for <fs@freebsd.org>; Thu, 24 Apr 1997 07:04:20 -0700 (PDT)
Received: from x14.mi.uni-koeln.de (annexr3-1.slip.Uni-Koeln.DE) by Sisyphos.MI.Uni-Koeln.DE with SMTP id AA01677
  (5.67b/IDA-1.5 for <fs@freebsd.org>); Thu, 24 Apr 1997 16:03:47 +0200
Received: (from se@localhost) by x14.mi.uni-koeln.de (8.8.5/8.6.9) id QAA23373; Thu, 24 Apr 1997 16:03:39 +0200 (CEST)
Message-Id: <19970424160338.57282@x14.mi.uni-koeln.de>
Date: Thu, 24 Apr 1997 16:03:38 +0200
From: Stefan Esser <se@freebsd.org>
To: Don Lewis <Don.Lewis@tsc.tdk.com>
Cc: dg@root.com, Poul-Henning Kamp <phk@dk.tfs.com>,
        Michael Hancock <michaelh@cet.co.jp>, fs@freebsd.org
Subject: Re: the namei cache...
References: <dg@root.com> <199704241256.FAA11674@salsa.gv.tsc.tdk.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.68
In-Reply-To: <199704241256.FAA11674@salsa.gv.tsc.tdk.com>; from Don Lewis on Thu, Apr 24, 1997 at 05:56:01AM -0700
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

On Apr 24, Don Lewis <Don.Lewis@tsc.tdk.com> wrote:
> Yes, most processors are really bad at integer divide.  You might look at
> the Perl hash.  You multiply the current value by 33 and add the next
> character, then mask off however many bits you want at the end (the table
> size must be a power of 2).  If your compiler is at all smart, it will
> notice that the multiply is just a shift and an add.  I suspect this might
> be pretty easy to just drop in to the existing code for a quick trial.

I used the following as a hash function, after looking up
algorithms in Knuth's book and studying lots of examples
in several software packages. I needed a fast hash, which
should work on power of 2 size tables, with keys mostly
being ASCII text. I did quite some profiling and testing
with my data sets, and the following gave good results:

#define SHVAL		(8)
#define MULTIPLIER	(0x5425daf1)

unsigned long 
hashf (unsigned long hashmax, void *key, size_t keylen)
{
    unsigned long i = 0;
    unsigned char *p = key;
    while (keylen--)
    {
	unsigned char c = *p++;
	i -= (i << 4) ^ c;
/*      i -= (i << 4) + (i >> 23) ^ c;*/
    }
    i *= MULTIPLIER;
    i >>= SHVAL;
    return i & hashmax;
}

This lead to significantly less collisions than other,
simpler hash functions. I tried other versions that fed
back some of the high order bits, but surprisingly found
that this did not help reduce the number of collisions,
and thus was not worth the additional shift and add.
But this is possibly not true in the case of file names
that tend to have a common suffix (.tar.gz, for example).

(The value of hashmax must obviously be a power of 2
minus 1 ...)

I used tables from a few thousand up to a few hundred
thousand entries, BTW.

Regards, STefan