Date: Wed, 8 May 2019 08:50:37 -0700 From: Conrad Meyer <cem@freebsd.org> To: Rick Macklem <rmacklem@uoguelph.ca> Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org> Subject: Re: test hash functions for fsid Message-ID: <CAG6CVpUER0ODF5mfdGFA-soSpGbh=qkjc4rKiJEdaOUaUaUmAw@mail.gmail.com> In-Reply-To: <YQBPR0101MB2260D82BAE348FB82902508CDD320@YQBPR0101MB2260.CANPRD01.PROD.OUTLOOK.COM> References: <YQBPR0101MB2260D82BAE348FB82902508CDD320@YQBPR0101MB2260.CANPRD01.PROD.OUTLOOK.COM>
next in thread | previous in thread | raw e-mail | index | archive | help
Hi Rick, On Wed, May 8, 2019 at 7:39 AM Rick Macklem <rmacklem@uoguelph.ca> wrote: > If you have a FreeBSD system with lots of local file systems (1000s), maybe you could > run the attached program on it and email me the stats. > I'm just trying to see how well these hash functions work on fsids. Below I'll get to some comments on hash functions and hash tables, but first: have you considered a non-hash data structure, such as PCTRIE? PCTRIE_DEFINE() creates a datastructure with similar properties to a hash table for this use, but does not require a hash function step at all because all fsids are already unique, fixed-size, 64-bit values. It may be especially space-efficient for non-ZFS FSIDs, when the 64-bit key is composed with '(fsid.val[1] << 32 | fsid.val[0])'. (Elaborated on more below.(A)) Usually it makes sense to size hash tables for the size of the data; for direct insert tables you want to keep the load factor below 0.8 or something like that. So the hardcoded 256 doesn't make a ton of sense, IMO. As far as hash functions, both (hash32_buf == djb2, FNV32) are really weak, but importantly for hash tables, fast. https://github.com/Cyan4973/smhasher#summary provides some broader context, but keep in mind most of those algorithms are targeting much longer keys than 8 bytes. I would guess that h1/h3 will be noticably worse than h2/h4 without performing any better, but I don't have data to back that up. (If you were to test calculate_crc32c() or XXH32/XXH64, included in sys/contrib/zstd, you might observe fewer collisions and better distribution. But that isn't necessarily the most important factor for hash table performance in this use.) (A) FSIDs themselves have poor initial distribution and *very* few unique bits (i.e., for filesystems that use vfs_getnewfsid, the int32 fsid val[1] will be identical across all filesystems of the same type, such as NFS mounts). The remaining val[0] is unique due to an incrementing global counter. You could pretty easily extract the vfs_getnewfsid() code out to userspace to simulate creating a bunch of fsid_t's and run some tests on that list. It isn't a real world distribution but it would probably be pretty close, outside of ZFS. ZFS FSIDs have 8 bits of shared vfs type, but the remaining 56 bits appears to be generated by arc4rand(9), so ZFS FSIDs will have good distribution even without hashing. Anyway, maybe this is helpful, maybe not. Thanks for working on scaling exports, and NFS in general! It is appreciated. Take care, Conrad
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAG6CVpUER0ODF5mfdGFA-soSpGbh=qkjc4rKiJEdaOUaUaUmAw>