Date: Fri, 28 Nov 2003 13:34:57 -0800 (PST) From: Don Lewis <truckman@FreeBSD.org> To: kmarx@vicor.com Cc: mckusick@beastie.mckusick.com Subject: Re: 4.8 ffs_dirpref problem Message-ID: <200311282135.hASLYveF018257@gw.catspoiler.org> In-Reply-To: <200311281107.hASB7BeF017371@gw.catspoiler.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On 28 Nov, To: kmarx@vicor.com wrote: > On 24 Nov, Ken Marx wrote: >> >> >> Don Lewis wrote: > >>> Index: sys/kern/vfs_bio.c >>> =================================================================== >>> RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v >>> retrieving revision 1.242.2.21 >>> diff -u -r1.242.2.21 vfs_bio.c >>> --- sys/kern/vfs_bio.c 9 Aug 2003 16:21:19 -0000 1.242.2.21 >>> +++ sys/kern/vfs_bio.c 18 Nov 2003 02:10:55 -0000 >>> @@ -140,6 +140,7 @@ >>> &bufreusecnt, 0, ""); >>> >>> static int bufhashmask; >>> +static int bufhashshift; >>> static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash; >>> struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } }; >>> char *buf_wmesg = BUF_WMESG; >>> @@ -160,7 +161,20 @@ >>> struct bufhashhdr * >>> bufhash(struct vnode *vnp, daddr_t bn) >>> { >>> - return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]); >>> + u_int64_t hashkey64; >>> + int hashkey; >>> + >>> + /* >>> + * Fibonacci hash, see Knuth's >>> + * _Art of Computer Programming, Volume 3 / Sorting and Searching_ >>> + * >>> + * We reduce the argument to 32 bits before doing the hash to >>> + * avoid the need for a slow 64x64 multiply on 32 bit platforms. >>> + */ >>> + hashkey64 = (u_int64_t)(uintptr_t)vnp + (u_int64_t)bn; >>> + hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 2654435769u) >> >>> + bufhashshift) & bufhashmask; >>> + return(&bufhashtbl[hashkey]); >>> } >>> >>> /* >>> @@ -319,8 +333,9 @@ >>> bufhashinit(caddr_t vaddr) >>> { >>> /* first, make a null hash table */ >>> + bufhashshift = 29; >>> for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1) >>> - ; >>> + bufhashshift--; >>> bufhashtbl = (void *)vaddr; >>> vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask; >>> --bufhashmask; >>> >>> >> >> Well, I'm mildly beflummoxed - I tried to compare hashtable preformance >> between all three known versions of the hashing - legacy power of 2, >> the Vicor ^= hash, and Don's fibonacci hash. >> >> Running with >> >> minifree = max( 1, avgifree / 4 ); >> minbfree = max( 1, avgbfree ); >> >> all perform about the same, with no performance problems all >> the way up to 100% disk capacity (didn't test into reserved space). >> >> Looking at instrumentation to show freq and avg depth of the >> hash buckets, everything seems very calm (mainly because >> we're not hitting the linear searching very often, I'd presume). >> >> I can't explain why I seemlingly got performance problems >> with similar (identical) minbfree code previously. >> >> So, out of spite, I went back to >> >> minbfree = max( 1, avgbfree/4 ); >> >> This does hit the hashtable harder for the legacy version >> and not so much for either new flavor. Here are a few >> samplings of calling my dump routine from the debugger. >> "avgdepth" really means 'search depth' since we use >> the depth reached after finding a bp in gbincore. >> >> The line below such as, >> >> 0: avgdepth[1] cnt=801 >> >> means that 801 of the hashtable buckets had an avg search >> depth of 1 at the time the debug routine was called. >> The 'N:' prefix means the N-th unique non-zero such value. >> So large cnt's for small []'d depth values means an efficient hash. >> >> I've edited out the details as much as possible. >> >> LEGACY: >> -------- >> Nov 24 13:34:54 oos0b /kernel: bh[442/0x1ba]: freq=2706110, avgdepth = 154 >> ... >> Nov 24 13:34:54 oos0b /kernel: 0: avgdepth[1] cnt=1015 >> Nov 24 13:34:54 oos0b /kernel: 1: avgdepth[2] cnt=7 >> Nov 24 13:34:54 oos0b /kernel: 2: avgdepth[154] cnt=1 <- !! >> Nov 24 13:34:54 oos0b /kernel: 3: avgdepth[3] cnt=1 >> ----------- >> >> Nov 24 13:36:49 oos0b /kernel: bh[442/0x1ba]: freq=3416953, avgdepth = 141 >> ... >> Nov 24 13:36:49 oos0b /kernel: 0: avgdepth[1] cnt=1017 >> Nov 24 13:36:49 oos0b /kernel: 1: avgdepth[141] cnt=1 >> Nov 24 13:36:49 oos0b /kernel: 2: avgdepth[2] cnt=6 >> >> VICOR x-or hashtable: >> --------------------- >> Nov 24 13:07:24 oos0b /kernel: 0: avgdepth[1] cnt=762 >> Nov 24 13:07:24 oos0b /kernel: 1: avgdepth[2] cnt=259 >> Nov 24 13:07:24 oos0b /kernel: 2: avgdepth[3] cnt=3 >> ----------- >> >> Nov 24 13:08:07 oos0b /kernel: 0: avgdepth[1] cnt=744 >> Nov 24 13:08:07 oos0b /kernel: 1: avgdepth[2] cnt=275 >> Nov 24 13:08:07 oos0b /kernel: 2: avgdepth[3] cnt=5 >> >> FIBONACCI: >> ---------- >> Nov 24 11:56:50 oos0b /kernel: 0: avgdepth[1] cnt=811 >> Nov 24 11:56:50 oos0b /kernel: 1: avgdepth[3] cnt=88 >> Nov 24 11:56:50 oos0b /kernel: 2: avgdepth[2] cnt=124 >> Nov 24 11:56:50 oos0b /kernel: 3: avgdepth[0] cnt=1 >> ----------- >> >> Nov 24 11:57:48 oos0b /kernel: 0: avgdepth[1] cnt=801 >> Nov 24 11:57:48 oos0b /kernel: 1: avgdepth[3] cnt=93 >> Nov 24 11:57:48 oos0b /kernel: 2: avgdepth[2] cnt=130 >> >> So, while this is far from analytically eshaustive, >> it almost appears the fibonacci hash has more entries >> of depth 3, while the Vicor one has more at depth 2. >> >> I'm happy to run more tests if you have ideas. I'm also fine >> to cut bait and go with whatever you decide. It *seems* like >> putting the fibonacci hash is prudent since the current hash >> has been observed to be expensive. I had trouble proving this >> unequivocally though. So, perhaps Don's minbfree fix is sufficient >> after all. I'm tempted at this point to go with the 100% flavor. > > I think we're running into one of the weaknesses in the Fibonacci hash. > There are a large number of hash entries for the cylinder group blocks, > which are located at offsets which are multiples of 89 * 2^10 in your > example, or something on the order of 2^16. The effect of this is for > the cylinder group number to be hashed using the least significant bits > of the hash multiplier, which don't work as well for distributing the > hash values. I tried some of Knuth's suggestions, and got better > results with the hash multiplier 0x9E376DB1u. The most significant 16 > bits of the multplier are the same as the original constant, and the > least significant bits act as a fraction in the desirable range of 1/3 > to 3/7. Please give this new hash multiplier a try. I went ahead and spun a new version of my patch with the new multiplier, one other tweak to the formula, and updated comments. Index: sys/kern/vfs_bio.c =================================================================== RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v retrieving revision 1.242.2.21 diff -u -r1.242.2.21 vfs_bio.c --- sys/kern/vfs_bio.c 9 Aug 2003 16:21:19 -0000 1.242.2.21 +++ sys/kern/vfs_bio.c 28 Nov 2003 20:02:06 -0000 @@ -140,6 +140,7 @@ &bufreusecnt, 0, ""); static int bufhashmask; +static int bufhashshift; static LIST_HEAD(bufhashhdr, buf) *bufhashtbl, invalhash; struct bqueues bufqueues[BUFFER_QUEUES] = { { 0 } }; char *buf_wmesg = BUF_WMESG; @@ -160,7 +161,40 @@ struct bufhashhdr * bufhash(struct vnode *vnp, daddr_t bn) { - return(&bufhashtbl[(((uintptr_t)(vnp) >> 7) + (int)bn) & bufhashmask]); + u_int64_t hashkey64; + int hashkey; + + /* + * A variation on the Fibonacci hash that Knuth credits to + * R. W. Floyd, see Knuth's _Art of Computer Programming, + * Volume 3 / Sorting and Searching_ + * + * We reduce the argument to 32 bits before doing the hash to + * avoid the need for a slow 64x64 multiply on 32 bit platforms. + * + * sizeof(struct vnode) is 168 on i386, so toss some of the lower + * bits of the vnode address to reduce the key range, which + * improves the distribution of keys across buckets. + * + * The file system cylinder group blocks are very heavily + * used. They are located at invervals of fbg, which is + * on the order of 89 to 94 * 2^10, depending on other + * filesystem parameters, for a 16k block size. Smaller block + * sizes will reduce fpg approximately proportionally. This + * will cause the cylinder group index to be hashed using the + * lower bits of the hash multiplier, which will not distribute + * the keys as uniformly in a classic Fibonacci hash where a + * relatively small number of the upper bits of the result + * are used. Using 2^16 as a close-enough approximation to + * fpg, split the hash multiplier in half, with the upper 16 + * bits being the inverse of the golden ratio, and the lower + * 16 bits being a fraction between 1/3 and 3/7 (closer to + * 3/7 in this case), that gives good experimental results. + */ + hashkey64 = ((u_int64_t)(uintptr_t)vnp >> 3) + (u_int64_t)bn; + hashkey = (((u_int32_t)(hashkey64 + (hashkey64 >> 32)) * 0x9E376DB1u) >> + bufhashshift) & bufhashmask; + return(&bufhashtbl[hashkey]); } /* @@ -319,8 +353,9 @@ bufhashinit(caddr_t vaddr) { /* first, make a null hash table */ + bufhashshift = 29; for (bufhashmask = 8; bufhashmask < nbuf / 4; bufhashmask <<= 1) - ; + bufhashshift--; bufhashtbl = (void *)vaddr; vaddr = vaddr + sizeof(*bufhashtbl) * bufhashmask; --bufhashmask;
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200311282135.hASLYveF018257>