From owner-freebsd-fs@freebsd.org Wed May 8 17:33:25 2019 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8A833158ED88 for ; Wed, 8 May 2019 17:33:25 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id 3ED8C86A81; Wed, 8 May 2019 17:33:13 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from [192.168.0.102] (c110-21-101-228.carlnfd1.nsw.optusnet.com.au [110.21.101.228]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id C6B64438F16; Thu, 9 May 2019 03:33:02 +1000 (AEST) Date: Thu, 9 May 2019 03:33:01 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Conrad Meyer cc: Rick Macklem , "freebsd-fs@freebsd.org" Subject: Re: test hash functions for fsid In-Reply-To: Message-ID: <20190509015851.J1574@besplex.bde.org> References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=P6RKvmIu c=1 sm=1 tr=0 cx=a_idp_d a=PalzARQSbocsUSjMRkwAPg==:117 a=PalzARQSbocsUSjMRkwAPg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=kj9zAlcOel0A:10 a=hF2rLc1pAAAA:8 a=oLN-b00mV-6PYYPs_TgA:9 a=7I_NG7zuy4-b4ME3:21 a=VU27sKedNTBlt2fD:21 a=CjuIK1q_8ugA:10 a=O9OM7dhJW_8Hj9EqqvKN:22 X-Rspamd-Queue-Id: 3ED8C86A81 X-Spamd-Bar: ----- Authentication-Results: mx1.freebsd.org; spf=pass (mx1.freebsd.org: domain of brde@optusnet.com.au designates 211.29.132.246 as permitted sender) smtp.mailfrom=brde@optusnet.com.au X-Spamd-Result: default: False [-5.40 / 15.00]; ARC_NA(0.00)[]; TO_DN_EQ_ADDR_SOME(0.00)[]; RCVD_IN_DNSWL_LOW(-0.10)[246.132.29.211.list.dnswl.org : 127.0.5.1]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[3]; R_SPF_ALLOW(-0.20)[+ip4:211.29.132.0/23]; FREEMAIL_FROM(0.00)[optusnet.com.au]; MIME_GOOD(-0.10)[text/plain]; DMARC_NA(0.00)[optusnet.com.au]; NEURAL_HAM_LONG(-1.00)[-1.000,0]; TO_DN_SOME(0.00)[]; MX_INVALID(0.50)[greylisted]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-0.93)[-0.931,0]; NEURAL_HAM_MEDIUM(-1.00)[-1.000,0]; IP_SCORE(-2.67)[ip: (-7.32), ipnet: 211.28.0.0/14(-3.35), asn: 4804(-2.67), country: AU(-0.01)]; RCVD_NO_TLS_LAST(0.10)[]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; FREEMAIL_ENVFROM(0.00)[optusnet.com.au]; ASN(0.00)[asn:4804, ipnet:211.28.0.0/14, country:AU]; MIME_TRACE(0.00)[0:+]; RCVD_COUNT_TWO(0.00)[2] X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 08 May 2019 17:33:25 -0000 On Wed, 8 May 2019, Conrad Meyer wrote: > On Wed, May 8, 2019 at 7:39 AM Rick Macklem wrote: >> If you have a FreeBSD system with lots of local file systems (1000s), maybe you could >> run the attached program on it and email me the stats. >> I'm just trying to see how well these hash functions work on fsids. > > Below I'll get to some comments on hash functions and hash tables, but > first: have you considered a non-hash data structure, such as PCTRIE? > PCTRIE_DEFINE() creates a datastructure with similar properties to a > hash table for this use, but does not require a hash function step at > all because all fsids are already unique, fixed-size, 64-bit values. > It may be especially space-efficient for non-ZFS FSIDs, when the > 64-bit key is composed with '(fsid.val[1] << 32 | fsid.val[0])'. > (Elaborated on more below.(A)) fsids must be unique in 32 bits, especially for nfs uses, since they are usually blindly truncated to 32 bits in nfs clients that have 32-bit dev_t to form st_dev, but st_dev must be unique. The FreeBSD-5 client blindly truncates: vap->va_fsid = vp->v_mount->mnt_stat.f_fsid.val[0]; The FreeBSD-11 client does the same for nfsv3, but for nfsv4 it does hashing stuff to convert from a 128-bit (?) na_filesid to 32 bits. va_fsid isn't really an fsid. It is a dev_t in all versions of FreeBSD, so it is too small to hold the 64-bit fsids created by vfs_getnewfsid() in all versions of FreeBSD except recent versions where dev_t is also 64 bits -- it works accidentally for them. > (A) FSIDs themselves have poor initial distribution and *very* few > unique bits (i.e., for filesystems that use vfs_getnewfsid, the int32 > fsid val[1] will be identical across all filesystems of the same type, > such as NFS mounts). The remaining val[0] is unique due to an > incrementing global counter. val[1] is useless unless there are more than 256 fs types, since it is already in the top 8 bits of the 24-bit minor number in fsid.val[0], except on systems with 8-bit minor numbers and in buggy versions of nfs that truncate the minor number to 8 bits (very old versions of oldnfs and not so old versions of newnfs). val[1] should never be used since it gets truncated away in many cases. val[0] has 16 bits of uniqueness from the counter, 8 mostly wasted for distinguishing the fs type, and another 8 bits even more mostly wasted for distinguishing the device type (255 for all synthetic devices was to stay away from major numbers 0-254 for physical devices, but devfs made all major numbers for physical devices 0). In some versions of FreeBSD, changing makedev() broke the uniqueness of val[0] by shifting the major bits into obvlivion (so synthetic devices were not distinguishable from physical devices, and since physical devices uses a dense set of minor numbers, collisions were likely). The difference between servers and clients' use of fsid is confusing, and I probably got some of the above wrong by conflating them. Clients should generate fsids unique in 32 bits. Using vfs_getnewfsid() and ignoring val[1] almost does this. This is needed to support applications using compat syscalls with 32-bit dev_t in st_dev and other places. Other file systems should also avoid actually using 64-bit dev_t for the same reason. Using devfs almost does this. It might be possible to generated more than 2**24 ptys so that the minor numbers start using the top 32 bits, but this is foot-shooting and hopefully disk devices are already allocated with low minor numbers. Devices which can back file systems should be in a different namespace (e.g., major == 1) to avoid this problem. I think device numbers from servers are only visible to clients for special files. Uniqueness then is not so important, but I found that changing makedev() broke it by noticing that server device numbers were truncated on clients. If the server fsid_t or dev_t is larger than the client fsid_t or dev_t, then there is no way that the client can represent all server numbers. Hashing can work for sparse numbers, but is complicated. Servers should also not actually use 64-bit dev_t. Bruce