From owner-dev-commits-src-branches@freebsd.org Sat Apr 10 06:02:13 2021 Return-Path: Delivered-To: dev-commits-src-branches@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 013AC5CFAAE; Sat, 10 Apr 2021 06:02:13 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4FHPXh4wBgz4RCn; Sat, 10 Apr 2021 06:02:12 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org (gitrepo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 2585E3FA0; Sat, 10 Apr 2021 06:02:12 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org ([127.0.1.44]) by gitrepo.freebsd.org (8.16.1/8.16.1) with ESMTP id 13A62CB4015311; Sat, 10 Apr 2021 06:02:12 GMT (envelope-from git@gitrepo.freebsd.org) Received: (from git@localhost) by gitrepo.freebsd.org (8.16.1/8.16.1/Submit) id 13A62Cjv015310; Sat, 10 Apr 2021 06:02:12 GMT (envelope-from git) Date: Sat, 10 Apr 2021 06:02:12 GMT Message-Id: <202104100602.13A62Cjv015310@gitrepo.freebsd.org> To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-branches@FreeBSD.org From: Mateusz Guzik Subject: git: d8aa5ff4f606 - stable/13 - cache: add high level overview MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Git-Committer: mjg X-Git-Repository: src X-Git-Refname: refs/heads/stable/13 X-Git-Reftype: branch X-Git-Commit: d8aa5ff4f606788fdbbfd032ebbe2966c1fb4399 Auto-Submitted: auto-generated X-BeenThere: dev-commits-src-branches@freebsd.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Commits to the stable branches of the FreeBSD src repository List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 10 Apr 2021 06:02:13 -0000 The branch stable/13 has been updated by mjg: URL: https://cgit.FreeBSD.org/src/commit/?id=d8aa5ff4f606788fdbbfd032ebbe2966c1fb4399 commit d8aa5ff4f606788fdbbfd032ebbe2966c1fb4399 Author: Mateusz Guzik AuthorDate: 2021-02-11 15:39:28 +0000 Commit: Mateusz Guzik CommitDate: 2021-04-10 05:57:56 +0000 cache: add high level overview Differential Revision: https://reviews.freebsd.org/D28675 (cherry picked from commit f79bd71def7a03b3a1b043cae7b908b36a05f41c) --- sys/kern/vfs_cache.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 125 insertions(+) diff --git a/sys/kern/vfs_cache.c b/sys/kern/vfs_cache.c index ad91fb1686a4..a3af949597af 100644 --- a/sys/kern/vfs_cache.c +++ b/sys/kern/vfs_cache.c @@ -82,6 +82,131 @@ __FBSDID("$FreeBSD$"); #include +/* + * High level overview of name caching in the VFS layer. + * + * Originally caching was implemented as part of UFS, later extracted to allow + * use by other filesystems. A decision was made to make it optional and + * completely detached from the rest of the kernel, which comes with limitations + * outlined near the end of this comment block. + * + * This fundamental choice needs to be revisited. In the meantime, the current + * state is described below. Significance of all notable routines is explained + * in comments placed above their implementation. Scattered thoroughout the + * file are TODO comments indicating shortcomings which can be fixed without + * reworking everything (most of the fixes will likely be reusable). Various + * details are omitted from this explanation to not clutter the overview, they + * have to be checked by reading the code and associated commentary. + * + * Keep in mind that it's individual path components which are cached, not full + * paths. That is, for a fully cached path "foo/bar/baz" there are 3 entries, + * one for each name. + * + * I. Data organization + * + * Entries are described by "struct namecache" objects and stored in a hash + * table. See cache_get_hash for more information. + * + * "struct vnode" contains pointers to source entries (names which can be found + * when traversing through said vnode), destination entries (names of that + * vnode (see "Limitations" for a breakdown on the subject) and a pointer to + * the parent vnode. + * + * The (directory vnode; name) tuple reliably determines the target entry if + * it exists. + * + * Since there are no small locks at this time (all are 32 bytes in size on + * LP64), the code works around the problem by introducing lock arrays to + * protect hash buckets and vnode lists. + * + * II. Filesystem integration + * + * Filesystems participating in name caching do the following: + * - set vop_lookup routine to vfs_cache_lookup + * - set vop_cachedlookup to whatever can perform the lookup if the above fails + * - if they support lockless lookup (see below), vop_fplookup_vexec and + * vop_fplookup_symlink are set along with the MNTK_FPLOOKUP flag on the + * mount point + * - call cache_purge or cache_vop_* routines to eliminate stale entries as + * applicable + * - call cache_enter to add entries depending on the MAKEENTRY flag + * + * With the above in mind, there are 2 entry points when doing lookups: + * - ... -> namei -> cache_fplookup -- this is the default + * - ... -> VOP_LOOKUP -> vfs_cache_lookup -- normally only called by namei + * should the above fail + * + * Example code flow how an entry is added: + * ... -> namei -> cache_fplookup -> cache_fplookup_noentry -> VOP_LOOKUP -> + * vfs_cache_lookup -> VOP_CACHEDLOOKUP -> ufs_lookup_ino -> cache_enter + * + * III. Performance considerations + * + * For lockless case forward lookup avoids any writes to shared areas apart + * from the terminal path component. In other words non-modifying lookups of + * different files don't suffer any scalability problems in the namecache. + * Looking up the same file is limited by VFS and goes beyond the scope of this + * file. + * + * At least on amd64 the single-threaded bottleneck for long paths is hashing + * (see cache_get_hash). There are cases where the code issues acquire fence + * multiple times, they can be combined on architectures which suffer from it. + * + * For locked case each encountered vnode has to be referenced and locked in + * order to be handed out to the caller (normally that's namei). This + * introduces significant hit single-threaded and serialization multi-threaded. + * + * Reverse lookup (e.g., "getcwd") fully scales provided it is fully cached -- + * avoids any writes to shared areas to any components. + * + * Unrelated insertions are partially serialized on updating the global entry + * counter and possibly serialized on colliding bucket or vnode locks. + * + * IV. Observability + * + * Note not everything has an explicit dtrace probe nor it should have, thus + * some of the one-liners below depend on implementation details. + * + * Examples: + * + * # Check what lookups failed to be handled in a lockless manner. Column 1 is + * # line number, column 2 is status code (see cache_fpl_status) + * dtrace -n 'vfs:fplookup:lookup:done { @[arg1, arg2] = count(); }' + * + * # Lengths of names added by binary name + * dtrace -n 'fbt::cache_enter_time:entry { @[execname] = quantize(args[2]->cn_namelen); }' + * + * # Same as above but only those which exceed 64 characters + * dtrace -n 'fbt::cache_enter_time:entry /args[2]->cn_namelen > 64/ { @[execname] = quantize(args[2]->cn_namelen); }' + * + * # Who is performing lookups with spurious slashes (e.g., "foo//bar") and what + * # path is it + * dtrace -n 'fbt::cache_fplookup_skip_slashes:entry { @[execname, stringof(args[0]->cnp->cn_pnbuf)] = count(); }' + * + * V. Limitations and implementation defects + * + * - since it is possible there is no entry for an open file, tools like + * "procstat" may fail to resolve fd -> vnode -> path to anything + * - even if a filesystem adds an entry, it may get purged (e.g., due to memory + * shortage) in which case the above problem applies + * - hardlinks are not tracked, thus if a vnode is reachable in more than one + * way, resolving a name may return a different path than the one used to + * open it (even if said path is still valid) + * - by default entries are not added for newly created files + * - adding an entry may need to evict negative entry first, which happens in 2 + * distinct places (evicting on lookup, adding in a later VOP) making it + * impossible to simply reuse it + * - there is a simple scheme to evict negative entries as the cache is approaching + * its capacity, but it is very unclear if doing so is a good idea to begin with + * - vnodes are subject to being recycled even if target inode is left in memory, + * which loses the name cache entries when it perhaps should not. in case of tmpfs + * names get duplicated -- kept by filesystem itself and namecache separately + * - struct namecache has a fixed size and comes in 2 variants, often wasting space. + * now hard to replace with malloc due to dependence on SMR. + * - lack of better integration with the kernel also turns nullfs into a layered + * filesystem instead of something which can take advantage of caching + */ + static SYSCTL_NODE(_vfs, OID_AUTO, cache, CTLFLAG_RW | CTLFLAG_MPSAFE, 0, "Name cache");