From: Mateusz Guzik <mjg@FreeBSD.org>
Date: Sat, 31 Dec 2016 12:32:50 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org,
    svn-src-stable@freebsd.org, svn-src-stable-11@freebsd.org
Subject: svn commit: r310959 - in stable/11/sys:
    cddl/contrib/opensolaris/uts/common/fs/zfs kern sys
Message-Id: <201612311232.uBVCWoiu079811@repo.freebsd.org>

Author: mjg
Date: Sat Dec 31 12:32:50 2016
New Revision: 310959
URL: https://svnweb.freebsd.org/changeset/base/310959

Log:
  MFC r305378,r305379,r305386,r305684,r306224,r306608,r306803,r307650,r307685,
  r308407,r308665,r308667,r309067:

  cache: put all negative entry management code into dedicated functions

  ==

  cache: manage negative entry list with a dedicated lock

  Since negative entries are managed with an LRU list, a hit requires a
  modification. Currently the code tries to upgrade the global lock if
  needed and is forced to retry the lookup if the upgrade fails.

  Provide a dedicated lock for use when the cache is only shared-locked.

  ==

  cache: defer freeing entries until after the global lock is dropped

  This also defers vdrop for held vnodes.

  ==

  cache: improve scalability by introducing bucket locks

  An array of bucket locks is added. All modifications still require the
  global cache_lock to be held for writing. However, most readers only need
  the relevant bucket lock and in effect can run concurrently with the
  writer as long as they use a different lock. See the added comment for
  more details.

  This is an intermediate step towards removal of the global lock.

  ==

  cache: get rid of the global lock

  Add a table of vnode locks and use them along with bucket locks to provide
  concurrent modification support. The approach taken is to preserve the
  current behaviour of the namecache and just lock all relevant parts before
  any changes are made.

  Lookups still require the relevant bucket to be locked.

  ==

  cache: ignore purgevfs requests for filesystems with few vnodes

  purgevfs is purely optional and induces lock contention in workloads
  which frequently mount and unmount filesystems.
  In particular, poudriere will do this for filesystems with 4 vnodes or
  fewer. A full cache scan is clearly wasteful.

  Since there is no explicit counter for namecache entries, the number of
  vnodes used by the target fs is checked instead. The default limit is the
  number of bucket locks.

  ==

  (by kib)
  Limit the scope of the optimization in r306608 to the dounmount() caller
  only. Other uses of cache_purgevfs() do rely on the cache purge for
  correct operation when paths are invalidated without unmount.

  ==

  cache: split negative entry LRU into multiple lists

  This splits the ncneg_mtx lock while preserving the hit ratio, at least
  during buildworld.

  Create N dedicated lists for new negative entries. Entries with at least
  one hit get promoted to the hot list, where they get requeued every M
  hits.

  Shrinking demotes one hot entry and performs a round-robin shrinking of
  the regular lists.

  ==

  cache: fix up a corner case in r307650

  If no negative entry is found on the last list, the ncp pointer will be
  left uninitialized and a non-NULL value will make the function assume an
  entry was found. Fix the problem by initializing it to NULL on entry.

  ==

  (by kib)
  vn_fullpath1() checked VV_ROOT and then dereferenced
  vp->v_mount->mnt_vnodecovered unlocked, which allowed unmount to race.
  Lock the vnode after noticing the VV_ROOT flag. See the comments for an
  explanation of why the unlocked check for the flag is considered safe.

  ==

  cache: fix a race between entry removal and demotion

  The negative list shrinker can demote an entry with only the hot list +
  neglist locks held. On the other hand, entry removal can set NCF_DVDROP
  without the aforementioned locks held prior to detaching the entry from
  the respective neglist, which can lose the update made by the shrinker.

  ==

  cache: plug a write-only variable in cache_negative_zap_one

  ==

  cache: ensure that the number of bucket locks does not exceed the hash size

  The hash size can be changed as a side effect of modifying kern.maxvnodes.
  Since numbucketlocks was not adjusted, setting a sufficiently low value
  would give more locks than actual buckets, which would then lead to
  corruption. Force the number of buckets to be at least the number of
  bucket locks.

  Note this should not matter for real-world cases.
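
For reference, the deadlock-avoidance idiom used once the global lock is gone
boils down to "sort by address, then trylock": whenever two vnode mutexes must
be held together, the lower-addressed lock is taken first, and a failed trylock
is resolved by unlocking, locking everything in order and revalidating. The
helper below is an illustrative sketch of that idiom only (lock_vnodes_sketch
is a made-up name); the committed code spreads the same logic across
cache_sort() and cache_trylock_vnodes() in the diff below.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/*
 * Illustrative only: take two vnode mutexes in canonical (address) order.
 * Either pointer may be NULL. Returns 0 on success and EAGAIN if a trylock
 * failed, in which case the caller unlocks everything, locks in order and
 * revalidates the namecache entry, as described in the comment added to
 * vfs_cache.c.
 */
static int
lock_vnodes_sketch(struct mtx *vlp1, struct mtx *vlp2)
{
	struct mtx *tmp;

	if (vlp1 > vlp2) {		/* lower address locks first */
		tmp = vlp1;
		vlp1 = vlp2;
		vlp2 = tmp;
	}
	MPASS(vlp2 != NULL);
	if (vlp1 != NULL && !mtx_trylock(vlp1))
		return (EAGAIN);
	if (!mtx_trylock(vlp2)) {
		if (vlp1 != NULL)
			mtx_unlock(vlp1);
		return (EAGAIN);
	}
	return (0);
}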
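
The split negative-entry LRU described above works as follows: a new negative
entry is hashed onto one of several cold lists; its first hit promotes it to
the single hot list; once hot, it is only requeued every ncneghitsrequeue hits;
the shrinker demotes the hot-list head and evicts from the cold lists
round-robin. A simplified sketch of the hit path follows, assuming the
declarations this commit adds to sys/kern/vfs_cache.c (struct neglist,
ncneg_hot, NCP2NEGLIST(), NCF_HOTNEGATIVE, nc_neghits); the function name is
made up here.

/*
 * Simplified sketch; the committed cache_negative_hit() additionally
 * rechecks NCF_HOTNEGATIVE after taking the hot-list lock to cope with a
 * concurrent demotion by the shrinker.
 */
static void
cache_negative_hit_sketch(struct namecache *ncp)
{
	struct neglist *neglist;
	u_int hits;

	hits = atomic_fetchadd_int(&ncp->nc_neghits, 1);
	if (ncp->nc_flag & NCF_HOTNEGATIVE) {
		/* Already hot: requeue only every ncneghitsrequeue hits. */
		if ((hits % ncneghitsrequeue) != 0)
			return;
		mtx_lock(&ncneg_hot.nl_lock);
		TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst);
		TAILQ_INSERT_TAIL(&ncneg_hot.nl_list, ncp, nc_dst);
		mtx_unlock(&ncneg_hot.nl_lock);
		return;
	}
	/* First hit: move from the cold list to the tail of the hot list. */
	neglist = NCP2NEGLIST(ncp);
	mtx_lock(&ncneg_hot.nl_lock);
	mtx_lock(&neglist->nl_lock);
	TAILQ_REMOVE(&neglist->nl_list, ncp, nc_dst);
	TAILQ_INSERT_TAIL(&ncneg_hot.nl_list, ncp, nc_dst);
	ncp->nc_flag |= NCF_HOTNEGATIVE;
	mtx_unlock(&neglist->nl_lock);
	mtx_unlock(&ncneg_hot.nl_lock);
}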
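
The purgevfs optimization itself amounts to a single early return at the top
of cache_purgevfs(). That check falls outside the truncated diff below, so the
fragment here is a hedged reconstruction from the log and from the new
two-argument cache_purgevfs(mp, force) call visible in the zfs_vfsops.c hunk:

/*
 * Hedged reconstruction, not copied from the diff: skip the full scan for
 * filesystems using no more vnodes than ncpurgeminvnodes (which defaults to
 * the number of bucket locks) unless the caller forces the purge. Per the
 * follow-up by kib, only the dounmount() path may take the shortcut; every
 * other caller, such as the zfs teardown below, passes force == true.
 */
void
cache_purgevfs(struct mount *mp, bool force)
{

	if (!force && mp->mnt_nvnodelistsize <= ncpurgeminvnodes)
		return;
	/* ... scan the cache and zap every entry belonging to mp ... */
}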
Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c stable/11/sys/kern/subr_witness.c stable/11/sys/kern/vfs_cache.c stable/11/sys/kern/vfs_mount.c stable/11/sys/kern/vfs_mountroot.c stable/11/sys/sys/vnode.h Directory Properties: stable/11/ (props changed) Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c Sat Dec 31 12:30:14 2016 (r310958) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c Sat Dec 31 12:32:50 2016 (r310959) @@ -1843,7 +1843,7 @@ zfsvfs_teardown(zfsvfs_t *zfsvfs, boolea */ (void) dnlc_purge_vfsp(zfsvfs->z_parent->z_vfs, 0); #ifdef FREEBSD_NAMECACHE - cache_purgevfs(zfsvfs->z_parent->z_vfs); + cache_purgevfs(zfsvfs->z_parent->z_vfs, true); #endif } Modified: stable/11/sys/kern/subr_witness.c ============================================================================== --- stable/11/sys/kern/subr_witness.c Sat Dec 31 12:30:14 2016 (r310958) +++ stable/11/sys/kern/subr_witness.c Sat Dec 31 12:32:50 2016 (r310959) @@ -623,6 +623,14 @@ static struct witness_order_list_entry o { "vnode interlock", &lock_class_mtx_sleep }, { NULL, NULL }, /* + * VFS namecache + */ + { "ncvn", &lock_class_mtx_sleep }, + { "ncbuc", &lock_class_rw }, + { "vnode interlock", &lock_class_mtx_sleep }, + { "ncneg", &lock_class_mtx_sleep }, + { NULL, NULL }, + /* * ZFS locking */ { "dn->dn_mtx", &lock_class_sx }, Modified: stable/11/sys/kern/vfs_cache.c ============================================================================== --- stable/11/sys/kern/vfs_cache.c Sat Dec 31 12:30:14 2016 (r310958) +++ stable/11/sys/kern/vfs_cache.c Sat Dec 31 12:32:50 2016 (r310959) @@ -51,6 +51,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include #include @@ -83,8 +84,10 @@ SDT_PROBE_DEFINE1(vfs, namecache, purge_ SDT_PROBE_DEFINE1(vfs, namecache, purgevfs, done, "struct mount *"); SDT_PROBE_DEFINE3(vfs, namecache, zap, done, "struct vnode *", "char *", "struct vnode *"); -SDT_PROBE_DEFINE2(vfs, namecache, zap_negative, done, "struct vnode *", - "char *"); +SDT_PROBE_DEFINE3(vfs, namecache, zap_negative, done, "struct vnode *", + "char *", "int"); +SDT_PROBE_DEFINE3(vfs, namecache, shrink_negative, done, "struct vnode *", + "char *", "int"); /* * This structure describes the elements in the cache of recent @@ -96,7 +99,10 @@ struct namecache { LIST_ENTRY(namecache) nc_src; /* source vnode list */ TAILQ_ENTRY(namecache) nc_dst; /* destination vnode list */ struct vnode *nc_dvp; /* vnode of parent of name */ - struct vnode *nc_vp; /* vnode the name refers to */ + union { + struct vnode *nu_vp; /* vnode the name refers to */ + u_int nu_neghits; /* negative entry hits */ + } n_un; u_char nc_flag; /* flag bits */ u_char nc_nlen; /* length of name */ char nc_name[0]; /* segment name + nul */ @@ -115,7 +121,10 @@ struct namecache_ts { LIST_ENTRY(namecache) nc_src; /* source vnode list */ TAILQ_ENTRY(namecache) nc_dst; /* destination vnode list */ struct vnode *nc_dvp; /* vnode of parent of name */ - struct vnode *nc_vp; /* vnode the name refers to */ + union { + struct vnode *nu_vp; /* vnode the name refers to */ + u_int nu_neghits; /* negative entry hits */ + } n_un; u_char nc_flag; /* flag bits */ u_char nc_nlen; /* length of name */ struct timespec nc_time; /* timespec provided by fs */ @@ -124,6 +133,9 @@ struct namecache_ts { char nc_name[0]; /* segment name + 
nul */ }; +#define nc_vp n_un.nu_vp +#define nc_neghits n_un.nu_neghits + /* * Flags in namecache.nc_flag */ @@ -131,6 +143,9 @@ struct namecache_ts { #define NCF_ISDOTDOT 0x02 #define NCF_TS 0x04 #define NCF_DTS 0x08 +#define NCF_DVDROP 0x10 +#define NCF_NEGATIVE 0x20 +#define NCF_HOTNEGATIVE 0x40 /* * Name caching works as follows: @@ -147,6 +162,37 @@ struct namecache_ts { * Upon reaching the last segment of a path, if the reference * is for DELETE, or NOCACHE is set (rewrite), and the * name is located in the cache, it will be dropped. + * + * These locks are used (in the order in which they can be taken): + * NAME TYPE ROLE + * vnodelock mtx vnode lists and v_cache_dd field protection + * bucketlock rwlock for access to given set of hash buckets + * neglist mtx negative entry LRU management + * + * Additionally, ncneg_shrink_lock mtx is used to have at most one thread + * shrinking the LRU list. + * + * It is legal to take multiple vnodelock and bucketlock locks. The locking + * order is lower address first. Both are recursive. + * + * "." lookups are lockless. + * + * ".." and vnode -> name lookups require vnodelock. + * + * name -> vnode lookup requires the relevant bucketlock to be held for reading. + * + * Insertions and removals of entries require involved vnodes and bucketlocks + * to be write-locked to prevent other threads from seeing the entry. + * + * Some lookups result in removal of the found entry (e.g. getting rid of a + * negative entry with the intent to create a positive one), which poses a + * problem when multiple threads reach the state. Similarly, two different + * threads can purge two different vnodes and try to remove the same name. + * + * If the already held vnode lock is lower than the second required lock, we + * can just take the other lock. However, in the opposite case, this could + * deadlock. As such, this is resolved by trylocking and if that fails unlocking + * the first node, locking everything in order and revalidating the state. 
*/ /* @@ -155,7 +201,6 @@ struct namecache_ts { #define NCHHASH(hash) \ (&nchashtbl[(hash) & nchash]) static LIST_HEAD(nchashhead, namecache) *nchashtbl; /* Hash Table */ -static TAILQ_HEAD(, namecache) ncneg; /* Hash Table */ static u_long nchash; /* size of hash table */ SYSCTL_ULONG(_debug, OID_AUTO, nchash, CTLFLAG_RD, &nchash, 0, "Size of namecache hash table"); @@ -174,17 +219,54 @@ SYSCTL_ULONG(_debug, OID_AUTO, numcacheh u_int ncsizefactor = 2; SYSCTL_UINT(_vfs, OID_AUTO, ncsizefactor, CTLFLAG_RW, &ncsizefactor, 0, "Size factor for namecache"); +static u_int ncpurgeminvnodes; +SYSCTL_UINT(_vfs, OID_AUTO, ncpurgeminvnodes, CTLFLAG_RW, &ncpurgeminvnodes, 0, + "Number of vnodes below which purgevfs ignores the request"); +static u_int ncneghitsrequeue = 8; +SYSCTL_UINT(_vfs, OID_AUTO, ncneghitsrequeue, CTLFLAG_RW, &ncneghitsrequeue, 0, + "Number of hits to requeue a negative entry in the LRU list"); struct nchstats nchstats; /* cache effectiveness statistics */ -static struct rwlock cache_lock; -RW_SYSINIT(vfscache, &cache_lock, "Name Cache"); +static struct mtx ncneg_shrink_lock; +MTX_SYSINIT(vfscache_shrink_neg, &ncneg_shrink_lock, "Name Cache shrink neg", + MTX_DEF); + +struct neglist { + struct mtx nl_lock; + TAILQ_HEAD(, namecache) nl_list; +} __aligned(CACHE_LINE_SIZE); + +static struct neglist *neglists; +static struct neglist ncneg_hot; -#define CACHE_UPGRADE_LOCK() rw_try_upgrade(&cache_lock) -#define CACHE_RLOCK() rw_rlock(&cache_lock) -#define CACHE_RUNLOCK() rw_runlock(&cache_lock) -#define CACHE_WLOCK() rw_wlock(&cache_lock) -#define CACHE_WUNLOCK() rw_wunlock(&cache_lock) +static int shrink_list_turn; + +static u_int numneglists; +static inline struct neglist * +NCP2NEGLIST(struct namecache *ncp) +{ + + return (&neglists[(((uintptr_t)(ncp) >> 8) % numneglists)]); +} + +static u_int numbucketlocks; +static struct rwlock_padalign *bucketlocks; +#define HASH2BUCKETLOCK(hash) \ + ((struct rwlock *)(&bucketlocks[((hash) % numbucketlocks)])) + +static u_int numvnodelocks; +static struct mtx *vnodelocks; +static inline struct mtx * +VP2VNODELOCK(struct vnode *vp) +{ + struct mtx *vlp; + + if (vp == NULL) + return (NULL); + vlp = &vnodelocks[(((uintptr_t)(vp) >> 8) % numvnodelocks)]; + return (vlp); +} /* * UMA zones for the VFS cache. 
@@ -224,6 +306,8 @@ cache_free(struct namecache *ncp) if (ncp == NULL) return; ts = ncp->nc_flag & NCF_TS; + if ((ncp->nc_flag & NCF_DVDROP) != 0) + vdrop(ncp->nc_dvp); if (ncp->nc_nlen <= CACHE_PATH_CUTOFF) { if (ts) uma_zfree(cache_zone_small_ts, ncp); @@ -299,17 +383,49 @@ STATNODE_COUNTER(numfullpathfail2, "Number of fullpath search errors (VOP_VPTOCNP failures)"); STATNODE_COUNTER(numfullpathfail4, "Number of fullpath search errors (ENOMEM)"); STATNODE_COUNTER(numfullpathfound, "Number of successful fullpath calls"); -static long numupgrades; STATNODE_ULONG(numupgrades, - "Number of updates of the cache after lookup (write lock + retry)"); +static long zap_and_exit_bucket_fail; STATNODE_ULONG(zap_and_exit_bucket_fail, + "Number of times zap_and_exit failed to lock"); +static long cache_lock_vnodes_cel_3_failures; +STATNODE_ULONG(cache_lock_vnodes_cel_3_failures, + "Number of times 3-way vnode locking failed"); -static void cache_zap(struct namecache *ncp); -static int vn_vptocnp_locked(struct vnode **vp, struct ucred *cred, char *buf, - u_int *buflen); +static void cache_zap_locked(struct namecache *ncp, bool neg_locked); static int vn_fullpath1(struct thread *td, struct vnode *vp, struct vnode *rdir, char *buf, char **retbuf, u_int buflen); static MALLOC_DEFINE(M_VFSCACHE, "vfscache", "VFS name cache entries"); +static int cache_yield; +SYSCTL_INT(_vfs_cache, OID_AUTO, yield, CTLFLAG_RD, &cache_yield, 0, + "Number of times cache called yield"); + +static void +cache_maybe_yield(void) +{ + + if (should_yield()) { + cache_yield++; + kern_yield(PRI_USER); + } +} + +static inline void +cache_assert_vlp_locked(struct mtx *vlp) +{ + + if (vlp != NULL) + mtx_assert(vlp, MA_OWNED); +} + +static inline void +cache_assert_vnode_locked(struct vnode *vp) +{ + struct mtx *vlp; + + vlp = VP2VNODELOCK(vp); + cache_assert_vlp_locked(vlp); +} + static uint32_t cache_get_hash(char *name, u_char len, struct vnode *dvp) { @@ -320,6 +436,109 @@ cache_get_hash(char *name, u_char len, s return (hash); } +static inline struct rwlock * +NCP2BUCKETLOCK(struct namecache *ncp) +{ + uint32_t hash; + + hash = cache_get_hash(nc_get_name(ncp), ncp->nc_nlen, ncp->nc_dvp); + return (HASH2BUCKETLOCK(hash)); +} + +#ifdef INVARIANTS +static void +cache_assert_bucket_locked(struct namecache *ncp, int mode) +{ + struct rwlock *blp; + + blp = NCP2BUCKETLOCK(ncp); + rw_assert(blp, mode); +} +#else +#define cache_assert_bucket_locked(x, y) do { } while (0) +#endif + +#define cache_sort(x, y) _cache_sort((void **)(x), (void **)(y)) +static void +_cache_sort(void **p1, void **p2) +{ + void *tmp; + + if (*p1 > *p2) { + tmp = *p2; + *p2 = *p1; + *p1 = tmp; + } +} + +static void +cache_lock_all_buckets(void) +{ + u_int i; + + for (i = 0; i < numbucketlocks; i++) + rw_wlock(&bucketlocks[i]); +} + +static void +cache_unlock_all_buckets(void) +{ + u_int i; + + for (i = 0; i < numbucketlocks; i++) + rw_wunlock(&bucketlocks[i]); +} + +static void +cache_lock_all_vnodes(void) +{ + u_int i; + + for (i = 0; i < numvnodelocks; i++) + mtx_lock(&vnodelocks[i]); +} + +static void +cache_unlock_all_vnodes(void) +{ + u_int i; + + for (i = 0; i < numvnodelocks; i++) + mtx_unlock(&vnodelocks[i]); +} + +static int +cache_trylock_vnodes(struct mtx *vlp1, struct mtx *vlp2) +{ + + cache_sort(&vlp1, &vlp2); + MPASS(vlp2 != NULL); + + if (vlp1 != NULL) { + if (!mtx_trylock(vlp1)) + return (EAGAIN); + } + if (!mtx_trylock(vlp2)) { + if (vlp1 != NULL) + mtx_unlock(vlp1); + return (EAGAIN); + } + + return (0); +} + +static void 
+cache_unlock_vnodes(struct mtx *vlp1, struct mtx *vlp2) +{ + + MPASS(vlp1 != NULL || vlp2 != NULL); + + if (vlp1 != NULL) + mtx_unlock(vlp1); + if (vlp2 != NULL) + mtx_unlock(vlp2); +} + static int sysctl_nchstats(SYSCTL_HANDLER_ARGS) { @@ -361,9 +580,9 @@ retry: if (req->oldptr == NULL) return SYSCTL_OUT(req, 0, n_nchash * sizeof(int)); cntbuf = malloc(n_nchash * sizeof(int), M_TEMP, M_ZERO | M_WAITOK); - CACHE_RLOCK(); + cache_lock_all_buckets(); if (n_nchash != nchash + 1) { - CACHE_RUNLOCK(); + cache_unlock_all_buckets(); free(cntbuf, M_TEMP); goto retry; } @@ -371,7 +590,7 @@ retry: for (ncpp = nchashtbl, i = 0; i < n_nchash; ncpp++, i++) LIST_FOREACH(ncp, ncpp, nc_hash) cntbuf[i]++; - CACHE_RUNLOCK(); + cache_unlock_all_buckets(); for (error = 0, i = 0; i < n_nchash; i++) if ((error = SYSCTL_OUT(req, &cntbuf[i], sizeof(int))) != 0) break; @@ -394,7 +613,7 @@ sysctl_debug_hashstat_nchash(SYSCTL_HAND if (!req->oldptr) return SYSCTL_OUT(req, 0, 4 * sizeof(int)); - CACHE_RLOCK(); + cache_lock_all_buckets(); n_nchash = nchash + 1; /* nchash is max index, not count */ used = 0; maxlength = 0; @@ -411,7 +630,7 @@ sysctl_debug_hashstat_nchash(SYSCTL_HAND maxlength = count; } n_nchash = nchash + 1; - CACHE_RUNLOCK(); + cache_unlock_all_buckets(); pct = (used * 100) / (n_nchash / 100); error = SYSCTL_OUT(req, &n_nchash, sizeof(n_nchash)); if (error) @@ -433,49 +652,432 @@ SYSCTL_PROC(_debug_hashstat, OID_AUTO, n #endif /* - * cache_zap(): + * Negative entries management + * + * A variation of LRU scheme is used. New entries are hashed into one of + * numneglists cold lists. Entries get promoted to the hot list on first hit. + * Partial LRU for the hot list is maintained by requeueing them every + * ncneghitsrequeue hits. + * + * The shrinker will demote hot list head and evict from the cold list in a + * round-robin manner. + */ +static void +cache_negative_hit(struct namecache *ncp) +{ + struct neglist *neglist; + u_int hits; + + MPASS(ncp->nc_flag & NCF_NEGATIVE); + hits = atomic_fetchadd_int(&ncp->nc_neghits, 1); + if (ncp->nc_flag & NCF_HOTNEGATIVE) { + if ((hits % ncneghitsrequeue) != 0) + return; + mtx_lock(&ncneg_hot.nl_lock); + if (ncp->nc_flag & NCF_HOTNEGATIVE) { + TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst); + TAILQ_INSERT_TAIL(&ncneg_hot.nl_list, ncp, nc_dst); + mtx_unlock(&ncneg_hot.nl_lock); + return; + } + /* + * The shrinker cleared the flag and removed the entry from + * the hot list. Put it back. 
+ */ + } else { + mtx_lock(&ncneg_hot.nl_lock); + } + neglist = NCP2NEGLIST(ncp); + mtx_lock(&neglist->nl_lock); + if (!(ncp->nc_flag & NCF_HOTNEGATIVE)) { + TAILQ_REMOVE(&neglist->nl_list, ncp, nc_dst); + TAILQ_INSERT_TAIL(&ncneg_hot.nl_list, ncp, nc_dst); + ncp->nc_flag |= NCF_HOTNEGATIVE; + } + mtx_unlock(&neglist->nl_lock); + mtx_unlock(&ncneg_hot.nl_lock); +} + +static void +cache_negative_insert(struct namecache *ncp, bool neg_locked) +{ + struct neglist *neglist; + + MPASS(ncp->nc_flag & NCF_NEGATIVE); + cache_assert_bucket_locked(ncp, RA_WLOCKED); + neglist = NCP2NEGLIST(ncp); + if (!neg_locked) { + mtx_lock(&neglist->nl_lock); + } else { + mtx_assert(&neglist->nl_lock, MA_OWNED); + } + TAILQ_INSERT_TAIL(&neglist->nl_list, ncp, nc_dst); + if (!neg_locked) + mtx_unlock(&neglist->nl_lock); + atomic_add_rel_long(&numneg, 1); +} + +static void +cache_negative_remove(struct namecache *ncp, bool neg_locked) +{ + struct neglist *neglist; + bool hot_locked = false; + bool list_locked = false; + + MPASS(ncp->nc_flag & NCF_NEGATIVE); + cache_assert_bucket_locked(ncp, RA_WLOCKED); + neglist = NCP2NEGLIST(ncp); + if (!neg_locked) { + if (ncp->nc_flag & NCF_HOTNEGATIVE) { + hot_locked = true; + mtx_lock(&ncneg_hot.nl_lock); + if (!(ncp->nc_flag & NCF_HOTNEGATIVE)) { + list_locked = true; + mtx_lock(&neglist->nl_lock); + } + } else { + list_locked = true; + mtx_lock(&neglist->nl_lock); + } + } else { + mtx_assert(&neglist->nl_lock, MA_OWNED); + mtx_assert(&ncneg_hot.nl_lock, MA_OWNED); + } + if (ncp->nc_flag & NCF_HOTNEGATIVE) { + TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst); + } else { + TAILQ_REMOVE(&neglist->nl_list, ncp, nc_dst); + } + if (list_locked) + mtx_unlock(&neglist->nl_lock); + if (hot_locked) + mtx_unlock(&ncneg_hot.nl_lock); + atomic_subtract_rel_long(&numneg, 1); +} + +static void +cache_negative_shrink_select(int start, struct namecache **ncpp, + struct neglist **neglistpp) +{ + struct neglist *neglist; + struct namecache *ncp; + int i; + + *ncpp = ncp = NULL; + + for (i = start; i < numneglists; i++) { + neglist = &neglists[i]; + if (TAILQ_FIRST(&neglist->nl_list) == NULL) + continue; + mtx_lock(&neglist->nl_lock); + ncp = TAILQ_FIRST(&neglist->nl_list); + if (ncp != NULL) + break; + mtx_unlock(&neglist->nl_lock); + } + + *neglistpp = neglist; + *ncpp = ncp; +} + +static void +cache_negative_zap_one(void) +{ + struct namecache *ncp, *ncp2; + struct neglist *neglist; + struct mtx *dvlp; + struct rwlock *blp; + + if (!mtx_trylock(&ncneg_shrink_lock)) + return; + + mtx_lock(&ncneg_hot.nl_lock); + ncp = TAILQ_FIRST(&ncneg_hot.nl_list); + if (ncp != NULL) { + neglist = NCP2NEGLIST(ncp); + mtx_lock(&neglist->nl_lock); + TAILQ_REMOVE(&ncneg_hot.nl_list, ncp, nc_dst); + TAILQ_INSERT_TAIL(&neglist->nl_list, ncp, nc_dst); + ncp->nc_flag &= ~NCF_HOTNEGATIVE; + mtx_unlock(&neglist->nl_lock); + } + + cache_negative_shrink_select(shrink_list_turn, &ncp, &neglist); + shrink_list_turn++; + if (shrink_list_turn == numneglists) + shrink_list_turn = 0; + if (ncp == NULL && shrink_list_turn == 0) + cache_negative_shrink_select(shrink_list_turn, &ncp, &neglist); + if (ncp == NULL) { + mtx_unlock(&ncneg_hot.nl_lock); + goto out; + } + + MPASS(ncp->nc_flag & NCF_NEGATIVE); + dvlp = VP2VNODELOCK(ncp->nc_dvp); + blp = NCP2BUCKETLOCK(ncp); + mtx_unlock(&neglist->nl_lock); + mtx_unlock(&ncneg_hot.nl_lock); + mtx_lock(dvlp); + rw_wlock(blp); + mtx_lock(&ncneg_hot.nl_lock); + mtx_lock(&neglist->nl_lock); + ncp2 = TAILQ_FIRST(&neglist->nl_list); + if (ncp != ncp2 || dvlp != VP2VNODELOCK(ncp2->nc_dvp) || + 
blp != NCP2BUCKETLOCK(ncp2) || !(ncp2->nc_flag & NCF_NEGATIVE)) { + ncp = NULL; + goto out_unlock_all; + } + SDT_PROBE3(vfs, namecache, shrink_negative, done, ncp->nc_dvp, + nc_get_name(ncp), ncp->nc_neghits); + + cache_zap_locked(ncp, true); +out_unlock_all: + mtx_unlock(&neglist->nl_lock); + mtx_unlock(&ncneg_hot.nl_lock); + rw_wunlock(blp); + mtx_unlock(dvlp); +out: + mtx_unlock(&ncneg_shrink_lock); + cache_free(ncp); +} + +/* + * cache_zap_locked(): * * Removes a namecache entry from cache, whether it contains an actual * pointer to a vnode or if it is just a negative cache entry. */ static void -cache_zap(struct namecache *ncp) +cache_zap_locked(struct namecache *ncp, bool neg_locked) { - struct vnode *vp; - rw_assert(&cache_lock, RA_WLOCKED); - CTR2(KTR_VFS, "cache_zap(%p) vp %p", ncp, ncp->nc_vp); - if (ncp->nc_vp != NULL) { + if (!(ncp->nc_flag & NCF_NEGATIVE)) + cache_assert_vnode_locked(ncp->nc_vp); + cache_assert_vnode_locked(ncp->nc_dvp); + cache_assert_bucket_locked(ncp, RA_WLOCKED); + + CTR2(KTR_VFS, "cache_zap(%p) vp %p", ncp, + (ncp->nc_flag & NCF_NEGATIVE) ? NULL : ncp->nc_vp); + if (!(ncp->nc_flag & NCF_NEGATIVE)) { SDT_PROBE3(vfs, namecache, zap, done, ncp->nc_dvp, nc_get_name(ncp), ncp->nc_vp); } else { - SDT_PROBE2(vfs, namecache, zap_negative, done, ncp->nc_dvp, - nc_get_name(ncp)); + SDT_PROBE3(vfs, namecache, zap_negative, done, ncp->nc_dvp, + nc_get_name(ncp), ncp->nc_neghits); } - vp = NULL; LIST_REMOVE(ncp, nc_hash); + if (!(ncp->nc_flag & NCF_NEGATIVE)) { + TAILQ_REMOVE(&ncp->nc_vp->v_cache_dst, ncp, nc_dst); + if (ncp == ncp->nc_vp->v_cache_dd) + ncp->nc_vp->v_cache_dd = NULL; + } else { + cache_negative_remove(ncp, neg_locked); + } if (ncp->nc_flag & NCF_ISDOTDOT) { if (ncp == ncp->nc_dvp->v_cache_dd) ncp->nc_dvp->v_cache_dd = NULL; } else { LIST_REMOVE(ncp, nc_src); if (LIST_EMPTY(&ncp->nc_dvp->v_cache_src)) { - vp = ncp->nc_dvp; - numcachehv--; + ncp->nc_flag |= NCF_DVDROP; + atomic_subtract_rel_long(&numcachehv, 1); } } - if (ncp->nc_vp) { - TAILQ_REMOVE(&ncp->nc_vp->v_cache_dst, ncp, nc_dst); - if (ncp == ncp->nc_vp->v_cache_dd) - ncp->nc_vp->v_cache_dd = NULL; + atomic_subtract_rel_long(&numcache, 1); +} + +static void +cache_zap_negative_locked_vnode_kl(struct namecache *ncp, struct vnode *vp) +{ + struct rwlock *blp; + + MPASS(ncp->nc_dvp == vp); + MPASS(ncp->nc_flag & NCF_NEGATIVE); + cache_assert_vnode_locked(vp); + + blp = NCP2BUCKETLOCK(ncp); + rw_wlock(blp); + cache_zap_locked(ncp, false); + rw_wunlock(blp); +} + +static bool +cache_zap_locked_vnode_kl2(struct namecache *ncp, struct vnode *vp, + struct mtx **vlpp) +{ + struct mtx *pvlp, *vlp1, *vlp2, *to_unlock; + struct rwlock *blp; + + MPASS(vp == ncp->nc_dvp || vp == ncp->nc_vp); + cache_assert_vnode_locked(vp); + + if (ncp->nc_flag & NCF_NEGATIVE) { + if (*vlpp != NULL) { + mtx_unlock(*vlpp); + *vlpp = NULL; + } + cache_zap_negative_locked_vnode_kl(ncp, vp); + return (true); + } + + pvlp = VP2VNODELOCK(vp); + blp = NCP2BUCKETLOCK(ncp); + vlp1 = VP2VNODELOCK(ncp->nc_dvp); + vlp2 = VP2VNODELOCK(ncp->nc_vp); + + if (*vlpp == vlp1 || *vlpp == vlp2) { + to_unlock = *vlpp; + *vlpp = NULL; } else { - TAILQ_REMOVE(&ncneg, ncp, nc_dst); - numneg--; + if (*vlpp != NULL) { + mtx_unlock(*vlpp); + *vlpp = NULL; + } + cache_sort(&vlp1, &vlp2); + if (vlp1 == pvlp) { + mtx_lock(vlp2); + to_unlock = vlp2; + } else { + if (!mtx_trylock(vlp1)) + goto out_relock; + to_unlock = vlp1; + } + } + rw_wlock(blp); + cache_zap_locked(ncp, false); + rw_wunlock(blp); + if (to_unlock != NULL) + mtx_unlock(to_unlock); + 
return (true); + +out_relock: + mtx_unlock(vlp2); + mtx_lock(vlp1); + mtx_lock(vlp2); + MPASS(*vlpp == NULL); + *vlpp = vlp1; + return (false); +} + +static int +cache_zap_locked_vnode(struct namecache *ncp, struct vnode *vp) +{ + struct mtx *pvlp, *vlp1, *vlp2, *to_unlock; + struct rwlock *blp; + int error = 0; + + MPASS(vp == ncp->nc_dvp || vp == ncp->nc_vp); + cache_assert_vnode_locked(vp); + + pvlp = VP2VNODELOCK(vp); + if (ncp->nc_flag & NCF_NEGATIVE) { + cache_zap_negative_locked_vnode_kl(ncp, vp); + goto out; + } + + blp = NCP2BUCKETLOCK(ncp); + vlp1 = VP2VNODELOCK(ncp->nc_dvp); + vlp2 = VP2VNODELOCK(ncp->nc_vp); + cache_sort(&vlp1, &vlp2); + if (vlp1 == pvlp) { + mtx_lock(vlp2); + to_unlock = vlp2; + } else { + if (!mtx_trylock(vlp1)) { + error = EAGAIN; + goto out; + } + to_unlock = vlp1; + } + rw_wlock(blp); + cache_zap_locked(ncp, false); + rw_wunlock(blp); + mtx_unlock(to_unlock); +out: + mtx_unlock(pvlp); + return (error); +} + +static int +cache_zap_rlocked_bucket(struct namecache *ncp, struct rwlock *blp) +{ + struct mtx *dvlp, *vlp; + + cache_assert_bucket_locked(ncp, RA_RLOCKED); + + dvlp = VP2VNODELOCK(ncp->nc_dvp); + vlp = NULL; + if (!(ncp->nc_flag & NCF_NEGATIVE)) + vlp = VP2VNODELOCK(ncp->nc_vp); + if (cache_trylock_vnodes(dvlp, vlp) == 0) { + rw_runlock(blp); + rw_wlock(blp); + cache_zap_locked(ncp, false); + rw_wunlock(blp); + cache_unlock_vnodes(dvlp, vlp); + return (0); + } + + rw_runlock(blp); + return (EAGAIN); +} + +static int +cache_zap_wlocked_bucket_kl(struct namecache *ncp, struct rwlock *blp, + struct mtx **vlpp1, struct mtx **vlpp2) +{ + struct mtx *dvlp, *vlp; + + cache_assert_bucket_locked(ncp, RA_WLOCKED); + + dvlp = VP2VNODELOCK(ncp->nc_dvp); + vlp = NULL; + if (!(ncp->nc_flag & NCF_NEGATIVE)) + vlp = VP2VNODELOCK(ncp->nc_vp); + cache_sort(&dvlp, &vlp); + + if (*vlpp1 == dvlp && *vlpp2 == vlp) { + cache_zap_locked(ncp, false); + cache_unlock_vnodes(dvlp, vlp); + *vlpp1 = NULL; + *vlpp2 = NULL; + return (0); + } + + if (*vlpp1 != NULL) + mtx_unlock(*vlpp1); + if (*vlpp2 != NULL) + mtx_unlock(*vlpp2); + *vlpp1 = NULL; + *vlpp2 = NULL; + + if (cache_trylock_vnodes(dvlp, vlp) == 0) { + cache_zap_locked(ncp, false); + cache_unlock_vnodes(dvlp, vlp); + return (0); + } + + rw_wunlock(blp); + *vlpp1 = dvlp; + *vlpp2 = vlp; + if (*vlpp1 != NULL) + mtx_lock(*vlpp1); + mtx_lock(*vlpp2); + rw_wlock(blp); + return (EAGAIN); +} + +static void +cache_lookup_unlock(struct rwlock *blp, struct mtx *vlp) +{ + + if (blp != NULL) { + rw_runlock(blp); + mtx_assert(vlp, MA_NOTOWNED); + } else { + mtx_unlock(vlp); } - numcache--; - cache_free(ncp); - if (vp != NULL) - vdrop(vp); } /* @@ -500,19 +1102,21 @@ cache_lookup(struct vnode *dvp, struct v struct timespec *tsp, int *ticksp) { struct namecache *ncp; + struct rwlock *blp; + struct mtx *dvlp, *dvlp2; uint32_t hash; - int error, ltype, wlocked; + int error, ltype; if (!doingcache) { cnp->cn_flags &= ~MAKEENTRY; return (0); } retry: - wlocked = 0; - counter_u64_add(numcalls, 1); + blp = NULL; + dvlp = VP2VNODELOCK(dvp); error = 0; + counter_u64_add(numcalls, 1); -retry_wlocked: if (cnp->cn_nameptr[0] == '.') { if (cnp->cn_namelen == 1) { *vpp = dvp; @@ -544,28 +1148,44 @@ retry_wlocked: } return (-1); } - if (!wlocked) - CACHE_RLOCK(); if (cnp->cn_namelen == 2 && cnp->cn_nameptr[1] == '.') { counter_u64_add(dotdothits, 1); - if (dvp->v_cache_dd == NULL) { + dvlp2 = NULL; + mtx_lock(dvlp); +retry_dotdot: + ncp = dvp->v_cache_dd; + if (ncp == NULL) { SDT_PROBE3(vfs, namecache, lookup, miss, dvp, "..", NULL); - goto unlock; + 
mtx_unlock(dvlp); + return (0); } if ((cnp->cn_flags & MAKEENTRY) == 0) { - if (!wlocked && !CACHE_UPGRADE_LOCK()) - goto wlock; - if (dvp->v_cache_dd->nc_flag & NCF_ISDOTDOT) - cache_zap(dvp->v_cache_dd); - dvp->v_cache_dd = NULL; - CACHE_WUNLOCK(); + if ((ncp->nc_flag & NCF_ISDOTDOT) != 0) { + if (ncp->nc_dvp != dvp) + panic("dvp %p v_cache_dd %p\n", dvp, ncp); + if (!cache_zap_locked_vnode_kl2(ncp, + dvp, &dvlp2)) + goto retry_dotdot; + MPASS(dvp->v_cache_dd == NULL); + mtx_unlock(dvlp); + if (dvlp2 != NULL) + mtx_unlock(dvlp2); + cache_free(ncp); + } else { + dvp->v_cache_dd = NULL; + mtx_unlock(dvlp); + if (dvlp2 != NULL) + mtx_unlock(dvlp2); + } return (0); } - ncp = dvp->v_cache_dd; - if (ncp->nc_flag & NCF_ISDOTDOT) - *vpp = ncp->nc_vp; - else + if ((ncp->nc_flag & NCF_ISDOTDOT) != 0) { + if (ncp->nc_flag & NCF_NEGATIVE) + *vpp = NULL; + else + *vpp = ncp->nc_vp; + } else *vpp = ncp->nc_dvp; /* Return failure if negative entry was found. */ if (*vpp == NULL) @@ -581,10 +1201,12 @@ retry_wlocked: nc_dotdottime; goto success; } - } else if (!wlocked) - CACHE_RLOCK(); + } hash = cache_get_hash(cnp->cn_nameptr, cnp->cn_namelen, dvp); + blp = HASH2BUCKETLOCK(hash); + rw_rlock(blp); + LIST_FOREACH(ncp, (NCHHASH(hash)), nc_hash) { counter_u64_add(numchecks, 1); if (ncp->nc_dvp == dvp && ncp->nc_nlen == cnp->cn_namelen && @@ -607,15 +1229,11 @@ retry_wlocked: /* We don't want to have an entry, so dump it */ if ((cnp->cn_flags & MAKEENTRY) == 0) { counter_u64_add(numposzaps, 1); - if (!wlocked && !CACHE_UPGRADE_LOCK()) - goto wlock; - cache_zap(ncp); - CACHE_WUNLOCK(); - return (0); + goto zap_and_exit; } /* We found a "positive" match, return the vnode */ - if (ncp->nc_vp) { + if (!(ncp->nc_flag & NCF_NEGATIVE)) { counter_u64_add(numposhits, 1); *vpp = ncp->nc_vp; CTR4(KTR_VFS, "cache_lookup(%p, %s) found %p via ncp %p", @@ -630,43 +1248,19 @@ negative_success: /* We found a negative match, and want to create it, so purge */ if (cnp->cn_nameiop == CREATE) { counter_u64_add(numnegzaps, 1); - if (!wlocked && !CACHE_UPGRADE_LOCK()) - goto wlock; - cache_zap(ncp); - CACHE_WUNLOCK(); - return (0); + goto zap_and_exit; } *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***