Date: Wed, 24 Sep 2008 12:12:09 -0400 From: John Baldwin <jhb@freebsd.org> To: freebsd-hackers@freebsd.org Cc: Jeff Wheelhouse <freebsd-hackers@wheelhouse.org> Subject: Re: Major SMP problems with lstat/namei Message-ID: <200809241212.09920.jhb@freebsd.org> In-Reply-To: <8185F68B-C443-4891-BEC2-5E3D453DDC93@wheelhouse.org> References: <8185F68B-C443-4891-BEC2-5E3D453DDC93@wheelhouse.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Wednesday 24 September 2008 12:52:59 am Jeff Wheelhouse wrote: > > We have encountered some serious SMP performance/scalability problems > that we've tracked back to lstat/namei calls. I've written a quick > benchmark with a pair of tests to simplify/measure the problem. Both > tests use a tree of directories: the top level directory contains five > subdirectories a, b, c, d, and e. Each subdirectory contains five > subdirectories a, b, c, d, and e, and so on.. 1 directory at level > one, 5 at level two, 25 at level three, 125 at level four, 625 at > level five, and 3125 at level six. > > In the "realpath" test, a random path is constructed at the bottom of > the tree (e.g. /tmp/lstat/a/b/c/d/e) and realpath() is called on that, > provoking lstat() calls on the whole tree. This is to simulate a mix > of high-contention and low-contention lstat() calls. > > In the "lstat" test, lstat is called directly on a path at the bottom > of the tree. Since there are 3125 files, this simulates relatively > low-contention lstat() calls. > > In both cases, the test repeats as many times as possible for 60 > seconds. Each test is run simultaneously by multiple processes, with > progressively doubling concurrency from 1 to 512. > > What I found was that everything is fine at concurrency 2, probably > indicating that the benchmark pegged on some other resource limit. At > concurrency 4, realpath drops to 31.8% of concurrency 1. At > concurrency 8, performance is down to 18.3%. In the interim, CPU load > goes to 80-90% system CPU. I've confirmed via ktrace and the rusage > that the CPU usage is all system time, and that lstat() is the *only* > system call in the test (realpath() is called with an absolute path). > > I then reran the 32-process test on 1-7 cores, and found that > performance peaks at 2 cores and drops sharply from there. eight > cores runs *fifteen* times slower than two cores. > > The test full results are at the bottom of this message. > > This is on 6.3-RELEASE-p4 with vfs.lookup_shared=1. Shared lookups only work on the NFS client in 6.x. I'm about to turn them on for UFS in HEAD (8.x) and will backport the needed fixes to 7.x after 7.1 (too risky to merge to 7.x this close to a release). So lookup_shared=1 isn't going to really help on 6.x unless you are doing it all over NFS. You also want to backport my fix to cache_enter() before using lookup_shared at all: jhb 2008-08-23 15:13:39 UTC FreeBSD src repository Modified files: sys/kern vfs_cache.c Log: SVN rev 182061 on 2008-08-23 15:13:39Z by jhb Fix a race condition with concurrent LOOKUP namecache operations for a vnode not in the namecache when shared lookups are enabled (vfs.lookup_shared=1, it is currently off by default) and the filesystem supports shared lookups (e.g. NFS client). Specifically, if multiple concurrent LOOKUPs both miss in the name cache in parallel, each of the lookups may each end up adding an entry to the namecache resulting in duplicate entries in the namecache for the same pathname. A subsequent removal of the mapping of that pathname to that vnode (via remove or rename) would only evict one of the entries from the name cache. As a result, subseqent lookups for that pathname would still return the old vnode. This race was observed with shared lookups over NFS where a file was updated by writing a new file out to a temporary file name and then renaming that temporary file to the "real" file to effect atomic updates of a file. Other processes on the same client that were periodically reading the file would occasionally receive an ESTALE error from open(2) because the VOP_GETATTR() in nfs_open() would receive that error when given the stale vnode. The fix here is to check for duplicates in cache_enter() and just return if an entry for this same directory and leaf file name for this vnode is already in the cache. The check for duplicates is done by walking the per-vnode list of name cache entries. It is expected that this list should be very small in the common case (usually 0 or 1 entries during a cache_enter() since most files only have 1 "leaf" name). Reviewed by: ups, scottl MFC after: 2 months Revision Changes Path 1.124 +33 -9 src/sys/kern/vfs_cache.c If you want to try the UFS stuff on 7, you would need to probably backport at least the following, maybe more: jeff 2008-04-11 09:44:25 UTC FreeBSD src repository Modified files: sys/ufs/ufs ufs_lookup.c Log: - cache dp->i_offset in the local 'i_offset' variable for use in loop indexes so directory lookup becomes shared lock safe. In the modifying cases an exclusive lock is held here so the commit routine may rely on the state of i_offset. - Similarly handle i_diroff by fetching at the start and setting only once the operation is complete. Without the exclusive lock these are only considered hints. - Assert that an exclusive lock is held when we're preparing for a commit routine. - Honor the lock type request from lookup instead of always using exclusive locking. Tested by: pho, kris Revision Changes Path 1.87 +48 -29 src/sys/ufs/ufs/ufs_lookup.c jeff 2008-04-22 12:34:16 UTC FreeBSD src repository Modified files: sys/ufs/ufs inode.h ufs_lookup.c Log: - Use a local variable for i_ino in ufs_lookup. It is only used to communicate between two parts of this one function. This was causing problems with shared lookups as each would trash the ino value in the inode. - Remove the unused i_ino field from the inode structure. Revision Changes Path 1.53 +0 -1 src/sys/ufs/ufs/inode.h 1.88 +10 -13 src/sys/ufs/ufs/ufs_lookup.c jhb 2008-07-30 21:07:56 UTC FreeBSD src repository Modified files: sys/ufs/ufs ufs_lookup.c Log: SVN rev 181018 on 2008-07-30 21:07:56Z by jhb Whitespace tweak. Revision Changes Path 1.90 +0 -1 src/sys/ufs/ufs/ufs_lookup.c jhb 2008-09-16 16:18:36 UTC FreeBSD src repository Modified files: sys/ufs/ufs ufs_lookup.c Log: SVN rev 183079 on 2008-09-16 16:18:36Z by jhb - Only set i_offset in the parent directory's i-node during a lookup for non-LOOKUP operations. - Relax a VOP assertion for a DELETE lookup. rename() uses WANTPARENT instead of LOCKPARENT when looking up the source pathname. ufs_rename() uses a relookup() to lock the parent directory when it decides to finally remove the source path. Thus, it is ok for a DELETE with WANTPARENT set instead of LOCKPARENT to use a shared vnode lock rather than an exclusive vnode lock. Reported by: kris (2) Reviewed by: jeff Revision Changes Path 1.91 +9 -3 src/sys/ufs/ufs/ufs_lookup.c jhb 2008-09-16 19:06:44 UTC FreeBSD src repository Modified files: sys/ufs/ufs inode.h ufs_lookup.c Log: SVN rev 183093 on 2008-09-16 19:06:44Z by jhb Retire the 'i_reclen' field from the in-memory i-node. Previously, during a DELETE lookup operation, lookup would cache the length of the directory entry to be deleted in 'i_reclen'. Later, the actual VOP to remove the directory entry (ufs_remove, ufs_rename, etc.) would call ufs_dirremove() which extended the length of the previous directory entry to "remove" the deleted entry. However, we always read the entire block containing the directory entry when doing the removal, so we always have the directory entry to be deleted in-memory when doing the update to the directory block. Also, we already have to figure out where the directory entry that is being removed is in the block so that we can pass the component name to the dirhash code to update the dirhash. So, instead of passing 'i_reclen' from ufs_lookup() to the ufs_dirremove() routine, just read the 'd_reclen' field directly out of the entry being removed when updating the length of the previous entry in the block. This avoids a cosmetic issue of writing to 'i_reclen' while holding a shared vnode lock. It also slightly reduces the amount of side-band data passed from ufs_lookup() to operations updating a directory via the directory's i-node. Reviewed by: jeff Revision Changes Path 1.54 +0 -1 src/sys/ufs/ufs/inode.h 1.92 +9 -6 src/sys/ufs/ufs/ufs_lookup.c jeff 2008-04-11 09:48:12 UTC FreeBSD src repository Modified files: sys/ufs/ufs dirhash.h ufs_dirhash.c Log: - Use a lockmgr lock rather than a mtx to protect dirhash. This lock may be held for the duration of the various dirhash operations which avoids many complex unlock/lock/revalidate sequences. - Permit shared locks on lookup. To protect the ip->i_dirhash pointer we use the vnode interlock in the shared case. Callers holding the exclusive vnode lock can run without fear of concurrent modification to i_dirhash. - Hold an exclusive dirhash lock when creating the dirhash structure for the first time or when re-creating a dirhash structure which has been recycled. Tested by: kris, pho Revision Changes Path 1.6 +2 -1 src/sys/ufs/ufs/dirhash.h 1.24 +289 -227 src/sys/ufs/ufs/ufs_dirhash.c jhb 2008-09-16 16:23:56 UTC FreeBSD src repository Modified files: sys/ufs/ufs dirhash.h ufs_dirhash.c Log: SVN rev 183080 on 2008-09-16 16:23:56Z by jhb Fix a race with shared lookups on UFS. If the the dirhash code reached the cap on memory usage, then shared LOOKUP operations could start free'ing dirhash structures. Without these fixes, concurrent free's on the same directory could result in one of the threads blocked on a lock in a dirhash structure free'd by the other thread. - Replace the lockmgr lock in the dirhash structure with an sx lock. - Use a reference count managed with ufsdirhash_hold()/drop() to determine when to free the dirhash structures. The directory i-node holds a reference while the dirhash is attached to an i-node. Code that wishes to lock the dirhash while holding a shared vnode lock must first acquire a private reference to the dirhash while holding the vnode interlock before acquiring the dirhash sx lock. After acquiring the sx lock, it drops the private reference after checking to see if the dirhash is still used by the directory i-node. Revision Changes Path 1.7 +5 -1 src/sys/ufs/ufs/dirhash.h 1.25 +82 -33 src/sys/ufs/ufs/ufs_dirhash.c jhb 2008-09-22 20:53:22 UTC FreeBSD src repository Modified files: sys/ufs/ufs ufs_dirhash.c Log: SVN rev 183280 on 2008-09-22 20:53:22Z by jhb Close a race between concurrent calls to ufsdirhash_recycle() and ufsdirhash_free() introduced in my last commit by removing the dirhash about to be free'd in ufsdirhash_free() from the global dirhash list before dropping the sx lock. Tested by: kris Revision Changes Path 1.26 +10 -5 src/sys/ufs/ufs/ufs_dirhash.c There are additional fixes needed to fix races with umount -f, so if you backport all this stuff, don't use umount -f or you risk panics. :) Also, you will need to set the flag in the mount flags to enable shared lookups in the mount VOP in ffs_vfsops.c: --- //depot/projects/smpng/sys/ufs/ffs/ffs_vfsops.c 2008/08/25 16:33:41 +++ //depot/user/jhb/lock/ufs/ffs/ffs_vfsops.c 2008/08/29 15:04:03 @@ -852,7 +852,7 @@ * Initialize filesystem stat information in mount struct. */ MNT_ILOCK(mp); - mp->mnt_kern_flag |= MNTK_MPSAFE; + mp->mnt_kern_flag |= MNTK_MPSAFE | MNTK_LOOKUP_SHARED; MNT_IUNLOCK(mp); #ifdef UFS_EXTATTR #ifdef UFS_EXTATTR_AUTOSTART For 6.x you could in theory backport all of this as well, but there may be other fixes needed as well for 6.x. I'm only planning on merging this stuff back to 7.x myself. -- John Baldwin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200809241212.09920.jhb>