Date: Wed, 25 Jan 2017 06:10:53 +0100
From: Mateusz Guzik <mjguzik@gmail.com>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Mateusz Guzik <mjg@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r312724 - in head/sys: sys vm
Message-ID: <20170125051052.GA9675@dft-labs.eu>
In-Reply-To: <20170125130904.Q857@besplex.bde.org>
References: <201701242200.v0OM0GRO042084@repo.freebsd.org> <20170125130904.Q857@besplex.bde.org>
On Wed, Jan 25, 2017 at 01:29:21PM +1100, Bruce Evans wrote:
> On Tue, 24 Jan 2017, Mateusz Guzik wrote:
>
> >Log:
> > hwpmc: partially depessimize munmap handling if the module is not loaded
> >
> > HWPMC_HOOKS is enabled in GENERIC and triggers some work avoidable in the
> > common (module not loaded) case.
> >...
> >Modified: head/sys/sys/pmckern.h
> >==============================================================================
> >--- head/sys/sys/pmckern.h	Tue Jan 24 21:48:57 2017	(r312723)
> >+++ head/sys/sys/pmckern.h	Tue Jan 24 22:00:16 2017	(r312724)
> >@@ -174,6 +174,9 @@ extern const int pmc_kernel_version;
> > /* PMC soft per cpu trapframe */
> > extern struct trapframe pmc_tf[MAXCPU];
> >
> >+/* Quick check if preparatory work is necessary */
> >+#define PMC_HOOK_INSTALLED(cmd)	__predict_false(pmc_hook != NULL)
>
> I'm still waiting for other __predict_ugly() macro invocations to be
> removed.

Actually pmc was already using the annotation, so this fits the local
style.

> The 2 new ones here have even less effect than most.  I couldn't measure
> the effect on makeworld of removing the PMC_HOOKS completely.  Removing
> KTRACE long ago also seemed to have no effect.  Unfortunately, it is
> impossible to remove procctl and other bloat that has grown in the
> syscall path, and it is easy to measure slowdowns from this.

The kernel has a lot of systematic single- and multi-threaded slowness,
and it is unlikely that some branch removals/predictions in isolation
will make a measurable difference in a macrobenchmark.

Most of the slowdown is caused by avoidable atomic ops, which arguably
trump spurious branches even if the target cacheline is not ping-ponged
around. However, when a spurious or mostly-false branch shows up, I see
no reason not to plug it. There is work under way on proper hotpatching
support which will hopefully make plenty of branches go away.

Syscall handling is slowed down by a lot of branches which are
typically false. Here the solution is to group them under one condition
which, if true, takes the kernel to the slow (current) path. That is,
ktrace, ptrace, capsicum and whatever else would set a
->td_specialhandling (or similar) flag/bitfield in addition to whatever
they set now. There is additional slowness from packing/unpacking all
the arguments, but I don't know if that is worth changing.
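To make that concrete, a minimal sketch of the grouping, with made-up
names (td_specialhandling, the TDS_* bits and the _slow helper do not
exist today; only the surrounding syscall_args plumbing does):

	/*
	 * Hypothetical per-thread bitfield, set by the subsystems that
	 * need per-syscall work.  None of these names are existing API.
	 */
	#define	TDS_KTRACE	0x01	/* ktrace wants to log this syscall */
	#define	TDS_PTRACE	0x02	/* a debugger is attached */
	#define	TDS_CAPSICUM	0x04	/* capability-mode checks apply */

	static int
	syscallenter(struct thread *td, struct syscall_args *sa)
	{

		/*
		 * One predictable test replaces the string of individual
		 * ktrace/ptrace/capsicum/... checks; any facility that
		 * needs per-syscall work sets its bit when it attaches
		 * to the thread and clears it when it lets go.
		 */
		if (__predict_false(td->td_specialhandling != 0))
			return (syscallenter_slow(td, sa));

		/* Fast path: dispatch straight into the handler. */
		return (sa->callp->sy_call(td, sa->args));
	}

This way the common case pays for exactly one well-predicted branch
instead of one per facility.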
For non-static syscalls, their handling can be postponed. Instead of
putting such syscalls directly into the table, they can go through a
proxy method which would handle the reference counting. That is, the
knowledge of such syscalls would be removed from the common path
completely. It is only a matter of finding a nice way to get back the
syscall number.

I am *NOT* working on this though.

For the vfs layer, slowness includes:
- vget grabs both the hold count and the use count, which gives 2
  atomic ops. Then they have to be released, and that's another 2. I
  had a patch which made a usecount imply a holdcnt, but it rotted and
  is only a part of the real fix.
- dropping references to vnodes too easily causes them to be reshuffled
  on the free list, which adds single-threaded slowness due to more
  work and a lock contention point. I don't have a good solution; the
  total hack I came up with so far boils down to grabbing an extra
  reference in vget and letting the syncer drop it some time later.
- similarly, vputx relocks the vnode in exclusive mode to call the
  inactive routine. Said routine very often does nothing and/or gets
  away with a mere interlock. I have a hack which extends the interface
  so that the filesystem can be asked whether it wants to do inactive
  processing at all.
- VOP_* routines have their arguments packed, which the target fs has
  to unpack before use. The wrappers have several avoidable and mostly
  false branches. I would argue a new interface is needed.
- the main lockmgr routine takes 8 arguments, 2 more than what is
  passable in registers on amd64 with the SysV ABI. Then it proceeds to
  perform several branches. I have a patch introducing a fast path
  which avoids all of that and falls back to the original lockmgr if
  there is any trouble getting the lock uncontested. This gave me +4%
  more ops in a single-threaded stat benchmark. There is a
  LOCK_PROFILING bug somewhere in there I have to fix before
  committing.
- lockmgr locks are not adaptive. The facility itself does support the
  feature, but the code is disabled and no fs annotates its vnodes to
  use it if present.
- closing a vnode performs vn_start_write + vnode lock + VOP_CLOSE. The
  latter almost always has nothing to do, making the preceding work
  unnecessary. In a spirit similar to inactive handling, we can ask the
  fs what it wants to do. Or better yet, revamp the interface so that
  the fs calls the relevant helpers to do the locking it needs.

and so on.

tl;dr the main point of this patch was to not lock pmc_sx if it can be
helped. I presume you don't have objections here. The annotations and
the shifting of work around were done to combat part of the systematic
slowness. It is unclear whether you claim that such actions have no
effect in general, or just happen to not be measurable in isolation.
They are not easily visible because there is plenty of other slowness
in significantly more used places.

> The above one is an even better obfuscation than most.  It is only
> invoked once, and it is context-dependent whether the false branch is
> the unusual case.

The HWPMC_HOOKS option is present in the default kernel. I highly doubt
the majority of users load the hwpmc module to take advantage of it.

Things like this tend to get hotpatched, which will not be any faster
than the current branch if it has to be taken.

> >+	pkm.pm_address = (uintptr_t) NULL;
> >+	if (vm_map_lookup_entry(map, addr, &entry)) {
> >+		for (;
> >+		    entry != &map->header && entry->start < addr + size;
> >+		    entry = entry->next) {
> >+			if (vm_map_check_protection(map, entry->start,
> >+			    entry->end, VM_PROT_EXECUTE) == TRUE) {
> >+				pkm.pm_address = (uintptr_t) addr;
> >+				pkm.pm_size = (size_t) size;
> >+				break;
> >+			}
> >		}
> >	}
> >}
>
> Predictions could also be implemented in a more hard-coded way using
> 'goto slowcase' and moving the slow case out of the way (assuming that
> the hardware predicts forward branches as not taken and the compiler
> doesn't reorder the code anyway).  This would be uglier, but not much
> more invasive than re-indenting the code after adding a test for the
> slow case.

I tried that. clang folds the slow case back in if the hint is not
provided. That said, it still can be done to reduce the indentation
level.

-- 
Mateusz Guzik <mjguzik gmail.com>