Date: Mon, 29 Oct 2012 16:38:23 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Attilio Rao <attilio@freebsd.org>
Cc: src-committers@freebsd.org, John Baldwin <jhb@freebsd.org>,
    Jeff Roberson <jeff@freebsd.org>, Florian Smeets <flo@freebsd.org>,
    Bruce Evans <bde@freebsd.org>, svn-src-projects@freebsd.org
Subject: Re: svn commit: r238907 - projects/calloutng/sys/kern
Message-ID: <20121029155136.O943@besplex.bde.org>
In-Reply-To: <CAJ-FndDPLmkpAJeGVN2wgbhdgHYezyUV-PPvH9e-CA7Go7HG3A@mail.gmail.com>
References: <201207301350.q6UDobCI099069@svn.freebsd.org>
    <CAJ-FndBj8tpC_BJXs_RH8sG2TBG8yA=Lxu3-GTVT9Ap_zOCuVQ@mail.gmail.com>
    <CAJ-FndDnO7wjnWPV0tTu+UGHjsxa3YDarMxmyei3ZmjLAFvRkQ@mail.gmail.com>
    <201207301732.33474.jhb@freebsd.org>
    <CAJ-FndD5EO12xsWOAe6u0EvX00q33wxO4OivnGjzj0=T2Oe8uA@mail.gmail.com>
    <CAJ-FndCRg0UCThFkatp=tw7rUWWCvhsApLE=iztLpxpGBC1F9w@mail.gmail.com>
    <CAJ-FndBqV2uD8Th9ePtxyJwhMAPzY3AXA5cQ7HszLp=+fSpHTA@mail.gmail.com>
    <CAJ-FndDPLmkpAJeGVN2wgbhdgHYezyUV-PPvH9e-CA7Go7HG3A@mail.gmail.com>
On Mon, 29 Oct 2012, Attilio Rao wrote:

> Now that sched_pin()/sched_unpin() are fixed I would like to introduce
> this new patch, making critical_enter()/critical_exit() inline:
> http://www.freebsd.org/~attilio/inline_critical.patch
>
> The concept is pretty simple: the simple add/dec for critical_enter/exit
> is inlined, and the rest is in a "hard path".  Debugging enables the
> hard paths by default (really I think that only KTR may be due here, but
> I thought that in case of INVARIANTS this was also wanted, so I added
> the check also for that case).
>
> flo@ has stress-tested the patch already and he may be starting
> serious benchmarks soon.
>
> The effectiveness of this patch is still to be determined, and it brings
> up the question again: is a hard function better, or the use of compiler
> membars to avoid issues?  I think that this patch should go in only if
> we can absolutely prove a measurable performance bump; otherwise I will
> just add a note to critical_enter()/critical_exit() explaining why they
> should not be inlined at all.

Inlining of mtx_lock_spin() is bogus unless critical_enter() is inlined.
Similarly for mtx_unlock_spin() and critical_exit().  It saves 1 function
call, but critical_enter() does a function call anyway.  critical_exit()
also has a branch in it that might cost more than the function call just
for mispredicting it.

My version goes the other way and uninlines mtx_lock_spin() and
mtx_unlock_spin().  Then it inlines (open codes) critical_enter() and
critical_exit() in them.  This saves a lot of text space and thus
hopefully saves time too.  I couldn't find any cases where it either
saved or lost a significant amount of time.  The main cause of time
differences is probably mispredicted branches: with mtx*() inlined, the
branches in it can be predicted separately, so they are more predictable.
However, when the inlined copies are separate, there may be enough of
them to thrash the branch predictor.

It is probably best to inline mtx*() only for a few heavily-used locks,
and to inline critical*() only if it is either in one of these or in a
non-inline function.  Otherwise, inlining mainly gives code bloat that is
harmless only if the code is rarely actually used.  But this is hard to
configure.  <sys/mutex.h> is already a mess to support uninlining for
debugging cases.

OTOH, I couldn't get uninlining of mtx_lock() and mtx_unlock() to work
out.  These are a little smaller than the spinlock versions, mainly since
they don't inline critical*().  The loss from uninlining them cannot be
compensated for by uninlining critical*() in them, and I found just 1
workload where uninlining them was slower: udp packets-per-second
bandwidth tests.

Uninlining gives a very measurable space debloatage jump :-).  Time
performance changes from uninlining and inlining are difficult to measure
and more difficult to verify, since they are very machine-dependent,
load-dependent, and cache-dependent.  While I was measuring, the packet
bandwidth test was varying by up to 30% due to changes like adding 4
bytes in unused code or data.  I attribute this to more cache misses, but
couldn't find exactly where they were.  (This was mostly on AthlonXP and
Athlon64 systems, where the caches are large but not very associative.
The low associativity tends to cause more cache misses in
microbenchmarks, since although a microbenchmark's data set is
unrealistically small, it is not small enough to fit perfectly with low
associativity.)  The only consistency was that -current was slower than
my version and kept getting slower.
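For concreteness, the two shapes being compared look roughly like this.
This is a sketch only, not the actual patch: it assumes the usual
td_critnest nesting count and td_owepreempt flag in struct thread and
FreeBSD's __compiler_membar(); critical_exit_hard() and
mtx_lock_spin_uninlined() are hypothetical names, and the contested-lock
and debugging (KTR/INVARIANTS) paths are omitted.

	/*
	 * Assumes kernel context: <sys/param.h>, <sys/proc.h>,
	 * <sys/mutex.h>, <machine/atomic.h>, <machine/cpu.h>.
	 */

	/*
	 * Inline fast path: just bump the nesting count.  The compiler
	 * membar only stops the compiler from moving memory accesses
	 * across it; there is no hard function call.
	 */
	static __inline void
	critical_enter(void)
	{

		curthread->td_critnest++;
		__compiler_membar();
	}

	/*
	 * Inline fast path plus the branch to an uncommon slow path.
	 * This is the branch that can cost more than a function call
	 * when it is mispredicted.
	 */
	static __inline void
	critical_exit(void)
	{
		struct thread *td = curthread;

		__compiler_membar();
		if (td->td_critnest == 1 && td->td_owepreempt)
			/* Hypothetical out-of-line preemption path. */
			critical_exit_hard(td);
		else
			td->td_critnest--;
	}

The opposite arrangement uninlines the spinlock operation and open codes
the critical section inside it, so the common case costs exactly one
function call and the text for it exists in one place:

	/*
	 * Uninlined spin lock with critical_enter() open coded in it
	 * (uncontested case only; spinning backoff omitted).
	 */
	void
	mtx_lock_spin_uninlined(struct mtx *m)
	{
		struct thread *td = curthread;

		td->td_critnest++;	/* open-coded critical_enter() */
		__compiler_membar();
		while (!atomic_cmpset_acq_ptr(&m->mtx_lock, MTX_UNOWNED,
		    (uintptr_t)td))
			cpu_spinwait();	/* spin until the owner releases */
	}

Either way, the interesting costs are the exit-path branch and the call
overhead, and which one wins depends on how well the branch predicts
under real loads, which is exactly what the benchmarks above failed to
settle.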
I attribute -current's creeping slowness to general kernel bloat.  When
the CPU resources are held fixed, the creeping bloat steals more
resources from the fast path.  Notes about this shouldn't say more than
that it didn't work when you tried it.  Doubling of CPU resources in next
year's CPUs might make everything fit in the fast path, until the
creeping bloat catches up.

Bruce