Date: Sat, 16 Jul 2016 17:17:39 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: Ian Lepore <ian@freebsd.org>
Cc: freebsd-current@freebsd.org
Subject: Re: [PATCH] microoptimize locking primitives by introducing randomized delay between atomic ops
Message-ID: <20160716151739.GA23095@dft-labs.eu>
In-Reply-To: <1468161121.72182.115.camel@freebsd.org>
References: <20160710111326.GA7853@dft-labs.eu> <1468161121.72182.115.camel@freebsd.org>
On Sun, Jul 10, 2016 at 08:32:01AM -0600, Ian Lepore wrote:
> On Sun, 2016-07-10 at 13:13 +0200, Mateusz Guzik wrote:
> > If the lock is contended, primitives like __mtx_lock_sleep will spin
> > checking if the owner is running or the lock was freed. The problem
> > is that once it is discovered that the lock is free, multiple CPUs
> > are likely to try the atomic op at the same time, which makes it more
> > costly for everyone, and throughput suffers.
> >
> > The standard thing to do is to have some sort of randomized delay so
> > that this kind of behaviour is reduced.
> >
> > As such, below is a trivial hack which takes cpu_ticks() into account
> > and performs % 2048, which in my testing gives reasonably good
> > results.
> >
> > Please note there is definitely way more room for improvement in
> > general.
> >
> > In terms of results, there was no statistically significant change in
> > -j 40 buildworld nor buildkernel.
> >
> > However, a 40-way find on a ports tree placed on tmpfs yielded the
> > following:
> >
> > x vanilla
> > + patched
> > [ministat histogram elided; the patched samples all fall below the
> > vanilla ones]
> >     N        Min        Max     Median        Avg      Stddev
> > x  20     12.431     15.952     14.897    14.7444  0.74241657
> > +  20      8.103     11.863     9.0135    9.44565   1.0059484
> > Difference at 95.0% confidence
> >         -5.29875 +/- 0.565836
> >         -35.9374% +/- 3.83764%
> >         (Student's t, pooled s = 0.884057)
> >
> > The patch:
> [...]
>
> What about platforms that don't have a useful implementation of
> cpu_ticks()?

Do we have such platforms, and do they have SMP?

> What about platforms that don't suffer the large expense for atomic ops
> that x86 apparently does?
The current state of locking primitives already seems to be x86-centric.
Postponing of atomic ops is already implemented in some places and this
patch only extends it (in a different form). That said, if we have
platforms where this kind of thing is detrimental to performance,
machine-specific primitives should be introduced.

Meanwhile, courtesy of andrew@, I tested the patch on cavium (48-way
arm64) and saw a great improvement:

x vanilla
+ patched
[ministat histogram elided; the patched samples all fall below the
vanilla ones]
    N        Min        Max     Median        Avg       Stddev
x  10      17.25     17.849      17.48    17.4968   0.19581556
+  10       6.56      6.679      6.586     6.6011  0.038013009
Difference at 95.0% confidence
        -10.8957 +/- 0.132528
        -62.2725% +/- 0.757439%
        (Student's t, pooled s = 0.141047)

Note: find does open+close a lot, and close results in exclusive vnode
locking if the filesystem does not have the MNTK_EXTENDED_SHARED flag
set, which is the case on tmpfs. On this machine that contributed to a
major slowdown, so the flag was set locally. I'm not sure yet how safe
that change is for general use, but it is definitely fine for the
benchmark.

That said, I would like to commit this next week unless there are
objections.

-- 
Mateusz Guzik <mjguzik gmail.com>