Date: Sun, 5 Oct 2014 08:37:51 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Cc: Alan Cox <alc@freebsd.org>, Konstantin Belousov <kostikbel@gmail.com>,
	Johan Schuijt <johan@transip.nl>,
	"freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory barriers.
Message-ID: <20141005063750.GA9262@dft-labs.eu>
In-Reply-To: <CAJ-FndAHawNWC+Yh2BRtmg4e-f3dUdRVonScwfaADABqWuF3Tg@mail.gmail.com>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
 <1408064112-573-2-git-send-email-mjguzik@gmail.com>
 <20140816093811.GX2737@kib.kiev.ua>
 <20140816185406.GD2737@kib.kiev.ua>
 <20140817012646.GA21025@dft-labs.eu>
 <CAJUyCcPA7ZDNbwyfx3fT7mq3SE7M-mL5he=eXZ8bY3z-xUCJ-g@mail.gmail.com>
 <20141004052851.GA27891@dft-labs.eu>
 <CAJ-FndAHawNWC+Yh2BRtmg4e-f3dUdRVonScwfaADABqWuF3Tg@mail.gmail.com>
On Sat, Oct 04, 2014 at 11:37:16AM +0200, Attilio Rao wrote:
> On Sat, Oct 4, 2014 at 7:28 AM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > Reviving. Sorry everyone for such a big delay, $life.
> >
> > On Tue, Aug 19, 2014 at 02:24:16PM -0500, Alan Cox wrote:
> >> On Sat, Aug 16, 2014 at 8:26 PM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> > Well, my memory-barrier-and-so-on-fu is rather weak.
> >> >
> >> > I had another look at the issue. At least on amd64, it looks like only
> >> > a compiler barrier is required for both reads and writes.
> >> >
> >> > The AMD64 Architecture Programmer’s Manual, Volume 2: System
> >> > Programming, section 7.2 Multiprocessor Memory Access Ordering, states:
> >> >
> >> > "Loads do not pass previous loads (loads are not reordered). Stores do
> >> > not pass previous stores (stores are not reordered)"
> >> >
> >> > Since the modifying code performs only a series of writes and we
> >> > expect exclusive writers, I find it applicable to this scenario.
> >> >
> >> > I checked the Linux sources and the generated assembly; they indeed
> >> > issue only a compiler barrier on amd64 (and on Intel processors as
> >> > well).
> >> >
> >> > atomic_store_rel_int on amd64 seems fine in this regard, but the only
> >> > function for loads issues lock cmpxchg, which kills performance
> >> > (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
> >> >
> >> > Additionally, release and acquire semantics seem to be a stronger
> >> > guarantee than we need.
> >> >
> >>
> >> This statement left me puzzled and got me to look at our x86 atomic.h for
> >> the first time in years. It appears that our implementation of
> >> atomic_load_acq_int() on x86 is, umm ..., unconventional. That is, it is
> >> enforcing a constraint that simple acquire loads don't normally enforce.
> >> For example, the C11 stdatomic.h simple acquire load doesn't enforce this
> >> constraint. Moreover, our own implementation of atomic_load_acq_int() on
> >> ia64, where the mapping from atomic_load_acq_int() to machine instructions
> >> is straightforward, doesn't enforce this constraint either.
> >>
> >
> > By 'this constraint' I presume you mean a full memory barrier.
> >
> > It is unclear to me whether one can just get rid of it currently. It
> > definitely would be beneficial.
> >
> > In the meantime, if for some reason a full barrier is still needed, we
> > can speed up concurrent load_acq of the same variable considerably.
> > There is no need to lock cmpxchg on the same address. We should be able
> > to replace it with +/-:
> > lock add $0,(%rsp);
> > movl ...;
>
> When I looked into some AMD manual (I think the same one which suggests
> using lock add $0,(%rsp)) I recall that the (reported) combined
> instruction latencies of "lock add" + "movl" are higher than that of
> the single "cmpxchg".
> Moreover, I think that the simple movl is going to lock the cache line
> anyway, so I doubt the "lock add" is going to provide any benefit. The
> only benefit I can think of is that with this trick we will be able to
> use _acq() barriers on read-only memory (which is not possible today,
> as the timecounter code can testify).
>
> Whether the latencies of "lock add" + "movl" have changed on the latest
> Intel processors I can't say for sure; it may be worth looking into.
>

I stated in my previous mail that it is faster, and I have a trivial
benchmark to back it up.

In fget_unlocked there is an atomic_load_acq at the beginning (I have
patches which get rid of it, btw).
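For context, the pattern the patch implements, and which the quoted
store-ordering argument is about, looks roughly like the sketch below.
This is a minimal illustration with made-up names, assuming amd64's
ordering rules and a single exclusive writer; it is not the patch
itself:

#include <stdint.h>

typedef uint32_t seqc_t;

/* GCC-style compiler barrier; no fence instruction is emitted. */
#define	compiler_barrier()	__asm __volatile("" ::: "memory")

/*
 * Writer side, assuming a single exclusive writer.  amd64 does not
 * reorder stores with other stores, so compiler barriers suffice to
 * keep the counter updates around the protected data writes.
 */
static inline void
seqc_write_begin(volatile seqc_t *seqcp)
{
	(*seqcp)++;		/* odd: update in progress */
	compiler_barrier();
}

static inline void
seqc_write_end(volatile seqc_t *seqcp)
{
	compiler_barrier();
	(*seqcp)++;		/* even again: update done */
}

/*
 * Reader side.  amd64 does not reorder loads with other loads, so
 * again only compiler barriers are needed.  Usage:
 *
 *	do {
 *		seqc = seqc_read(&sc);
 *		... copy out the protected data ...
 *	} while (!seqc_consistent(&sc, seqc));
 */
static inline seqc_t
seqc_read(volatile const seqc_t *seqcp)
{
	seqc_t ret;

	do {
		ret = *seqcp;
	} while (ret & 1);	/* spin while an update is in progress */
	compiler_barrier();
	return (ret);
}

static inline int
seqc_consistent(volatile const seqc_t *seqcp, seqc_t oldseqc)
{
	compiler_barrier();
	return (*seqcp == oldseqc);
}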
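As for the load itself, here is what the two variants look like as an
illustrative userland sketch (again with made-up names; this is not
the actual atomic.h diff):

#include <stdint.h>

/*
 * Current style: a locked cmpxchg on the target address.  lock
 * cmpxchgl always leaves the old value of *p in %eax and always
 * asserts the lock, so every "reader" performs a locked
 * read-modify-write on the shared cache line and concurrent readers
 * contend with each other.
 */
static inline uint32_t
load_acq_cmpxchg(volatile uint32_t *p)
{
	uint32_t res = 0;

	__asm __volatile("lock; cmpxchgl %0,%1"
	    : "+a" (res), "+m" (*p)
	    : : "memory", "cc");
	return (res);
}

/*
 * Proposed variant: issue the full barrier against the thread's own
 * stack slot, whose cache line is private to the executing CPU, then
 * read the shared variable with a plain movl.  Loads are not
 * reordered with other loads on amd64, so the load needs no fence of
 * its own.
 */
static inline uint32_t
load_acq_lockadd(volatile uint32_t *p)
{
	__asm __volatile("lock; addl $0,(%%rsp)" : : : "memory", "cc");
	return (*p);
}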
After the code is changed to lock add + movl, we get a significant speed
up in a microbenchmark of 15 threads doing read -> fget_unlocked:

x vanilla-readpipe
+ lockadd-readpipe
    N        Min        Max     Median        Avg     Stddev
x  20   11073800   13429593   12266195   12190982  629380.16
+  20   53414354   54152272   53567250   53791945  322012.74
Difference at 95.0% confidence
	4.1601e+07 +/- 319962
	341.244% +/- 2.62458%
	(Student's t, pooled s = 499906)

This is on an Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz.

This seems to make sense, since we only read from the shared area and
the lock add is performed on addresses private to the executing
threads. fwiw, lock cmpxchg on %rsp gives a comparable speed up.

Of course one would need to actually measure this stuff to get a better
idea of what's really going on within the CPU.

-- 
Mateusz Guzik <mjguzik gmail.com>