Date: Tue, 28 Oct 2014 14:18:41 +0100
From: Attilio Rao <attilio@freebsd.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Adrian Chadd <adrian@freebsd.org>, Alan Cox <alc@rice.edu>, Konstantin Belousov <kib@freebsd.org>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject: Re: atomic ops
Message-ID: <CAJ-FndCWZt7YwFswt70QvbXA5c8Q_cYME2m3OwHTjCv8Nu3s=Q@mail.gmail.com>
In-Reply-To: <20141028025222.GA19223@dft-labs.eu>
References: <20141028025222.GA19223@dft-labs.eu>
On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> As was mentioned some time ago, our situation related to atomic ops is
> not ideal.
>
> atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
> full memory barriers, which is stronger than needed.
>
> Moreover, load is implemented as lock cmpxchg on var address, so it is
> additionally slower, especially when CPUs compete.

I already explained this once privately: full memory barriers are not stronger than needed. FreeBSD has different semantics than Linux. We historically enforce a full barrier on _acq() and _rel() rather than just a read or write barrier, hence we need a different implementation than Linux. There is code that relies on this property, like the locking primitives (releasing a mutex, for instance).

In short: optimizing the implementation for performance is fine and due. Changing the semantics is not fine, unless you have reviewed and fixed all the uses of _rel() and _acq().

> On amd64 it is sufficient to place a compiler barrier in such cases.
>
> Next, we lack some atomic ops in the first place.
>
> Let's define some useful terms:
> smp_wmb - no writes can be reordered past this point
> smp_rmb - no reads can be reordered past this point
>
> With this in mind, we lack ops which would guarantee only the following:
>
> 1. var = tmp; smp_wmb();
> 2. tmp = var; smp_rmb();
> 3. smp_rmb(); tmp = var;
>
> This matters since what we can use already to emulate this is way
> heavier than needed on aforementioned amd64 and most likely other archs.

I can see the value of such barriers in case you want to synchronize only reads or only writes. I also believe that on the newest Intel processors (for which we should optimize) rmb() and wmb() got significantly faster than mb(). However, the most interesting cases would be ARM and MIPS, I assume. That's where you would see a bigger performance difference if you optimize the membar paths.
Last time I looked into it, the Linux-ish rmb()/wmb()/etc. in the FreeBSD kernel were used primarily in three places: Linux-derived code, handling of 16-bit operands, and the implementation of "faster" bus barriers. Initially I had thought about just confining the smp_*() functions to a Linux compat layer and fixing the other two this way: for 16-bit operands, just pad to 32 bits, as the C11 standard also does; for the bus barriers, just grow more versions that actually include the rmb()/wmb() scheme.

At this point, I understand we may want to instead support the concept of write-only or read-only barriers. This means that if we want to keep the concept tied to the current _acq()/_rel() scheme, we will end up with a KPI explosion. I'm not the one making the call here, but for a faster and more granular approach, possibly we can end up using smp_rmb() and smp_wmb() directly. As I said, I'm not the one making the call.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein