Date: Wed, 29 Oct 2014 17:33:35 +0100
From: Attilio Rao <attilio@freebsd.org>
To: John Baldwin <jhb@freebsd.org>
Cc: Adrian Chadd <adrian@freebsd.org>, Mateusz Guzik <mjguzik@gmail.com>, Alan Cox <alc@rice.edu>, Andrew Turner <andrew@fubar.geek.nz>, Konstantin Belousov <kib@freebsd.org>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject: Re: atomic ops
Message-ID: <CAJ-FndAxOuA4faFfUUbXkO7aLxNh_EKm6sZ65NE9EnU903GEOQ@mail.gmail.com>
In-Reply-To: <201410291059.16829.jhb@freebsd.org>
References: <20141028025222.GA19223@dft-labs.eu> <20141028175318.709d2ef6@bender.lan> <CAJ-FndCsvLV_B3Q0boyK78980chM79hFf_dRyEqRtxzMJkpD5g@mail.gmail.com> <201410291059.16829.jhb@freebsd.org>

On Wed, Oct 29, 2014 at 3:59 PM, John Baldwin <jhb@freebsd.org> wrote:
> On Tuesday, October 28, 2014 4:08:27 pm Attilio Rao wrote:
>> On Tue, Oct 28, 2014 at 6:53 PM, Andrew Turner <andrew@fubar.geek.nz> wrote:
>> > On Tue, 28 Oct 2014 15:33:06 +0100
>> > Attilio Rao <attilio@freebsd.org> wrote:
>> >> On Tue, Oct 28, 2014 at 3:25 PM, Andrew Turner <andrew@fubar.geek.nz>
>> >> wrote:
>> >> > On Tue, 28 Oct 2014 14:18:41 +0100
>> >> > Attilio Rao <attilio@freebsd.org> wrote:
>> >> >
>> >> >> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik@gmail.com>
>> >> >> wrote:
>> >> >> > As was mentioned some time ago, our situation related to atomic
>> >> >> > ops is not ideal.
>> >> >> >
>> >> >> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
>> >> >> > provide full memory barriers, which is stronger than needed.
>> >> >> >
>> >> >> > Moreover, load is implemented as lock cmpxchg on var address, so
>> >> >> > it is additionally slower, especially when cpus compete.
>> >> >>
>> >> >> I already explained this once privately: full memory barriers are
>> >> >> not stronger than needed.
>> >> >> FreeBSD has different semantics than Linux. We historically
>> >> >> enforce a full barrier on _acq() and _rel() rather than just a
>> >> >> read and write barrier, hence we need a different implementation
>> >> >> than Linux. There is code that relies on this property, like the
>> >> >> locking primitives (release a mutex, for instance).
>> >> >
>> >> > On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
>> >> > there are only full barriers. On both 32 and 64-bit ARMv8 ARM has
>> >> > added support for load-acquire and store-release atomic
>> >> > instructions. For the use in atomic instructions we can assume
>> >> > these only operate on the address passed to them.
>> >> >
>> >> > It is unlikely we will use them in the 32-bit port however I would
>> >> > like to know the expected semantics of these atomic functions to
>> >> > make sure we get them correct in the arm64 port. I have been
>> >> > advised by one of the ARM Linux kernel maintainers on the problems
>> >> > they have found using these instructions but have yet to determine
>> >> > what our atomic functions guarantee.
>> >>
>> >> For FreeBSD the "reference doc" is atomic(9).
>> >> It clearly states:
>> >
>> > There may also be a difference between what it states, how they are
>> > implemented, and what developers assume they do. I'm trying to make
>> > sure I get them correct.
>>
>> atomic(9) is our reference so there might be no difference between
>> what it states and what all architectures implement.
>> I can say that x86 follows atomic(9) well. I'm not competent enough to
>> judge if all the !x86 arches follow it completely.
>> I can understand that developers may get confused. The FreeBSD scheme
>> is pretty unique. It comes from the fact that historically the membar
>> support was made to initially support x86. The super-widespread Linux
>> design, instead, tried to catch all architectures in its description.
>> It became very well known and I think it also "pushed" companies
>> like Intel to invest in improving performance of things like explicit
>> read/write barriers, etc.
>
> Actually, it was designed to support ia64 (and specifically the .acq and
> .rel modifiers on the ld, st, and cmpxchg instructions).  Some of the
> language is wrong (and is my fault) in that they are not "read" and
> "write" barriers.  They truly are "acquire" and "release".  That said,
> x86 has stronger barriers than that, partly because on i386 there weren't
> a whole lot of options (though atomic_store_rel on even i386 should just
> be a simple store).
>
>> >> The second variant of each operation includes a read memory barrier.
>> >> This barrier ensures that the effects of this operation are completed
>> >> before the effects of any later data accesses. As a result, the
>> >> operation is said to have acquire semantics as it acquires a
>> >> pseudo-lock requiring further operations to wait until it has
>> >> completed. To denote this, the suffix ``_acq'' is inserted into the
>> >> function name immediately prior to the ``_<type>'' suffix. For
>> >> example, to subtract two integers ensuring that any later writes will
>> >> happen after the subtraction is performed, use
>> >> atomic_subtract_acq_int().
>> >
>> > It depends on the point we guarantee the acquire barrier to be. On ARMv8
>> > the function will be a load/modify/write sequence. If we use a
>> > load-acquire operation for atomic_subtract_acq_int, for example, for a
>> > pointer P and value to subtract X:
>> >
>> > loop:
>> >  load-acquire *P to N
>> >  perform N = N - X
>> >  store-exclusive N to *P
>> >  if the store failed goto loop
>> >
>> > where N and X are both registers.
>> >
>> > This will mean no access after this loop will happen before it, but
>> > they may happen within it, e.g. if there was a later access A the
>> > following may be possible:
>> >
>> > Load P
>> > Access A
>> > Store P
>>
>> No, this will be broken in FreeBSD if "Access A" is later.
>> If "Access A" is prior to the membar it doesn't really matter if it gets
>> interleaved with any of the operations in the atomic instruction.
>> Ideally, it could even surpass the Store P itself.
>> But if "Access A" is later (and you want to implement an _acq()
>> barrier) then it absolutely cannot get in the middle of the atomic_*
>> operation.
>
> Eh, that isn't broken.  It is subtle however.  The reason it isn't broken
> is that if any access to P occurs after the 'load P', then the store will
> fail and the load-acquire will be retried.  If A was accessed during the
> atomic op, the load-acquire during the try will discard that and force A
> to be re-accessed.
> If P is not accessed during the atomic op, then it is
> safe to access A during the atomic op itself.

This is specific to armv8, which I know 0 about. Good to know.
From a general point of view the description didn't seem ok.

>> > We know the store will happen as if it fails, e.g. another processor
>> > accesses *P, the store will have failed and will iterate over the loop.
>> >
>> > The other point is we can guarantee any store-release, and therefore
>> > any prior access, has happened before a later load-acquire even if it's
>> > on another processor.
>>
>> No, we can never guarantee the visibility of the operations by other CPUs.
>> We just make guarantees on how the operations are posted on the system
>> bus (or how they are locally visible).
>> Keeping in mind that the FreeBSD model comes from x86, you can sense that
>> some things are sized on the x86 model, which doesn't have any rule or
>> ordering on global visibility of the operations.
>
> 1) Again, it's actually based on ia64.
>
> 2) x86 _does_ have rules on ordering of global visibility in that most
>    stores (aside from some SSE special cases) will become visible in
>    program order.  Now, you can't force the _timing_ of when the stores
>    become visible (and this is true in general, in MI code you can't
>    assume that a barrier is equivalent to a cache flush).

Yes, this is what I mean. You can't have a guarantee on the global
timing of the memory accesses.

Attilio

--
Peace can only be achieved by understanding - A. Einstein
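
[Editor's sketch, not part of the original message.] The load/modify/store
retry loop quoted in the thread can be written in portable C11 roughly as
below. This is a minimal sketch under stated assumptions: it uses the C11
<stdatomic.h> memory model rather than FreeBSD's <machine/atomic.h>, the
function name sketch_subtract_acq_int is hypothetical, and C11 acquire
ordering is not claimed to match what atomic(9) guarantees on any given
architecture (the thread notes that x86 historically provides stronger
barriers).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper, NOT the FreeBSD implementation: subtract x from *p
 * with acquire ordering, written as the retry loop from the thread.
 * Returns the value that was stored.
 */
static unsigned int
sketch_subtract_acq_int(_Atomic unsigned int *p, unsigned int x)
{
	unsigned int old, val;

	assert(p != NULL);

	/* "load-acquire *P to N" */
	old = atomic_load_explicit(p, memory_order_acquire);
	do {
		/* "perform N = N - X" */
		val = old - x;
		/*
		 * "store-exclusive N to *P": the store takes effect only if
		 * *p still holds old.  On failure, old is reloaded from *p
		 * (with acquire ordering) and the loop retries, so the
		 * subtraction is always applied to a current value.
		 */
	} while (!atomic_compare_exchange_weak_explicit(p, &old, val,
	    memory_order_acquire, memory_order_acquire));

	return (val);
}
```

With a compare-and-swap loop like this, a write to *p by another thread
between the load and the store makes the store fail and the loop retry,
which is the property John Baldwin's argument relies on. Whether a
compiler lowers it to the exact ARMv8 load-acquire/store-exclusive
sequence is up to the toolchain and target.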