Date:      Wed, 29 Oct 2014 10:58:15 -0600
From:      Ian Lepore <ian@FreeBSD.org>
To:        John Baldwin <jhb@freebsd.org>
Cc:        Adrian Chadd <adrian@freebsd.org>, Mateusz Guzik <mjguzik@gmail.com>, Alan Cox <alc@rice.edu>, Andrew Turner <andrew@fubar.geek.nz>, attilio@freebsd.org, Konstantin Belousov <kib@freebsd.org>, freebsd-arch@freebsd.org
Subject:   Re: atomic ops
Message-ID:  <1414601895.17308.89.camel@revolution.hippie.lan>
In-Reply-To: <201410291059.16829.jhb@freebsd.org>
References:  <20141028025222.GA19223@dft-labs.eu> <20141028175318.709d2ef6@bender.lan> <CAJ-FndCsvLV_B3Q0boyK78980chM79hFf_dRyEqRtxzMJkpD5g@mail.gmail.com> <201410291059.16829.jhb@freebsd.org>

On Wed, 2014-10-29 at 10:59 -0400, John Baldwin wrote:
> On Tuesday, October 28, 2014 4:08:27 pm Attilio Rao wrote:
> > On Tue, Oct 28, 2014 at 6:53 PM, Andrew Turner <andrew@fubar.geek.nz> wrote:
> > > On Tue, 28 Oct 2014 15:33:06 +0100
> > > Attilio Rao <attilio@freebsd.org> wrote:
> > >> On Tue, Oct 28, 2014 at 3:25 PM, Andrew Turner <andrew@fubar.geek.nz>
> > >> wrote:
> > >> > On Tue, 28 Oct 2014 14:18:41 +0100
> > >> > Attilio Rao <attilio@freebsd.org> wrote:
> > >> >
> > >> >> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik@gmail.com>
> > >> >> wrote:
> > >> >> > As was mentioned sometime ago, our situation related to atomic
> > >> >> > ops is not ideal.
> > >> >> >
> > >> >> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
> > >> >> > provide full memory barriers, which is stronger than needed.
> > >> >> >
> > >> >> > Moreover, load is implemented as lock cmpxchg on var address, so
> > >> >> > it is additionally slower, especially when cpus compete.
> > >> >>
> > >> >> I already explained this once privately: full memory barriers are
> > >> >> not stronger than needed.
> > >> >> FreeBSD has a different semantic than Linux. We historically
> > >> >> enforce a full barrier on _acq() and _rel() rather than just a
> > >> >> read and write barrier, hence we need a different implementation
> > >> >> than Linux. There is code that relies on this property, like the
> > >> >> locking primitives (release a mutex, for instance).
> > >> >
> > >> > On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
> > >> > there are only full barriers. On both 32 and 64-bit ARMv8 ARM has
> > >> > added support for load-acquire and store-release atomic
> > >> > instructions. For the use in atomic instructions we can assume
> > >> > these only operate on the address passed to them.
> > >> >
> > >> > It is unlikely we will use them in the 32-bit port however I would
> > >> > like to know the expected semantics of these atomic functions to
> > >> > make sure we get them correct in the arm64 port. I have been
> > >> > advised by one of the ARM Linux kernel maintainers on the problems
> > >> > they have found using these instructions but have yet to determine
> > >> > what our atomic functions guarantee.
> > >>
> > >> For FreeBSD the "reference doc" is atomic(9).
> > >> It clearly states:
> > >
> > > There may also be a difference between what it states, how they are
> > > implemented, and what developers assume they do. I'm trying to make
> > > sure I get them correct.
> > 
> > atomic(9) is our reference so there might be no difference between
> > what it states and what all architectures implement.
> > I can say that x86 follows atomic(9) well. I'm not competent enough to
> > judge if all the !x86 arches follow it completely.
> > I can understand that developers may get confused. The FreeBSD scheme
> > is pretty unique. It comes from the fact that historically the membar
> > support was made to initially support x86. The super-widespread Linux
> > design, instead, tried to catch all architectures in its description.
> > It became very well known and I think it also "pushed" for companies
> > like Intel to invest in improving performance of things like explicit
> > read/write barriers, etc.
> 
> Actually, it was designed to support ia64 (and specifically the .acq and
> .rel modifiers on the ld, st, and cmpxchg instructions).  Some of the
> language is wrong (and is my fault) in that they are not "read" and
> "write" barriers.  They truly are "acquire" and "release".  That said,
> x86 has stronger barriers than that, partly because on i386 there wasn't
> a whole lot of options (though atomic_store_rel on even i386 should just
> be a simple store).
> 
> > >> The second variant of each operation includes a read memory barrier.
> > >> This barrier ensures that the effects of this operation are completed
> > >> before the effects of any later data accesses.  As a result, the
> > >> operation is said to have acquire semantics as it acquires a
> > >> pseudo-lock requiring further operations to wait until it has
> > >> completed.  To denote this, the suffix ``_acq'' is inserted into the
> > >> function name immediately prior to the ``_<type>'' suffix.  For
> > >> example, to subtract two integers ensuring that any later writes will
> > >> happen after the subtraction is performed, use
> > >> atomic_subtract_acq_int().
> > >
> > > It depends on the point we guarantee the acquire barrier to be. On ARMv8
> > > the function will be a load/modify/write sequence. If we use a
> > > load-acquire operation for atomic_subtract_acq_int, for example, for a
> > > pointer P and value to subtract X:
> > >
> > > loop:
> > >  load-acquire *P to N
> > >  perform N = N - X
> > >  store-exclusive N to *P
> > >  if the store failed goto loop
> > >
> > > where N and X are both registers.
> > >
> > > This will mean no access after this loop will happen before it, but
> > > they may happen within it, e.g. if there was a later access A the
> > > following may be possible:
> > >
> > > Load P
> > > Access A
> > > Store P
> > 
> > No, this will be broken in FreeBSD if "Access A" is later.
> > If "Access A" is prior the membar it doesn't really matter if it gets
> > interleaved with any of the operations in the atomic instruction.
> > Ideally, it could even surpass the Store P itself.
> > But if "Access A" is later (and you want to implement an _acq()
> > barrier) then it absolutely cannot get in the middle of the atomic_*
> > operation.
> 
> Eh, that isn't broken.  It is subtle however.  The reason it isn't broken
> is that if any access to P occurs after the 'load P', then the store will
> fail and the load-acquire will be retried; if A was accessed during the
> atomic op, the load-acquire during the retry will discard that and force A
> to be re-accessed.  If P is not accessed during the atomic op, then it is
> safe to access A during the atomic op itself.
> 

I'm not sure I completely agree with all of this. 

First, for 

        if any access to P occurs after the 'load P', then the store will
        fail and the load-acquire will be retried

The term 'access' needs to be changed to 'store'.  Other read accesses
to P will not cause the store-exclusive to fail.
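
To illustrate the distinction, here is a rough sketch of the retry loop in
portable C11 rather than ARM assembly.  The function name is hypothetical
and this is not FreeBSD's atomic_subtract_acq_int().  A compare-exchange
stands in for the ldrex/strex pair, which is only an approximation: it
compares values, while the exclusive monitor tracks writes to the address.
What the two share is that a concurrent read of *p never forces a retry;
only a competing store (or a spurious failure) does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper for illustration only.
 * Subtracts x from *p and returns the new value.
 */
static unsigned int
sketch_subtract_acq(atomic_uint *p, unsigned int x)
{
	unsigned int old, new;

	assert(p != NULL);
	old = atomic_load_explicit(p, memory_order_relaxed);
	do {
		new = old - x;
		/*
		 * On failure 'old' is refreshed with the current *p, so the
		 * next pass recomputes 'new'.  Other CPUs that merely read
		 * *p do not make this fail.
		 */
	} while (!atomic_compare_exchange_weak_explicit(p, &old, new,
	    memory_order_acquire, memory_order_relaxed));
	return (new);
}
```

Calling sketch_subtract_acq(&v, 3) on a counter holding 10 leaves 7 in the
counter and returns 7.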

Next, when we consider 'Access A' I'm not sure it's true that the access
will replay if the store-exclusive fails and the operation loops.  The
access to A may have been a prefetch, even a prefetch for data on a
predicted upcoming execution branch which may or may not end up being
taken.

I think the only thing that makes an ldrex/strex sequence safe for use
in implementing synchronization primitives is to insert a 'dmb' after
the acquire loop (after the strex succeeds), and 'dsb' before the
release loop (dsb is required for SMP, dmb might be good enough on UP).
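
The barrier placement described above can be sketched in portable C11.
This is an assumption-laden illustration, not the armv6/7 atomics
themselves: the lock type and function names are hypothetical, the
read-modify-write loop is deliberately relaxed, and the explicit fences
stand where the 'dmb' and 'dsb' would go.  C11 has no way to express the
dmb versus dsb distinction, so the sketch shows only where the barriers
sit, not which instruction is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock type: 0 means free, 1 means held. */
struct sketch_lock {
	atomic_uint held;
};

static void
sketch_acquire(struct sketch_lock *l)
{
	unsigned int expected;

	assert(l != NULL);
	do {
		expected = 0;	/* only take the lock when it is free */
	} while (!atomic_compare_exchange_weak_explicit(&l->held, &expected,
	    1, memory_order_relaxed, memory_order_relaxed));
	/* Barrier after the loop succeeds: later accesses stay after it. */
	atomic_thread_fence(memory_order_acquire);
}

static void
sketch_release(struct sketch_lock *l)
{
	assert(l != NULL);
	/* Barrier before the releasing store: earlier accesses stay before it. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&l->held, 0, memory_order_relaxed);
}
```

With the fences removed, nothing in the relaxed loop stops an access that
follows sketch_acquire() in program order from being performed before the
lock word is actually taken, which is the hazard discussed for 'Access A'.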

Looking into this has made me realize our current armv6/7 atomics are
incorrect in this regard.  Guess I'll see about fixing them up Real Soon
Now.  :)

-- Ian

> > > We know the store will happen as if it fails, e.g. another processor
> > > accesses *P, the store will have failed and will iterate over the loop.
> > >
> > > The other point is we can guarantee any store-release, and therefore
> > > any prior access, has happened before a later load-acquire even if it's
> > > on another processor.
> > 
> > No, we can never guarantee the visibility of the operations by other CPUs.
> > We only make guarantees about how the operations are posted on the system
> > bus (or how they are locally visible).
> > Keeping in mind that FreeBSD model comes from x86, you can sense that
> > some things are sized on the x86 model, which doesn't have any rule or
> > ordering on global visibility of the operations.
> 
> 1) Again, it's actually based on ia64.
> 
> 2) x86 _does_ have rules on ordering of global visibility in that most
>    stores (aside from some SSE special cases) will become visible in
>    program order.  Now, you can't force the _timing_ of when the stores
>    become visible (and this is true in general, in MI code you can't
>    assume that a barrier is equivalent to a cache flush).
> 
> 3) In this case I think Andrew is using "armv8" for "we" and you can
>    depend on architecture-specific semantics to determine the implementation
>    of atomic(9).
> 
