Date:      Wed, 29 Oct 2014 10:59:16 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-arch@freebsd.org, attilio@freebsd.org
Cc:        Adrian Chadd <adrian@freebsd.org>, Mateusz Guzik <mjguzik@gmail.com>, Konstantin Belousov <kib@freebsd.org>, Andrew Turner <andrew@fubar.geek.nz>, Alan Cox <alc@rice.edu>
Subject:   Re: atomic ops
Message-ID:  <201410291059.16829.jhb@freebsd.org>
In-Reply-To: <CAJ-FndCsvLV_B3Q0boyK78980chM79hFf_dRyEqRtxzMJkpD5g@mail.gmail.com>
References:  <20141028025222.GA19223@dft-labs.eu> <20141028175318.709d2ef6@bender.lan> <CAJ-FndCsvLV_B3Q0boyK78980chM79hFf_dRyEqRtxzMJkpD5g@mail.gmail.com>

On Tuesday, October 28, 2014 4:08:27 pm Attilio Rao wrote:
> On Tue, Oct 28, 2014 at 6:53 PM, Andrew Turner <andrew@fubar.geek.nz> wrote:
> > On Tue, 28 Oct 2014 15:33:06 +0100
> > Attilio Rao <attilio@freebsd.org> wrote:
> >> On Tue, Oct 28, 2014 at 3:25 PM, Andrew Turner <andrew@fubar.geek.nz>
> >> wrote:
> >> > On Tue, 28 Oct 2014 14:18:41 +0100
> >> > Attilio Rao <attilio@freebsd.org> wrote:
> >> >
> >> >> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik@gmail.com>
> >> >> wrote:
> >> >> > As was mentioned sometime ago, our situation related to atomic
> >> >> > ops is not ideal.
> >> >> >
> >> >> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
> >> >> > provide full memory barriers, which is stronger than needed.
> >> >> >
> >> >> > Moreover, load is implemented as lock cmpxchg on var address, so
> >> >> > it is additionally slower, especially when cpus compete.
> >> >>
> >> >> I already explained this once privately: full memory barriers are
> >> >> not stronger than needed.
> >> >> FreeBSD has different semantics than Linux. We historically
> >> >> enforce a full barrier on _acq() and _rel() rather than just a
> >> >> read and write barrier, hence we need a different implementation
> >> >> than Linux. There is code that relies on this property, like the
> >> >> locking primitives (release a mutex, for instance).
> >> >
> >> > On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
> >> > there are only full barriers. On both 32 and 64-bit ARMv8 ARM has
> >> > added support for load-acquire and store-release atomic
> >> > instructions. For the use in atomic instructions we can assume
> >> > these only operate on the address passed to them.
> >> >
> >> > It is unlikely we will use them in the 32-bit port however I would
> >> > like to know the expected semantics of these atomic functions to
> >> > make sure we get them correct in the arm64 port. I have been
> >> > advised by one of the ARM Linux kernel maintainers on the problems
> >> > they have found using these instructions but have yet to determine
> >> > what our atomic functions guarantee.
> >>
> >> For FreeBSD the "reference doc" is atomic(9).
> >> It clearly states:
> >
> > There may also be a difference between what it states, how they are
> > implemented, and what developers assume they do. I'm trying to make
> > sure I get them correct.
> 
> atomic(9) is our reference so there should be no difference between
> what it states and what all architectures implement.
> I can say that x86 follows atomic(9) well. I'm not competent enough to
> judge if all the !x86 arches follow it completely.
> I can understand that developers may get confused. The FreeBSD scheme
> is pretty unique. It comes from the fact that historically the membar
> support was made to initially support x86. The super-widespread Linux
> design, instead, tried to catch all architectures in its description.
> It became very well known and I think it also "pushed" for companies
> like Intel to invest in improving performance of things like explicit
> read/write barriers, etc.

Actually, it was designed to support ia64 (and specifically the .acq and
.rel modifiers on the ld, st, and cmpxchg instructions).  Some of the
language is wrong (and is my fault) in that they are not "read" and
"write" barriers.  They truly are "acquire" and "release".  That said,
x86 has stronger barriers than that, partly because on i386 there weren't
a whole lot of options (though atomic_store_rel on even i386 should just
be a simple store).
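
To make the "acquire" and "release" wording concrete, here is a minimal
sketch written against C11 <stdatomic.h> rather than atomic(9) itself.
The mailbox structure and the mailbox_* function names are hypothetical,
made up only for this illustration.  The release store stands in for
something like atomic_store_rel_int() and the acquire load for
atomic_load_acq_int(); whether a given port implements them this way is
an assumption, not a statement about the tree.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-slot mailbox; not a FreeBSD structure. */
struct mailbox {
	int		payload;	/* plain data, published via 'ready' */
	atomic_int	ready;		/* 0 = empty, 1 = payload is valid */
};

/*
 * Release: every access before the store to 'ready' (here, the write
 * to 'payload') is ordered before that store.
 */
static void
mailbox_publish(struct mailbox *mb, int value)
{
	mb->payload = value;
	atomic_store_explicit(&mb->ready, 1, memory_order_release);
}

/*
 * Acquire: every access after the load of 'ready' (here, the read of
 * 'payload') is ordered after that load.
 */
static bool
mailbox_try_consume(struct mailbox *mb, int *out)
{
	if (atomic_load_explicit(&mb->ready, memory_order_acquire) == 0)
		return (false);
	*out = mb->payload;
	return (true);
}

/* Single-threaded usage example; it only exercises the API shape. */
static void
mailbox_selfcheck(void)
{
	struct mailbox mb = { .payload = 0 };
	int got = -1;

	atomic_init(&mb.ready, 0);
	assert(!mailbox_try_consume(&mb, &got));
	mailbox_publish(&mb, 42);
	assert(mailbox_try_consume(&mb, &got) && got == 42);
}
```

Neither side needs a full barrier for this pattern to work: the release
only constrains earlier accesses and the acquire only constrains later
ones, which is the distinction the "read"/"write" wording obscures.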

> >> The second variant of each operation includes a read memory barrier.
> >> This barrier ensures that the effects of this operation are completed
> >> before the effects of any later data accesses.  As a result, the
> >> operation is said to have acquire semantics as it acquires a
> >> pseudo-lock requiring further operations to wait until it has
> >> completed.  To denote this, the suffix ``_acq'' is inserted into the
> >> function name immediately prior to the ``_<type>'' suffix.  For
> >> example, to subtract two integers ensuring that any later writes will
> >> happen after the subtraction is performed, use
> >> atomic_subtract_acq_int().
> >
> > It depends on the point we guarantee the acquire barrier to be. On ARMv8
> > the function will be a load/modify/write sequence. If we use a
> > load-acquire operation for atomic_subtract_acq_int, for example, for a
> > pointer P and value to subtract X:
> >
> > loop:
> >  load-acquire *P to N
> >  perform N = N - X
> >  store-exclusive N to *P
> >  if the store failed goto loop
> >
> > where N and X are both registers.
> >
> > This will mean no access after this loop will happen before it, but
> > they may happen within it, e.g. if there was a later access A the
> > following may be possible:
> >
> > Load P
> > Access A
> > Store P
> 
> No, this will be broken in FreeBSD if "Access A" is later.
> If "Access A" is prior to the membar it doesn't really matter if it gets
> interleaved with any of the operations in the atomic instruction.
> Ideally, it could even surpass the Store P itself.
> But if "Access A" is later (and you want to implement an _acq()
> barrier) then it absolutely cannot get in the middle of the atomic_*
> operation.

Eh, that isn't broken.  It is subtle, however.  The reason it isn't broken
is that if any access to P occurs after the 'load P', then the store will
fail and the load-acquire will be retried.  If A was accessed during the
atomic op, the load-acquire during the retry will discard that and force A
to be re-accessed.  If P is not accessed during the atomic op, then it is
safe to access A during the atomic op itself.
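
For reference, the retry structure can be sketched in portable C11 as a
compare-and-swap loop.  This is only an illustration under stated
assumptions: subtract_acq() is a hypothetical helper, not the atomic(9)
function (atomic_subtract_acq_int() returns nothing, whereas this sketch
returns the old value so it can be checked).  On ARMv8 a compiler may
lower such a loop to a load-acquire-exclusive/store-exclusive pair like
Andrew's pseudocode, but that lowering is up to the compiler and target.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical helper: atomically perform *p -= x with acquire
 * semantics and return the value *p held before the subtraction.
 */
static unsigned int
subtract_acq(atomic_uint *p, unsigned int x)
{
	unsigned int old;

	old = atomic_load_explicit(p, memory_order_relaxed);
	/*
	 * If *p no longer equals 'old' (or the weak exchange fails
	 * spuriously), nothing is stored, 'old' is refreshed with the
	 * current value, and the loop retries.  The acquire ordering
	 * applies to the attempt that succeeds.
	 */
	while (!atomic_compare_exchange_weak_explicit(p, &old, old - x,
	    memory_order_acquire, memory_order_relaxed))
		;
	return (old);
}

/* Single-threaded usage example. */
static void
subtract_acq_selfcheck(void)
{
	atomic_uint v;

	atomic_init(&v, 10);
	assert(subtract_acq(&v, 3) == 10);
	assert(atomic_load(&v) == 7);
}
```

The point of the loop is the same as in the argument above: an attempt
that loses a race with another update to P never commits its store, so
only the successful attempt, with its acquire, determines what later
accesses observe.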

> > We know the store will happen: if it fails, e.g. because another
> > processor accesses *P, the loop will simply iterate again.
> >
> > The other point is we can guarantee any store-release, and therefore
> > any prior access, has happened before a later load-acquire even if it's
> > on another processor.
> 
> No, we can never guarantee the visibility of the operations to other CPUs.
> We only make guarantees about how the operations are posted on the system
> bus (or how they are locally visible).
> Keeping in mind that the FreeBSD model comes from x86, you can sense that
> some things are sized on the x86 model, which doesn't have any rule or
> ordering on global visibility of the operations.

1) Again, it's actually based on ia64.

2) x86 _does_ have rules on ordering of global visibility in that most
   stores (aside from some SSE special cases) will become visible in
   program order.  Now, you can't force the _timing_ of when the stores
   become visible (and this is true in general, in MI code you can't
   assume that a barrier is equivalent to a cache flush).

3) In this case I think Andrew's "we" means "armv8", and you can
   depend on architecture-specific semantics when deciding how to
   implement atomic(9).

-- 
John Baldwin


