From owner-freebsd-arch@FreeBSD.ORG  Wed Oct 29 16:33:38 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 88151130;
 Wed, 29 Oct 2014 16:33:38 +0000 (UTC)
Received: from mail-wi0-x22e.google.com (mail-wi0-x22e.google.com
 [IPv6:2a00:1450:400c:c05::22e])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id A27DD1F1;
 Wed, 29 Oct 2014 16:33:37 +0000 (UTC)
Received: by mail-wi0-f174.google.com with SMTP id d1so2217340wiv.7
 for <multiple recipients>; Wed, 29 Oct 2014 09:33:36 -0700 (PDT)
MIME-Version: 1.0
X-Received: by 10.180.19.234 with SMTP id i10mr7995696wie.28.1414600415661;
 Wed, 29 Oct 2014 09:33:35 -0700 (PDT)
Reply-To: attilio@FreeBSD.org
Sender: asmrookie@gmail.com
Received: by 10.217.69.73 with HTTP; Wed, 29 Oct 2014 09:33:35 -0700 (PDT)
In-Reply-To: <201410291059.16829.jhb@freebsd.org>
References: <20141028025222.GA19223@dft-labs.eu>
 <20141028175318.709d2ef6@bender.lan>
 <CAJ-FndCsvLV_B3Q0boyK78980chM79hFf_dRyEqRtxzMJkpD5g@mail.gmail.com>
 <201410291059.16829.jhb@freebsd.org>
Date: Wed, 29 Oct 2014 17:33:35 +0100
X-Google-Sender-Auth: wInE1xvvT49TWCYSJ5g93hdZTYc
Message-ID: <CAJ-FndAxOuA4faFfUUbXkO7aLxNh_EKm6sZ65NE9EnU903GEOQ@mail.gmail.com>
Subject: Re: atomic ops
From: Attilio Rao <attilio@freebsd.org>
To: John Baldwin <jhb@freebsd.org>
Content-Type: text/plain; charset=UTF-8
Cc: Adrian Chadd <adrian@freebsd.org>, Mateusz Guzik <mjguzik@gmail.com>,
 Alan Cox <alc@rice.edu>, Andrew Turner <andrew@fubar.geek.nz>,
 Konstantin Belousov <kib@freebsd.org>,
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>

On Wed, Oct 29, 2014 at 3:59 PM, John Baldwin <jhb@freebsd.org> wrote:
> On Tuesday, October 28, 2014 4:08:27 pm Attilio Rao wrote:
>> On Tue, Oct 28, 2014 at 6:53 PM, Andrew Turner <andrew@fubar.geek.nz> wrote:
>> > On Tue, 28 Oct 2014 15:33:06 +0100
>> > Attilio Rao <attilio@freebsd.org> wrote:
>> >> On Tue, Oct 28, 2014 at 3:25 PM, Andrew Turner <andrew@fubar.geek.nz>
>> >> wrote:
>> >> > On Tue, 28 Oct 2014 14:18:41 +0100
>> >> > Attilio Rao <attilio@freebsd.org> wrote:
>> >> >
>> >> >> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik@gmail.com>
>> >> >> wrote:
>> >> >> > As was mentioned some time ago, our situation related to atomic
>> >> >> > ops is not ideal.
>> >> >> >
>> >> >> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
>> >> >> > provide full memory barriers, which is stronger than needed.
>> >> >> >
>> >> >> > Moreover, load is implemented as lock cmpxchg on var address, so
>> >> >> > it is additionally slower, especially when CPUs compete.
>> >> >>
>> >> >> I already explained this once privately: full memory barriers are
>> >> >> not stronger than needed.
>> >> >> FreeBSD has different semantics than Linux. We historically
>> >> >> enforce a full barrier on _acq() and _rel() rather than just a
>> >> >> read and write barrier, hence we need a different implementation
>> >> >> than Linux. There is code that relies on this property, like the
>> >> >> locking primitives (release a mutex, for instance).
>> >> >
>> >> > On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
>> >> > there are only full barriers. On both 32 and 64-bit ARMv8 ARM has
>> >> > added support for load-acquire and store-release atomic
>> >> > instructions. For the use in atomic instructions we can assume
>> >> > these only operate on the address passed to them.
>> >> >
>> >> > It is unlikely we will use them in the 32-bit port however I would
>> >> > like to know the expected semantics of these atomic functions to
>> >> > make sure we get them correct in the arm64 port. I have been
>> >> > advised by one of the ARM Linux kernel maintainers on the problems
>> >> > they have found using these instructions but have yet to determine
>> >> > what our atomic functions guarantee.
>> >>
>> >> For FreeBSD the "reference doc" is atomic(9).
>> >> It clearly states:
>> >
>> > There may also be a difference between what it states, how they are
>> > implemented, and what developers assume they do. I'm trying to make
>> > sure I get them correct.
>>
>> atomic(9) is our reference, so there should be no difference between
>> what it states and what all architectures implement.
>> I can say that x86 follows atomic(9) well. I'm not competent enough to
>> judge if all the !x86 arches follow it completely.
>> I can understand that developers may get confused. The FreeBSD scheme
>> is pretty unique. It comes from the fact that historically the membar
>> support was initially made to support x86. The super-widespread Linux
>> design, instead, tried to cover all architectures in its description.
>> It became very well known and I think it also "pushed" companies
>> like Intel to invest in improving performance of things like explicit
>> read/write barriers, etc.
>
> Actually, it was designed to support ia64 (and specifically the .acq and
> .rel modifiers on the ld, st, and cmpxchg instructions).  Some of the
> language is wrong (and is my fault) in that they are not "read" and
> "write" barriers.  They truly are "acquire" and "release".  That said,
> x86 has stronger barriers than that, partly because on i386 there weren't
> a whole lot of options (though atomic_store_rel on even i386 should just
> be a simple store).
>
>> >> The second variant of each operation includes a read memory barrier.
>> >> This barrier ensures that the effects of this operation are completed
>> >> before the effects of any later data accesses.  As a result, the
>> >> operation is said to have acquire semantics as it acquires a
>> >> pseudo-lock requiring further operations to wait until it has
>> >> completed.  To denote this, the suffix ``_acq'' is inserted into the
>> >> function name immediately prior to the ``_<type>'' suffix.  For
>> >> example, to subtract two integers ensuring that any later writes will
>> >> happen after the subtraction is performed, use
>> >> atomic_subtract_acq_int().
>> >
>> > It depends on where we guarantee the acquire barrier to be. On ARMv8
>> > the function will be a load/modify/write sequence. If we use a
>> > load-acquire operation for atomic_subtract_acq_int, for example, for a
>> > pointer P and value to subtract X:
>> >
>> > loop:
>> >  load-acquire *P to N
>> >  perform N = N - X
>> >  store-exclusive N to *P
>> >  if the store failed goto loop
>> >
>> > where N and X are both registers.
>> >
>> > This will mean no access after this loop will happen before it, but
>> > they may happen within it, e.g. if there was a later access A the
>> > following may be possible:
>> >
>> > Load P
>> > Access A
>> > Store P
>>
>> No, this will be broken in FreeBSD if "Access A" is later.
>> If "Access A" is prior to the membar it doesn't really matter if it gets
>> interleaved with any of the operations in the atomic instruction.
>> Ideally, it could even surpass the Store P itself.
>> But if "Access A" is later (and you want to implement an _acq()
>> barrier) then it absolutely cannot get in the middle of the atomic_*
>> operation.
>
> Eh, that isn't broken.  It is subtle however.  The reason it isn't broken
> is that if any access to P occurs after the 'load P', then the store will
> fail and the load-acquire will be retried.  If A was accessed during the
> atomic op, the load-acquire during the retry will discard that and force A
> to be re-accessed.  If P is not accessed during the atomic op, then it is
> safe to access A during the atomic op itself.

This is specific to ARMv8, which I know nothing about. Good to know.
The description just didn't seem OK from a general point of view.
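
For reference, here is a minimal C11 sketch of the retry loop being
discussed. It is only an illustration in portable <stdatomic.h> terms, not
the FreeBSD or arm64 implementation: the name sketch_subtract_acq_int() is
made up, and how a compiler lowers the compare-exchange (an exclusive
load/store pair, a single CAS instruction, or something else) depends on
the target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for atomic_subtract_acq_int(); the name and the
 * C11 mapping are assumptions made for illustration only.
 */
static void
sketch_subtract_acq_int(atomic_uint *p, unsigned int x)
{
	unsigned int n;

	assert(p != NULL);
	n = atomic_load_explicit(p, memory_order_relaxed);
	/*
	 * If *p changed since it was read, the exchange fails, n is
	 * refreshed with the current value and the subtraction is redone.
	 * Acquire ordering applies only to the successful exchange.
	 */
	while (!atomic_compare_exchange_weak_explicit(p, &n, n - x,
	    memory_order_acquire, memory_order_relaxed))
		;
}
```

Calling sketch_subtract_acq_int(&v, 3) on an atomic_uint v holding 10
leaves 7 in v. Note that C11 acquire only orders later accesses after the
successful read-modify-write; whether that is as strong as what atomic(9)
promises is exactly the question in this thread.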

>> > We know the store will happen because if it fails, e.g. another
>> > processor accesses *P, the loop will iterate and retry it.
>> >
>> > The other point is we can guarantee any store-release, and therefore
>> > any prior access, has happened before a later load-acquire even if it's
>> > on another processor.
>>
>> No, we can never guarantee the visibility of the operations to other CPUs.
>> We just make guarantees on how the operations are posted on the system
>> bus (or how they are locally visible).
>> Keeping in mind that the FreeBSD model comes from x86, you can sense that
>> some things are sized on the x86 model, which doesn't have any rule or
>> ordering on global visibility of the operations.
>
> 1) Again, it's actually based on ia64.
>
> 2) x86 _does_ have rules on ordering of global visibility in that most
>    stores (aside from some SSE special cases) will become visible in
>    program order.  Now, you can't force the _timing_ of when the stores
>    become visible (and this is true in general, in MI code you can't
>    assume that a barrier is equivalent to a cache flush).

Yes, this is what I meant. You can't have any guarantee on the global
timing of the memory accesses.
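
To illustrate the ordering-versus-timing distinction, here is a small
userland sketch using C11 atomics and pthreads. Everything in it (payload,
ready, producer(), sketch_message_passing()) is a hypothetical name chosen
for the example, and the C11 release/acquire pair is only a loose analogue
of atomic_store_rel_int()/atomic_load_acq_int(), not a statement about how
those are implemented.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;		/* plain data, published through 'ready' */
static atomic_int ready;	/* static storage, so it starts at 0 */

static void *
producer(void *arg)
{
	(void)arg;
	payload = 42;
	/* Release: the payload store is ordered before the flag store. */
	atomic_store_explicit(&ready, 1, memory_order_release);
	return (NULL);
}

/* Returns the payload value observed by the consuming thread. */
static int
sketch_message_passing(void)
{
	pthread_t td;
	int error, seen;

	error = pthread_create(&td, NULL, producer, NULL);
	assert(error == 0);
	(void)error;
	/*
	 * The barrier says nothing about when the flag becomes visible,
	 * so the consumer has to poll. Once it does observe 1, the
	 * acquire load guarantees the payload store is visible too.
	 */
	while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
		;
	seen = payload;
	pthread_join(td, NULL);
	return (seen);
}
```

sketch_message_passing() returns 42 however long the polling loop spins:
the order of the two stores is guaranteed, the moment they become visible
to the other thread is not.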

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein