From owner-freebsd-arch@FreeBSD.ORG Wed Oct 29 16:33:38 2014
Date: Wed, 29 Oct 2014 17:33:35 +0100
From: Attilio Rao
Reply-To: attilio@FreeBSD.org
To: John Baldwin
Cc: Adrian Chadd, Mateusz Guzik, Alan Cox, Andrew Turner, Konstantin Belousov, "freebsd-arch@freebsd.org"
Subject: Re: atomic ops
In-Reply-To: <201410291059.16829.jhb@freebsd.org>
References: <20141028025222.GA19223@dft-labs.eu> <20141028175318.709d2ef6@bender.lan> <201410291059.16829.jhb@freebsd.org>
List-Id: Discussion related to FreeBSD architecture

On Wed, Oct 29, 2014 at 3:59 PM, John Baldwin wrote:
> On Tuesday, October 28, 2014 4:08:27 pm Attilio Rao wrote:
>> On Tue, Oct 28, 2014 at 6:53 PM, Andrew Turner wrote:
>> > On Tue, 28 Oct 2014 15:33:06 +0100
>> > Attilio Rao wrote:
>> >> On Tue, Oct 28, 2014 at 3:25 PM, Andrew Turner wrote:
>> >> > On Tue, 28 Oct 2014 14:18:41 +0100
>> >> > Attilio Rao wrote:
>> >> >
>> >> >> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik wrote:
>> >> >> > As was mentioned some time ago, our situation related to atomic
>> >> >> > ops is not ideal.
>> >> >> >
>> >> >> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
>> >> >> > provide full memory barriers, which is stronger than needed.
>> >> >> >
>> >> >> > Moreover, load is implemented as lock cmpxchg on var address, so
>> >> >> > it is additionally slower especially when cpus compete.
>> >> >>
>> >> >> I already explained this once privately: full memory barriers are
>> >> >> not stronger than needed.
>> >> >> FreeBSD has a different semantic than Linux. We historically
>> >> >> enforce a full barrier on _acq() and _rel() rather than just a
>> >> >> read and write barrier, hence we need a different implementation
>> >> >> than Linux. There is code that relies on this property, like the
>> >> >> locking primitives (release a mutex, for instance).
>> >> >
>> >> > On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
>> >> > there are only full barriers. On both 32 and 64-bit ARMv8 ARM has
>> >> > added support for load-acquire and store-release atomic
>> >> > instructions.
>> >> > For the use in atomic instructions we can assume
>> >> > these only operate on the address passed to them.
>> >> >
>> >> > It is unlikely we will use them in the 32-bit port; however, I would
>> >> > like to know the expected semantics of these atomic functions to
>> >> > make sure we get them correct in the arm64 port. I have been
>> >> > advised by one of the ARM Linux kernel maintainers on the problems
>> >> > they have found using these instructions but have yet to determine
>> >> > what our atomic functions guarantee.
>> >>
>> >> For FreeBSD the "reference doc" is atomic(9).
>> >> It clearly states:
>> >
>> > There may also be a difference between what it states, how they are
>> > implemented, and what developers assume they do. I'm trying to make
>> > sure I get them correct.
>>
>> atomic(9) is our reference so there should be no difference between
>> what it states and what all architectures implement.
>> I can say that x86 follows atomic(9) well. I'm not competent enough to
>> judge if all the !x86 arches follow it completely.
>> I can understand that developers may get confused. The FreeBSD scheme
>> is pretty unique. It comes from the fact that historically the membar
>> support was made to initially support x86. The super-widespread Linux
>> design, instead, tried to catch all architectures in its description.
>> It became very well known and I think it also "pushed" for companies
>> like Intel to invest in improving performance of things like explicit
>> read/write barriers, etc.
>
> Actually, it was designed to support ia64 (and specifically the .acq and
> .rel modifiers on the ld, st, and cmpxchg instructions). Some of the
> language is wrong (and is my fault) in that they are not "read" and
> "write" barriers. They truly are "acquire" and "release". That said,
> x86 has stronger barriers than that, partly because on i386 there wasn't
> a whole lot of options (though atomic_store_rel on even i386 should just
> be a simple store).
>
>> >> The second variant of each operation includes a read memory barrier.
>> >> This barrier ensures that the effects of this operation are completed
>> >> before the effects of any later data accesses. As a result, the
>> >> operation is said to have acquire semantics as it acquires a
>> >> pseudo-lock requiring further operations to wait until it has
>> >> completed. To denote this, the suffix ``_acq'' is inserted into the
>> >> function name immediately prior to the ``_<type>'' suffix. For
>> >> example, to subtract two integers ensuring that any later writes will
>> >> happen after the subtraction is performed, use
>> >> atomic_subtract_acq_int().
>> >
>> > It depends on the point we guarantee the acquire barrier to be. On ARMv8
>> > the function will be a load/modify/write sequence. If we use a
>> > load-acquire operation for atomic_subtract_acq_int, for example, for a
>> > pointer P and value to subtract X:
>> >
>> > loop:
>> >   load-acquire *P to N
>> >   perform N = N - X
>> >   store-exclusive N to *P
>> >   if the store failed goto loop
>> >
>> > where N and X are both registers.
>> >
>> > This will mean no access after this loop will happen before it, but
>> > they may happen within it, e.g. if there was a later access A the
>> > following may be possible:
>> >
>> > Load P
>> > Access A
>> > Store P
>>
>> No, this will be broken in FreeBSD if "Access A" is later.
>> If "Access A" is prior to the membar it doesn't really matter if it gets
>> interleaved with any of the operations in the atomic instruction.
>> Ideally, it could even surpass the Store P itself.
>> But if "Access A" is later (and you want to implement an _acq()
>> barrier) then it absolutely cannot get in the middle of the atomic_*
>> operation.
>
> Eh, that isn't broken. It is subtle however.
> The reason it isn't broken is that if any access to P occurs after the
> 'load P', then the store will fail and the load-acquire will be retried;
> if A was accessed during the atomic op, the load-acquire during the
> retry will discard that and force A to be re-accessed. If P is not
> accessed during the atomic op, then it is safe to access A during the
> atomic op itself.

This is specific to armv8, which I know 0 about. Good to know.
From a general point of view the description didn't seem ok.

>> > We know the store will happen as if it fails, e.g. another processor
>> > accesses *P, the store will have failed and will iterate over the loop.
>> >
>> > The other point is we can guarantee any store-release, and therefore
>> > any prior access, has happened before a later load-acquire even if it's
>> > on another processor.
>>
>> No, we can never make guarantees on the visibility of the operations by
>> other CPUs.
>> We just make guarantees on how the operations are posted on the system
>> bus (or how they are locally visible).
>> Keeping in mind that the FreeBSD model comes from x86, you can sense that
>> some things are sized on the x86 model, which doesn't have any rule or
>> ordering on global visibility of the operations.
>
> 1) Again, it's actually based on ia64.
>
> 2) x86 _does_ have rules on ordering of global visibility in that most
>    stores (aside from some SSE special cases) will become visible in
>    program order. Now, you can't force the _timing_ of when the stores
>    become visible (and this is true in general, in MI code you can't
>    assume that a barrier is equivalent to a cache flush).

Yes, this is what I mean. You can't have a guarantee on the global
timing of the memory accesses.

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein