From owner-freebsd-arch@FreeBSD.ORG  Sat Oct  4 09:37:19 2014
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 6310F2AE;
 Sat,  4 Oct 2014 09:37:19 +0000 (UTC)
Received: from mail-wi0-x230.google.com (mail-wi0-x230.google.com
 [IPv6:2a00:1450:400c:c05::230])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id A4087AF7;
 Sat,  4 Oct 2014 09:37:18 +0000 (UTC)
Received: by mail-wi0-f176.google.com with SMTP id hi2so776038wib.15
 for <multiple recipients>; Sat, 04 Oct 2014 02:37:17 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:reply-to:sender:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type:content-transfer-encoding;
 bh=Ii9f544oQnFb0MX0RpKEem1vRD7GJGrztIE5zxlDeKE=;
 b=OK3Qa9ia9wRiZMYGDre9tA64z4ipk9qqF2qAw/6R6YQjusLQvCSQq2yPc/wXZNsiNp
 6DzJpfxfeFVg/W9l3NrTKxwWRigxLlEzarkGZc9WTtJgUxIuc8QxZveJ29tcNShKDSNT
 jO6uk3K8gmafwC9ZvVpVfXW0TkUJA39MB0rHdG+o/4YIZLBz3UcPckr1d23TNXByliNu
 yYJBU3JwgY3O4MFVDapabrveVCHTTruoFfR+oLXq7sFo0ZutcEmz/DYJ3ZV1FEQXXooQ
 xQ1zgH9VoAJ6dm2ni1G1WJdTTarPXfBFff4ArFNwG8LHLWJrWU8DhjQB49XaDeTVp1+1
 aSxQ==
MIME-Version: 1.0
X-Received: by 10.194.223.2 with SMTP id qq2mr1680120wjc.122.1412415437023;
 Sat, 04 Oct 2014 02:37:17 -0700 (PDT)
Reply-To: attilio@FreeBSD.org
Sender: asmrookie@gmail.com
Received: by 10.217.39.135 with HTTP; Sat, 4 Oct 2014 02:37:16 -0700 (PDT)
In-Reply-To: <20141004052851.GA27891@dft-labs.eu>
References: <1408064112-573-1-git-send-email-mjguzik@gmail.com>
 <1408064112-573-2-git-send-email-mjguzik@gmail.com>
 <20140816093811.GX2737@kib.kiev.ua>
 <20140816185406.GD2737@kib.kiev.ua>
 <20140817012646.GA21025@dft-labs.eu>
 <CAJUyCcPA7ZDNbwyfx3fT7mq3SE7M-mL5he=eXZ8bY3z-xUCJ-g@mail.gmail.com>
 <20141004052851.GA27891@dft-labs.eu>
Date: Sat, 4 Oct 2014 11:37:16 +0200
X-Google-Sender-Auth: VALReNx5CyBY7rx0KgSeiYj7VNc
Message-ID: <CAJ-FndAHawNWC+Yh2BRtmg4e-f3dUdRVonScwfaADABqWuF3Tg@mail.gmail.com>
Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory
 barriers.
From: Attilio Rao <attilio@freebsd.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: Alan Cox <alc@freebsd.org>, Konstantin Belousov <kostikbel@gmail.com>,
 Johan Schuijt <johan@transip.nl>,
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 04 Oct 2014 09:37:19 -0000

On Sat, Oct 4, 2014 at 7:28 AM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> Reviving. Sorry everyone for such a big delay, $life.
>
> On Tue, Aug 19, 2014 at 02:24:16PM -0500, Alan Cox wrote:
>> On Sat, Aug 16, 2014 at 8:26 PM, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> > Well, my memory-barrier-and-so-on-fu is rather weak.
>> >
>> > I had another look at the issue. At least on amd64, it looks like only
>> > a compiler barrier is required for both reads and writes.
>> >
>> > The AMD64 Architecture Programmer’s Manual Volume 2: System
>> > Programming, section 7.2 Multiprocessor Memory Access Ordering, states:
>> >
>> > "Loads do not pass previous loads (loads are not reordered). Stores do
>> > not pass previous stores (stores are not reordered)"
>> >
>> > Since the modifying code only performs a series of writes and we
>> > expect exclusive writers, I find this applicable to our scenario.
>> >
>> > I checked the Linux sources and the generated assembly; they indeed
>> > issue only a compiler barrier on amd64 (for Intel processors as well).
>> >
>> > atomic_store_rel_int on amd64 seems fine in this regard, but the only
>> > function for loads issues a lock cmpxchg, which kills performance
>> > (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
>> >
>> > Additionally, release and acquire semantics seem to be a stronger
>> > guarantee than needed.
>> >
>> >
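
[For concreteness, the scheme described above — an exclusive writer
bumping a counter around its writes, readers retrying on an odd or
changed value, with compiler barriers only on amd64's TSO model — might
be sketched roughly as follows. Names are illustrative, not the actual
patch's API.]

```c
#include <stdint.h>

/* Compiler-only barrier: sufficient here on amd64, where TSO already
 * keeps loads ordered against loads and stores against stores. */
#define	compiler_barrier()	__asm __volatile("" ::: "memory")

typedef volatile uint32_t seqc_t;

/* Writer side; callers are assumed to be exclusive (e.g. hold a lock). */
static inline void
seqc_write_begin(seqc_t *seqcp)
{
	(*seqcp)++;			/* odd: modification in progress */
	compiler_barrier();
}

static inline void
seqc_write_end(seqc_t *seqcp)
{
	compiler_barrier();
	(*seqcp)++;			/* even again: modification done */
}

/* Reader side: snapshot an even counter value before reading the data. */
static inline uint32_t
seqc_read(seqc_t *seqcp)
{
	uint32_t ret;

	for (;;) {
		ret = *seqcp;
		if ((ret & 1) == 0)	/* retry while a write is in flight */
			break;
	}
	compiler_barrier();
	return (ret);
}

/* After reading the data: did the counter stay put? */
static inline int
seqc_consistent(seqc_t *seqcp, uint32_t oldseqc)
{
	compiler_barrier();
	return (*seqcp == oldseqc);
}
```

Readers loop on seqc_read(), copy the protected data out, then call
seqc_consistent() and restart if it fails.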
>>
>> This statement left me puzzled and got me to look at our x86 atomic.h for
>> the first time in years.  It appears that our implementation of
>> atomic_load_acq_int() on x86 is, umm ..., unconventional.  That is, it is
>> enforcing a constraint that simple acquire loads don't normally enforce.
>> For example, the C11 stdatomic.h simple acquire load doesn't enforce this
>> constraint.  Moreover, our own implementation of atomic_load_acq_int() on
>> ia64, where the mapping from atomic_load_acq_int() to machine instructions
>> is straightforward, doesn't enforce this constraint either.
>>
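
[For comparison, a conventional C11 acquire load — the semantics Alan
refers to — can be written as below; this is an illustrative sketch,
not FreeBSD's atomic.h.]

```c
#include <stdatomic.h>
#include <stdint.h>

/* On x86/amd64 this compiles to a plain movl plus a compiler barrier:
 * TSO already forbids load-load and load-store reordering, so no lock
 * prefix and no full StoreLoad fence is emitted. */
static inline uint32_t
load_acq_c11(_Atomic uint32_t *p)
{
	return (atomic_load_explicit(p, memory_order_acquire));
}
```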
>
> By 'this constraint' I presume you mean a full memory barrier.
>
> It is unclear to me if one can just get rid of it currently. It
> definitely would be beneficial.
>
> In the meantime, if for some reason a full barrier is still needed, we
> can speed up concurrent load_acq of the same var considerably. There is
> no need to lock cmpxchg on the same address. We should be able to
> replace it with, roughly:
> lock add $0,(%rsp);
> movl ...;

When I looked into some AMD manual (I think the same one that reports
using lock add $0,(%rsp)) I recall that the (reported) combined
instruction latency of "lock add" + "movl" is higher than that of the
single "cmpxchg".
Moreover, I think that the simple movl is going to lock the cache-line
anyway, so I doubt the "lock add" is going to provide any benefit. The
only benefit I can think of is that this trick would let us use _acq()
barriers on read-only memory (which is not possible today, as the
timecounters code can testify).

Whether the latencies for "lock add" + "movl" have changed in the
latest Intel processors I can't say for sure; it may be worth looking
into.
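
For concreteness, the two flavors of load being compared might be
sketched as below. This is illustrative, amd64-only inline assembly,
not the actual atomic.h code.

```c
#include <stdint.h>

/* What amd64 atomic_load_acq_int() effectively does today: a locked
 * cmpxchg on the target address itself, which dirties and bounces the
 * cache line even for pure readers. */
static inline uint32_t
load_acq_cmpxchg(volatile uint32_t *p)
{
	uint32_t res = 0;

	__asm __volatile(
	    "lock; cmpxchgl %0,%1"	/* on mismatch, *p lands in %eax */
	    : "+a" (res), "+m" (*p)
	    :
	    : "memory", "cc");
	return (res);
}

/* The suggested variant: take the full barrier on the caller's own
 * stack slot, then do a plain load of *p — no write traffic on *p. */
static inline uint32_t
load_acq_lockadd(volatile uint32_t *p)
{
	__asm __volatile("lock; addl $0,(%%rsp)" ::: "memory", "cc");
	return (*p);
}
```

The performance difference in the microbenchmark comes from the write
traffic: concurrent readers using cmpxchg keep pulling the line in
exclusive state, while the lock add variant fences on the reader's own
stack.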

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein