From: Mateusz Guzik
To: alc@freebsd.org
Cc: Konstantin Belousov, attilio@freebsd.org, Johan Schuijt, "freebsd-arch@freebsd.org"
Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory barriers.
Date: Sat, 4 Oct 2014 07:28:51 +0200
Message-ID: <20141004052851.GA27891@dft-labs.eu>

Reviving. Sorry everyone for such a big delay, $life.

On Tue, Aug 19, 2014 at 02:24:16PM -0500, Alan Cox wrote:
> On Sat, Aug 16, 2014 at 8:26 PM, Mateusz Guzik wrote:
> > Well, my memory-barrier-and-so-on-fu is rather weak.
> >
> > I had another look at the issue. At least on amd64, it looks like only
> > a compiler barrier is required for both reads and writes.
> >
> > According to the AMD64 Architecture Programmer’s Manual Volume 2: System
> > Programming, section 7.2 "Multiprocessor Memory Access Ordering":
> >
> > "Loads do not pass previous loads (loads are not reordered). Stores do
> > not pass previous stores (stores are not reordered)"
> >
> > Since the code modifying the data only performs a series of writes and
> > we expect exclusive writers, I find it applicable to this scenario.
> >
> > I checked linux sources and generated assembly, they indeed issue only
> > a compiler barrier on amd64 (and for intel processors as well).
> >
> > atomic_store_rel_int on amd64 seems fine in this regard, but the only
> > function for loads issues lock cmpxchg, which kills performance
> > (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
> >
> > Additionally, release and acquire semantics seem to be a stronger
> > guarantee than needed.
> >
>
> This statement left me puzzled and got me to look at our x86 atomic.h for
> the first time in years.  It appears that our implementation of
> atomic_load_acq_int() on x86 is, umm ..., unconventional.  That is, it is
> enforcing a constraint that simple acquire loads don't normally enforce.
> For example, the C11 stdatomic.h simple acquire load doesn't enforce this
> constraint.  Moreover, our own implementation of atomic_load_acq_int() on
> ia64, where the mapping from atomic_load_acq_int() to machine instructions
> is straightforward, doesn't enforce this constraint either.
>

By 'this constraint' I presume you mean the full memory barrier.

It is unclear to me whether we can just get rid of it at this point. It
definitely would be beneficial.

In the meantime, if for some reason the full barrier is still needed, we
can speed up concurrent load_acq of the same variable considerably: there
is no need to lock cmpxchg on the same address. We should be able to
replace it with, more or less:

	lock add $0,(%rsp);
	movl ...;

I believe it is possible that the CPU will perform some writes before
doing the read listed here, but this should be fine.

If this is considered too risky to hit 10.1, I would like to implement it
within seq as a temporary hack to be fixed up later. Something along the
lines of:

static inline int
atomic_load_acq_rmb(volatile u_int *p)
{
	u_int v;

	v = *p;
	atomic_load_acq_int(&v);
	return (v);
}

This hack fixes the aforementioned performance degradation and covers all
architectures.

> Give us a chance to sort this out before you do anything further.  As
> Kostik said, but in different words, we've always written our
> machine-independent layer code using acquires and releases to express the
> required ordering constraints and not {r,w}mb() primitives.
>

-- 
Mateusz Guzik
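
For readers following along, here is a minimal sketch of the kind of
sequence-counter protocol being discussed, assuming a single exclusive
writer; the seq_* names and the compiler_barrier() macro are made up for
illustration and are not the interfaces from the patch. Per the AMD64
ordering rules quoted above, plain compiler barriers could be enough on
amd64, while other architectures would need real memory barriers in the
same places.

/* Hypothetical sequence-counter sketch; names are illustrative only. */
typedef volatile unsigned int seq_t;

#define	compiler_barrier()	__asm__ __volatile__("" ::: "memory")

static inline void
seq_write_begin(seq_t *sp)
{
	(*sp)++;		/* odd value: modification in progress */
	compiler_barrier();	/* keep data stores after the counter store */
}

static inline void
seq_write_end(seq_t *sp)
{
	compiler_barrier();	/* keep data stores before the counter store */
	(*sp)++;		/* even value: modification complete */
}

static inline unsigned int
seq_read_begin(seq_t *sp)
{
	unsigned int v;

	v = *sp;
	compiler_barrier();	/* keep data loads after the counter load */
	return (v);
}

static inline int
seq_read_retry(seq_t *sp, unsigned int v)
{
	compiler_barrier();	/* keep data loads before the re-check */
	return (*sp != v || (v & 1) != 0);
}

/*
 * Reader usage: retry until a consistent snapshot is observed.
 *
 *	do {
 *		v = seq_read_begin(&obj_seq);
 *		copy = obj;
 *	} while (seq_read_retry(&obj_seq, v));
 */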
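
And a hypothetical, amd64-only sketch of the "lock add $0,(%rsp)" idea
mentioned above, written as GCC/clang inline assembly with a made-up
function name: the locked read-modify-write of the caller's own stack slot
provides the full fence, without readers bouncing the cache line of the
loaded address between CPUs the way lock cmpxchg on that address does.

static inline unsigned int
load_acq_via_stack_fence(volatile unsigned int *p)
{
	/* Full fence via a locked no-op RMW on our own stack slot. */
	__asm__ __volatile__("lock; addl $0,(%%rsp)" : : : "memory", "cc");
	return (*p);	/* plain load; cannot be satisfied before the fence */
}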