From owner-freebsd-arch@FreeBSD.ORG Wed Oct 29 19:05:07 2014
Date: Wed, 29 Oct 2014 20:04:59 +0100
From: Mateusz Guzik
To: Attilio Rao
Cc: Adrian Chadd, Alan Cox, Konstantin Belousov, "freebsd-arch@freebsd.org"
Subject: Re: atomic ops
Message-ID: <20141029190459.GA25368@dft-labs.eu>
References: <20141028025222.GA19223@dft-labs.eu>

On Tue, Oct 28, 2014 at 02:18:41PM +0100, Attilio Rao wrote:
> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik wrote:
> > As was mentioned some time ago, our situation related to atomic ops is
> > not ideal.
> >
> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
> > full memory barriers, which is stronger than needed.
> >
> > Moreover, load is implemented as lock cmpxchg on the variable's
> > address, so it is additionally slower, especially when cpus compete.
>
> I already explained this once privately: full memory barriers are not
> stronger than needed.
> FreeBSD has different semantics than Linux. We historically enforce a
> full barrier on _acq() and _rel() rather than just a read or write
> barrier, hence we need a different implementation than Linux.
> There is code that relies on this property, like the locking
> primitives (releasing a mutex, for instance).
>

I mean stronger than needed in some cases; a popular one is
fget_unlocked, and we provide no "lightest sufficient" barrier (which
would also be cheaper). Another case which would benefit greatly is
sys/sys/seq.h.
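For illustration, roughly what I have in mind for a cheaper
load-acquire on amd64 (a sketch only, untested; the _light name is made
up, and __compiler_membar() is assumed to be the compiler-only barrier
from amd64's atomic.h). amd64 does not reorder loads with later loads
or stores, so a plain load followed by a compiler barrier already gives
acquire semantics; no lock cmpxchg on the variable is needed:

	/*
	 * Hypothetical lighter load-acquire for amd64.  The plain load
	 * avoids dirtying the cache line the way lock cmpxchg does,
	 * while the compiler barrier stops the compiler from moving
	 * later memory accesses before the load; the CPU already keeps
	 * them ordered on amd64.
	 */
	static __inline u_int
	atomic_load_acq_light_int(volatile u_int *p)
	{
		u_int v;

		v = *p;
		__compiler_membar();
		return (v);
	}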
As noted in some other thread, using load_acq as it is destroys
performance. I don't dispute the need for full barriers, although it is
unclear which of the current consumers of load_acq actually need a full
barrier.

> In short: optimizing the implementation for performance is fine and
> due. Changing the semantics is not fine, unless you have reviewed and
> fixed all the uses of _rel() and _acq().
>
> > On amd64 it is sufficient to place a compiler barrier in such cases.
> >
> > Next, we lack some atomic ops in the first place.
> >
> > Let's define some useful terms:
> > smp_wmb - no writes can be reordered past this point
> > smp_rmb - no reads can be reordered past this point
> >
> > With this in mind, we lack ops which would guarantee only the following:
> >
> > 1. var = tmp; smp_wmb();
> > 2. tmp = var; smp_rmb();
> > 3. smp_rmb(); tmp = var;
> >
> > This matters since what we can already use to emulate these is way
> > heavier than needed on the aforementioned amd64 and most likely on
> > other archs.
>
> I can see the value of such barriers in case you want to just
> synchronize operations with regard to reads or writes.
> I also believe that on the newest Intel processors (for which we should
> optimize) rmb() and wmb() got significantly faster than mb(). However,
> the most interesting cases would be arm and mips, I assume. That's
> where you would see a bigger perf difference if you optimized the
> membar paths.
>
> Last time I looked into it, in the FreeBSD kernel the Linux-ish
> rmb()/wmb()/etc. were used primarily in 3 places: Linux-derived code,
> handling of 16-bit operands and the implementation of "faster" bus
> barriers.
> Initially I had thought about just confining the smp_*() ops to a
> Linux compat layer and fixing the other 2 this way: for 16-bit
> operands, just pad to 32 bits, as the C11 standard also does. For the
> bus barriers, just grow more versions to actually include the
> rmb()/wmb() scheme within.
>
> At this point, I understand we may want to instead support the concept
> of a write-only or read-only barrier. This means that if we want to
> keep the concept tied to the current _acq()/_rel() scheme we will end
> up with a KPI explosion.
>
> I'm not the one making the call here, but for a faster and more
> granular approach, possibly we could end up using smp_rmb() and
> smp_wmb() directly. As I said, I'm not the one making the call.
>

Well, I don't know the original motivation for expressing stuff with
_load_acq and _store_rel. Anyway, maybe we could do something along
these lines (expressing intent, not actual code):

mb_producer_start(p, v)   { *p = v; smp_wmb(); }
mb_producer(p, v)         { smp_wmb(); *p = v; }
mb_producer_end(p, v)     { mb_producer(p, v); }

type mb_consumer(p)       { var = *p; smp_rmb(); return (var); }
type mb_consumer_start(p) { return (mb_consumer(p)); }
type mb_consumer_end(p)   { smp_rmb(); return (*p); }
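For amd64 this could boil down to something like the following (again a
sketch only, untested; the _int suffix and the smp_*() definitions are
made up here, with __compiler_membar() assumed as above). amd64 does
not reorder stores with other stores, nor loads with other loads, so
smp_wmb() and smp_rmb() reduce to compiler barriers; other archs would
need real fence instructions:

	#define	smp_wmb()	__compiler_membar()	/* amd64 only */
	#define	smp_rmb()	__compiler_membar()	/* amd64 only */

	static __inline void
	mb_producer_int(volatile u_int *p, u_int v)
	{

		smp_wmb();	/* order prior stores before this one */
		*p = v;
	}

	static __inline u_int
	mb_consumer_int(volatile u_int *p)
	{
		u_int v;

		v = *p;
		smp_rmb();	/* order this load before later ones */
		return (v);
	}

-- 
Mateusz Guzik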