From owner-freebsd-arch@FreeBSD.ORG Wed Oct 29 19:05:07 2014
Date: Wed, 29 Oct 2014 20:04:59 +0100
From: Mateusz Guzik
To: Attilio Rao
Cc: Adrian Chadd, Alan Cox, Konstantin Belousov, "freebsd-arch@freebsd.org"
Subject: Re: atomic ops
Message-ID: <20141029190459.GA25368@dft-labs.eu>
References: <20141028025222.GA19223@dft-labs.eu>

On Tue, Oct 28, 2014 at 02:18:41PM +0100, Attilio Rao wrote:
> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik wrote:
> > As was mentioned some time ago, our situation related to atomic ops is
> > not ideal.
> >
> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
> > full memory barriers, which is stronger than needed.
> >
> > Moreover, load is implemented as lock cmpxchg on the variable's
> > address, so it is additionally slower, especially when cpus compete.
>
> I already explained this once privately: full memory barriers are not
> stronger than needed.
> FreeBSD has different semantics than Linux. We historically enforce a
> full barrier on _acq() and _rel() rather than just a read or write
> barrier, hence we need a different implementation than Linux.
> There is code that relies on this property, like the locking
> primitives (releasing a mutex, for instance).
>

I mean stronger than needed in some cases; a popular one is
fget_unlocked, and we provide no "lightest sufficient" barrier (which
would also be cheaper). Another case which would benefit greatly is
sys/sys/seq.h.
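For illustration, roughly what I have in mind for a cheaper
load-acquire on amd64 (a sketch only, untested; the _light name is made
up, and __compiler_membar() is assumed to be the compiler-only barrier
from amd64's atomic.h). amd64 does not reorder loads with later loads
or stores, so a plain load followed by a compiler barrier already gives
acquire semantics; no lock cmpxchg on the variable is needed:

	/*
	 * Hypothetical lighter load-acquire for amd64.  The plain load
	 * avoids dirtying the cache line the way lock cmpxchg does,
	 * while the compiler barrier stops the compiler from moving
	 * later memory accesses before the load; the CPU already keeps
	 * them ordered on amd64.
	 */
	static __inline u_int
	atomic_load_acq_light_int(volatile u_int *p)
	{
		u_int v;

		v = *p;
		__compiler_membar();
		return (v);
	}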
As noted in some other thread, using load_acq as it is destroys
performance. I don't dispute the need for full barriers, although it is
unclear which of the current consumers of load_acq actually need a full
barrier.

> In short: optimizing the implementation for performance is fine and
> due. Changing the semantics is not fine, unless you have reviewed and
> fixed all the uses of _rel() and _acq().
>
> > On amd64 it is sufficient to place a compiler barrier in such cases.
> >
> > Next, we lack some atomic ops in the first place.
> >
> > Let's define some useful terms:
> > smp_wmb - no writes can be reordered past this point
> > smp_rmb - no reads can be reordered past this point
> >
> > With this in mind, we lack ops which would guarantee only the following:
> >
> > 1. var = tmp; smp_wmb();
> > 2. tmp = var; smp_rmb();
> > 3. smp_rmb(); tmp = var;
> >
> > This matters since what we can already use to emulate these is way
> > heavier than needed on the aforementioned amd64 and most likely on
> > other archs.
>
> I can see the value of such barriers in case you want to just
> synchronize operations with regard to reads or writes.
> I also believe that on the newest Intel processors (for which we should
> optimize) rmb() and wmb() got significantly faster than mb(). However,
> the most interesting cases would be arm and mips, I assume. That's
> where you would see a bigger perf difference if you optimized the
> membar paths.
>
> Last time I looked into it, in the FreeBSD kernel the Linux-ish
> rmb()/wmb()/etc. were used primarily in 3 places: Linux-derived code,
> handling of 16-bit operands and the implementation of "faster" bus
> barriers.
> Initially I had thought about just confining the smp_*() ops to a
> Linux compat layer and fixing the other 2 this way: for 16-bit
> operands, just pad to 32 bits, as the C11 standard also does. For the
> bus barriers, just grow more versions to actually include the
> rmb()/wmb() scheme within.
>
> At this point, I understand we may want to instead support the concept
> of a write-only or read-only barrier. This means that if we want to
> keep the concept tied to the current _acq()/_rel() scheme we will end
> up with a KPI explosion.
>
> I'm not the one making the call here, but for a faster and more
> granular approach, possibly we could end up using smp_rmb() and
> smp_wmb() directly. As I said, I'm not the one making the call.
>

Well, I don't know the original motivation for expressing stuff with
_load_acq and _store_rel. Anyway, maybe we could do something along
these lines (expressing intent, not actual code):

mb_producer_start(p, v)   { *p = v; smp_wmb(); }
mb_producer(p, v)         { smp_wmb(); *p = v; }
mb_producer_end(p, v)     { mb_producer(p, v); }

type mb_consumer(p)       { var = *p; smp_rmb(); return (var); }
type mb_consumer_start(p) { return (mb_consumer(p)); }
type mb_consumer_end(p)   { smp_rmb(); return (*p); }
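For amd64 this could boil down to something like the following (again a
sketch only, untested; the _int suffix and the smp_*() definitions are
made up here, with __compiler_membar() assumed as above). amd64 does
not reorder stores with other stores, nor loads with other loads, so
smp_wmb() and smp_rmb() reduce to compiler barriers; other archs would
need real fence instructions:

	#define	smp_wmb()	__compiler_membar()	/* amd64 only */
	#define	smp_rmb()	__compiler_membar()	/* amd64 only */

	static __inline void
	mb_producer_int(volatile u_int *p, u_int v)
	{

		smp_wmb();	/* order prior stores before this one */
		*p = v;
	}

	static __inline u_int
	mb_consumer_int(volatile u_int *p)
	{
		u_int v;

		v = *p;
		smp_rmb();	/* order this load before later ones */
		return (v);
	}

-- 
Mateusz Guzik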