Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 12 Mar 2020 02:48:06 -0500
From:      Alan Cox <alc@rice.edu>
To:        Mateusz Guzik <mjguzik@gmail.com>, Konstantin Belousov <kostikbel@gmail.com>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r358733 - head/sys/sys
Message-ID:  <0d1145b3-1c29-aa21-1bd5-1129fdb88144@rice.edu>
In-Reply-To: <CAGudoHFfHkaT9zTYyyj%2B8toGp7eYVwf5ECfKs-xTdTJUMwy4YA@mail.gmail.com>
References:  <202003080022.0280MX9j055805@repo.freebsd.org> <CAGudoHE4hTN5niCAsA4tfggywkEnGxiaoPz=rTX3AHA=fgZfkQ@mail.gmail.com> <20200309213512.GS98340@kib.kiev.ua> <CAGudoHHxRvrABfNThT=9_bv1dXsSExKrnFTScrHBDY1Z6GzGgg@mail.gmail.com> <20200310192326.GC1992@kib.kiev.ua> <CAGudoHFfHkaT9zTYyyj%2B8toGp7eYVwf5ECfKs-xTdTJUMwy4YA@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help

On 3/11/20 9:16 PM, Mateusz Guzik wrote:
> On 3/10/20, Konstantin Belousov <kostikbel@gmail.com> wrote:
>> On Tue, Mar 10, 2020 at 12:22:09AM +0100, Mateusz Guzik wrote:
>>> On 3/9/20, Konstantin Belousov <kostikbel@gmail.com> wrote:
>>>> On Mon, Mar 09, 2020 at 01:56:17AM +0100, Mateusz Guzik wrote:
>>>>> On 3/8/20, Mateusz Guzik <mjg@freebsd.org> wrote:
>>>>>> Author: mjg
>>>>>> Date: Sun Mar  8 00:22:32 2020
>>>>>> New Revision: 358733
>>>>>> URL: https://svnweb.freebsd.org/changeset/base/358733
>>>>>>
>>>>>> Log:
>>>>>>    seqc: tidy up
>>>>>>
>>>>>>    - avoid hand-rolled read
>>>>>>    - match begin/end in terms of fence style
>>>>>>
>>>>> There were off lists questions about this so let me clarify.
>>>>>
>>>>> The first bit is a cosmetic change, but the second one is not.
>>>>>
>>>>>> Modified:
>>>>>>    head/sys/sys/seqc.h
>>>>>>
>>>>>> Modified: head/sys/sys/seqc.h
>>>>>> ==============================================================================
>>>>>> --- head/sys/sys/seqc.h	Sat Mar  7 15:37:23 2020	(r358732)
>>>>>> +++ head/sys/sys/seqc.h	Sun Mar  8 00:22:32 2020	(r358733)
>>>>>> @@ -66,7 +66,8 @@ static __inline void
>>>>>>   seqc_write_end(seqc_t *seqcp)
>>>>>>   {
>>>>>>
>>>>>> -	atomic_store_rel_int(seqcp, *seqcp + 1);
>>>>>> +	atomic_thread_fence_rel();
>>>>>> +	*seqcp += 1;
>>>>>>   	MPASS(!seqc_in_modify(*seqcp));
>>>>>>   	critical_exit();
>>>>>>   }
>>>>> For correct operation the counter has to be modified *before* any work
>>>>> is done and *after* it is completed.
>>>>>
>>>>> Consider a trivial case:
>>>>> seqc++;
>>>>> foo[0] = 0;
>>>>> foo[1] = 1;
>>>>> seqc++;
>>>>>
>>>>> There are 2 ways in which this can be mucked with:
>>>>> - the compiler can be looking at reordering or reworking the increment
>>>>> (e.g., to collapse it to just += 2 instead). a compiler barrier
>>>>> prevents that.
>>>>> - even with generated machine code which increments in correct places
>>>>> the cpu can be looking at reordering the operation (which is heavily
>>>>> dependent on architecture), in particular it could end up issuing
>>>>> seqc++; foo[1] = 1; which would defeat the point. This is where
>>>>> release fences come in.
>>>>>
>>>>> With the patched code there is:
>>>>> seqc++;
>>>>> atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
>>>>> starts getting modified */
>>>>> foo[0] = 0;
>>>>> foo[1] = 1;
>>>>> atomic_thread_fence_rel(); /* make sure modifications to foo don't
>>>>> leak past this spot */
>>>>> seqc++;
>>>>>
>>>>> In comparison, the previous code was:
>>>>> seqc++;
>>>>> atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
>>>>> starts getting modified */
>>>>> foo[0] = 0;
>>>>> foo[1] = 1;
>>>>> atomic_store_rel_int(seqcp, *seqcp + 1); /* make sure modifications to
>>>>> too don't leak past this spot and update seqc immediately */
>>>>>
>>>>> There are 2 differences here -- one is an improvement and second one
>>>>> does not matter:
>>>>> the win is: the previous code forces the compiler to compute an
>>>>> incremented value and then store it. On amd64 this translates to the
>>>>> following (with rdx holding the address of the counter):
>>>>>
>>>>> mov    (%rdx),%eax /* load the value */
>>>>> add    $0x1,%eax    /* add 1 */
>>>>> mov    %eax,(%rdx) /* store it */
>>>>>
>>>>> On patched kernel this is:
>>>>> addl   $0x1,(%rdx)
>>>>>
>>>>> which is clearly much nicer.
>>>> I am not sure that the new code on amd64 is much nicer.
>>>>
>>>> But the point was that on non-amd64, i.e. armv8 and potentially on
>>>> powerpc 3.0, the patch substitutes release store with full barrier. The
>>>> later is (much) slower.
>>>>
>>> If an arch performs something significantly more expensive here it's
>>> probably a performance bug in the port.
>> It is a question of how much hardware support for fine-grained fences.
>> Relaxed architectures add some optimized operations already, but did not
>> (yet) implemented complete C11 optimized set.
>> I think they eventually converge.
>>
>> Until they did not, I think it is more important to not pessimize
>> to-be Tier1 arches than to collapse three fast integer instructions
>> on amd64 into one rw op.
>>
>>> But perhaps more importantly
>>> there are significantly more frequent users of
>>> atomic_thread_fence_rel, like refcount(9). If the issue is real that
>>> should be looked at.
>> For refcount(9) use of release fence removal, we would need
>> 	atomic_fetchadd_rel_int().
>>
> To reiterate, if atomic_thread_fence_rel is indeed very expensive on
> certain architectures compared to other _rel variants, the kernel is
> already heavily pessimized because of use of said standalone fence in
> refcount(9) and few other places and this needs to be fixed.


Just to be clear, this is not just a matter of whether our release 
fences are poorly or inefficiently implemented.  Release fences (as 
defined by the C/C++ standards that inspire our atomics) have 
*different* semantics than release stores. Specifically, a release fence 
places more ordering restrictions on the nearby store accesses than a 
release store does.  (In fact, "man 9 atomic" describes this.)  Thus, it 
is almost inevitable that a release fence will have a higher cost than 
the best possible implementation of a release store.


> If the cost can't be reduced (e.g., perhaps this can do a no-op store
> + fence somewhere akin to how lock xadd 0,(%esp) trickery used to
> work), then the atomic APIs do indeed not to be completed w.r.t.
> adding all the missing acq/rel variants (that's at least fetchadd and
> swap).
>
> We can also extend the API to do relevant addition/subtraction without
> guaranteeing atomicity vs concurrent updates, but while providing
> relevant fences. This would preserve the intended state achieved with
> this commit while not interfering with others.
>
> I can add the missing amd64 bits if other people cover their
> architectures, but otherwise I'm not interested in doing anything
> here.
>
> seqc itself, apart from being a minor consumer compared to refcount,
> was already using the standalone fence even prior to this change thus
> I don't think there is any rush to do anything with regards to this
> particular commit.


Indeed, it was.  It was using a release fence where it was semantically 
required to do so for *correctness*.  Specifically, where the extra 
ordering restrictions of a release fence were needed.  But, it was also 
using the simpler release store where the extra ordering restrictions 
were not needed.


>
>>>>> the not a problem: the code does not issue the release fence after
>>>>> incrementing the counter the second time, meaning that in principle it
>>>>> can happen arbitrarily late, possibly disturbing code which waits for
>>>>> the counter to become even. This is not a problem because in whoever
>>>>> modifies it is supposed to have a lock and release it shortly after,
>>>>> which also posts the release fence and consequently makes sure the
>>>>> counter update is visible. Also note these routines come with
>>>>> preemption disablement around the modification -- if the thread gets
>>>>> preempted as a result of critical_exit from seqc_write_end, it will
>>>>> also publish the updated counter. If the updater never does anything
>>>>> which flushes the store buffer then in the absolute worst case they
>>>>> will get preempted, at which point once more we are covered. Note that
>>>>> code which wants more deterministic behavior will want to
>>>>> seqc_read_any, test the counter once and if it is caught uneven resort
>>>>> to locking.
>>>>>
>>>>>> @@ -84,7 +85,7 @@ seqc_read(const seqc_t *seqcp)
>>>>>>   	seqc_t ret;
>>>>>>
>>>>>>   	for (;;) {
>>>>>> -		ret = atomic_load_acq_int(__DECONST(seqc_t *, seqcp));
>>>>>> +		ret = seqc_read_any(seqcp);
>>>>>>   		if (__predict_false(seqc_in_modify(ret))) {
>>>>>>   			cpu_spinwait();
>>>>>>   			continue;
>>>>>> _______________________________________________
>>>>>> svn-src-all@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/svn-src-all
>>>>>> To unsubscribe, send any mail to
>>>>>> "svn-src-all-unsubscribe@freebsd.org"
>>>>>>
>>>>>
>>>>> --
>>>>> Mateusz Guzik <mjguzik gmail.com>
>>>
>>> --
>>> Mateusz Guzik <mjguzik gmail.com>
>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?0d1145b3-1c29-aa21-1bd5-1129fdb88144>