Date: Thu, 12 Mar 2020 02:48:06 -0500
From: Alan Cox <alc@rice.edu>
To: Mateusz Guzik <mjguzik@gmail.com>, Konstantin Belousov <kostikbel@gmail.com>
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r358733 - head/sys/sys
Message-ID: <0d1145b3-1c29-aa21-1bd5-1129fdb88144@rice.edu>
In-Reply-To: <CAGudoHFfHkaT9zTYyyj+8toGp7eYVwf5ECfKs-xTdTJUMwy4YA@mail.gmail.com>
References: <202003080022.0280MX9j055805@repo.freebsd.org>
 <CAGudoHE4hTN5niCAsA4tfggywkEnGxiaoPz=rTX3AHA=fgZfkQ@mail.gmail.com>
 <20200309213512.GS98340@kib.kiev.ua>
 <CAGudoHHxRvrABfNThT=9_bv1dXsSExKrnFTScrHBDY1Z6GzGgg@mail.gmail.com>
 <20200310192326.GC1992@kib.kiev.ua>
 <CAGudoHFfHkaT9zTYyyj+8toGp7eYVwf5ECfKs-xTdTJUMwy4YA@mail.gmail.com>
On 3/11/20 9:16 PM, Mateusz Guzik wrote:
> On 3/10/20, Konstantin Belousov <kostikbel@gmail.com> wrote:
>> On Tue, Mar 10, 2020 at 12:22:09AM +0100, Mateusz Guzik wrote:
>>> On 3/9/20, Konstantin Belousov <kostikbel@gmail.com> wrote:
>>>> On Mon, Mar 09, 2020 at 01:56:17AM +0100, Mateusz Guzik wrote:
>>>>> On 3/8/20, Mateusz Guzik <mjg@freebsd.org> wrote:
>>>>>> Author: mjg
>>>>>> Date: Sun Mar  8 00:22:32 2020
>>>>>> New Revision: 358733
>>>>>> URL: https://svnweb.freebsd.org/changeset/base/358733
>>>>>>
>>>>>> Log:
>>>>>>   seqc: tidy up
>>>>>>
>>>>>>   - avoid hand-rolled read
>>>>>>   - match begin/end in terms of fence style
>>>>>>
>>>>> There were off-list questions about this, so let me clarify.
>>>>>
>>>>> The first bit is a cosmetic change, but the second one is not.
>>>>>
>>>>>> Modified:
>>>>>>   head/sys/sys/seqc.h
>>>>>>
>>>>>> Modified: head/sys/sys/seqc.h
>>>>>> ==============================================================================
>>>>>> --- head/sys/sys/seqc.h	Sat Mar  7 15:37:23 2020	(r358732)
>>>>>> +++ head/sys/sys/seqc.h	Sun Mar  8 00:22:32 2020	(r358733)
>>>>>> @@ -66,7 +66,8 @@ static __inline void
>>>>>>  seqc_write_end(seqc_t *seqcp)
>>>>>>  {
>>>>>>
>>>>>> -	atomic_store_rel_int(seqcp, *seqcp + 1);
>>>>>> +	atomic_thread_fence_rel();
>>>>>> +	*seqcp += 1;
>>>>>>  	MPASS(!seqc_in_modify(*seqcp));
>>>>>>  	critical_exit();
>>>>>>  }
>>>>>
>>>>> For correct operation the counter has to be modified *before* any work
>>>>> is done and *after* it is completed.
>>>>>
>>>>> Consider a trivial case:
>>>>> seqc++;
>>>>> foo[0] = 0;
>>>>> foo[1] = 1;
>>>>> seqc++;
>>>>>
>>>>> There are 2 ways in which this can be mucked with:
>>>>> - the compiler can be looking at reordering or reworking the increment
>>>>>   (e.g., to collapse it to just += 2 instead).  A compiler barrier
>>>>>   prevents that.
>>>>> - even with generated machine code which increments in the correct
>>>>>   places, the CPU can be looking at reordering the operations (which is
>>>>>   heavily dependent on the architecture); in particular it could end up
>>>>>   issuing seqc++; foo[1] = 1; which would defeat the point.  This is
>>>>>   where release fences come in.
>>>>>
>>>>> With the patched code there is:
>>>>> seqc++;
>>>>> atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
>>>>>                               starts getting modified */
>>>>> foo[0] = 0;
>>>>> foo[1] = 1;
>>>>> atomic_thread_fence_rel(); /* make sure modifications to foo don't
>>>>>                               leak past this spot */
>>>>> seqc++;
>>>>>
>>>>> In comparison, the previous code was:
>>>>> seqc++;
>>>>> atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
>>>>>                               starts getting modified */
>>>>> foo[0] = 0;
>>>>> foo[1] = 1;
>>>>> atomic_store_rel_int(seqcp, *seqcp + 1); /* make sure modifications to
>>>>>                               foo don't leak past this spot and update
>>>>>                               seqc immediately */
>>>>>
>>>>> There are 2 differences here -- one is an improvement and the second
>>>>> one does not matter:
>>>>>
>>>>> the win is: the previous code forces the compiler to compute an
>>>>> incremented value and then store it.  On amd64 this translates to the
>>>>> following (with rdx holding the address of the counter):
>>>>>
>>>>> mov    (%rdx),%eax  /* load the value */
>>>>> add    $0x1,%eax    /* add 1 */
>>>>> mov    %eax,(%rdx)  /* store it */
>>>>>
>>>>> On a patched kernel this is:
>>>>>
>>>>> addl   $0x1,(%rdx)
>>>>>
>>>>> which is clearly much nicer.
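To make the pattern concrete, here is a sketch of the writer side as
described above.  It is reconstructed from the explanation in this thread
rather than copied verbatim from sys/sys/seqc.h, and the obj/foo names are
made up for illustration:

static __inline void
seqc_write_begin(seqc_t *seqcp)
{

	critical_enter();		/* no preemption while the counter is odd */
	*seqcp += 1;			/* counter becomes odd */
	atomic_thread_fence_rel();	/* increment is issued before foo is modified */
}

static __inline void
seqc_write_end(seqc_t *seqcp)
{

	atomic_thread_fence_rel();	/* stores to foo may not leak past this spot */
	*seqcp += 1;			/* counter becomes even again */
	critical_exit();
}

/* A typical writer, serialized against other writers by a lock: */
seqc_write_begin(&obj->seqc);
obj->foo[0] = 0;
obj->foo[1] = 1;
seqc_write_end(&obj->seqc);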
>>>> I am not sure that the new code on amd64 is much nicer.
>>>>
>>>> But the point was that on non-amd64, i.e. armv8 and potentially on
>>>> powerpc 3.0, the patch substitutes a release store with a full barrier.
>>>> The latter is (much) slower.
>>>>
>>> If an arch performs something significantly more expensive here it's
>>> probably a performance bug in the port.
>> It is a question of how much hardware support there is for fine-grained
>> fences.  Relaxed architectures already add some optimized operations, but
>> have not (yet) implemented the complete optimized C11 set.  I think they
>> will eventually converge.
>>
>> Until they do, I think it is more important not to pessimize to-be Tier 1
>> arches than to collapse three fast integer instructions on amd64 into one
>> rw op.
>>
>>> But perhaps more importantly there are significantly more frequent users
>>> of atomic_thread_fence_rel, like refcount(9).  If the issue is real that
>>> should be looked at.
>> For refcount(9) use of release fence removal, we would need
>> atomic_fetchadd_rel_int().
>>
> To reiterate, if atomic_thread_fence_rel is indeed very expensive on
> certain architectures compared to other _rel variants, the kernel is
> already heavily pessimized because of the use of said standalone fence in
> refcount(9) and a few other places, and this needs to be fixed.

Just to be clear, this is not just a matter of whether our release fences
are poorly or inefficiently implemented.  Release fences (as defined by the
C/C++ standards that inspire our atomics) have *different* semantics than
release stores.  Specifically, a release fence places more ordering
restrictions on the nearby store accesses than a release store does.  (In
fact, "man 9 atomic" describes this.)  Thus, it is almost inevitable that a
release fence will have a higher cost than the best possible implementation
of a release store.

> If the cost can't be reduced (e.g., perhaps this can do a no-op store +
> fence somewhere, akin to how the lock xadd 0,(%esp) trickery used to
> work), then the atomic APIs do indeed need to be completed w.r.t. adding
> all the missing acq/rel variants (that's at least fetchadd and swap).
>
> We can also extend the API to do the relevant addition/subtraction
> without guaranteeing atomicity vs concurrent updates, but while providing
> the relevant fences.  This would preserve the intended state achieved
> with this commit while not interfering with others.
>
> I can add the missing amd64 bits if other people cover their
> architectures, but otherwise I'm not interested in doing anything here.
>
> seqc itself, apart from being a minor consumer compared to refcount, was
> already using the standalone fence even prior to this change, thus I
> don't think there is any rush to do anything with regards to this
> particular commit.

Indeed, it was.  It was using a release fence where it was semantically
required to do so for *correctness*, specifically, where the extra ordering
restrictions of a release fence were needed.  But, it was also using the
simpler release store where the extra ordering restrictions were not
needed.
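To illustrate the distinction being drawn here, consider the following
sketch (based on the wording above and in "man 9 atomic"; the obj, bar and
v names are made up).  A release store only orders the accesses that
precede it against that one store, while a release fence orders them
against every subsequent store:

/* Release store: only the store to seqc itself is ordered. */
obj->foo = 1;				/* must be visible before seqc == v */
atomic_store_rel_int(&obj->seqc, v);
obj->bar = 2;				/* not constrained; may become visible
					   before the store to seqc */

/*
 * Release fence: every subsequent store is ordered after the prior
 * accesses.  This is what seqc_write_begin() needs, because the stores
 * that must not be reordered -- the data updates -- come after the
 * counter update.
 */
*seqcp += 1;
atomic_thread_fence_rel();
obj->foo = 1;				/* cannot become visible before the
					   counter increment above */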
>
>>>>> the not a problem: the code does not issue the release fence after
>>>>> incrementing the counter the second time, meaning that in principle it
>>>>> can happen arbitrarily late, possibly disturbing code which waits for
>>>>> the counter to become even.  This is not a problem because whoever
>>>>> modifies it is supposed to hold a lock and release it shortly after,
>>>>> which also posts the release fence and consequently makes sure the
>>>>> counter update is visible.  Also note these routines come with
>>>>> preemption disablement around the modification -- if the thread gets
>>>>> preempted as a result of critical_exit from seqc_write_end, it will
>>>>> also publish the updated counter.  If the updater never does anything
>>>>> which flushes the store buffer, then in the absolute worst case they
>>>>> will get preempted, at which point once more we are covered.  Note
>>>>> that code which wants more deterministic behavior will want to
>>>>> seqc_read_any, test the counter once and, if it is caught uneven,
>>>>> resort to locking.
>>>>>
>>>>>> @@ -84,7 +85,7 @@ seqc_read(const seqc_t *seqcp)
>>>>>>  	seqc_t ret;
>>>>>>
>>>>>>  	for (;;) {
>>>>>> -		ret = atomic_load_acq_int(__DECONST(seqc_t *, seqcp));
>>>>>> +		ret = seqc_read_any(seqcp);
>>>>>>  		if (__predict_false(seqc_in_modify(ret))) {
>>>>>>  			cpu_spinwait();
>>>>>>  			continue;
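For completeness, a reader following the approach described above (one
lock-free attempt, falling back to the lock that writers hold) might look
roughly like this.  The struct obj, foo and mtx names are made up; the
seqc_* routines are the ones from sys/sys/seqc.h, assuming its
seqc_consistent() helper for the final recheck:

static int
obj_read_foo(struct obj *obj, int *foo)
{
	seqc_t seqc;

	seqc = seqc_read_any(&obj->seqc);
	if (!seqc_in_modify(seqc)) {
		foo[0] = obj->foo[0];
		foo[1] = obj->foo[1];
		if (seqc_consistent(&obj->seqc, seqc))
			return (0);	/* consistent lock-free snapshot */
	}

	/*
	 * A writer was in progress or raced with us; the lock that
	 * serializes writers gives a deterministic result.
	 */
	mtx_lock(&obj->mtx);
	foo[0] = obj->foo[0];
	foo[1] = obj->foo[1];
	mtx_unlock(&obj->mtx);
	return (0);
}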
>>>>>> _______________________________________________
>>>>>> svn-src-all@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/svn-src-all
>>>>>> To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
>>>>>>
>>>>>
>>>>> --
>>>>> Mateusz Guzik <mjguzik gmail.com>
>>>
>>> --
>>> Mateusz Guzik <mjguzik gmail.com>
>