Date: Fri, 4 Sep 2020 14:36:29 +0100
From: Jessica Clarke <jrtc27@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Marcin Wojtas <mw@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-12@freebsd.org
Subject: Re: svn commit: r365326 - stable/12/sys/sys
Message-ID: <B8C7F5D3-E30F-4CD7-838D-62EE2C973733@freebsd.org>
In-Reply-To: <20200904130946.GG94807@kib.kiev.ua>
References: <202009041122.084BMIYn040628@repo.freebsd.org> <20200904130946.GG94807@kib.kiev.ua>
On 4 Sep 2020, at 14:09, Konstantin Belousov <kostikbel@gmail.com> wrote:
> On Fri, Sep 04, 2020 at 11:22:18AM +0000, Marcin Wojtas wrote:
>> Author: mw
>> Date: Fri Sep  4 11:22:18 2020
>> New Revision: 365326
>> URL: https://svnweb.freebsd.org/changeset/base/365326
>>
>> Log:
>>   MFC: r346593
>>
>>   Add barrier in buf ring peek function to prevent race in ARM and ARM64.
>>
>>   Obtained from: Semihalf
>>   Sponsored by:  Amazon, Inc.
>>
>> Modified:
>>   stable/12/sys/sys/buf_ring.h
>> Directory Properties:
>>   stable/12/   (props changed)
>>
>> Modified: stable/12/sys/sys/buf_ring.h
>> ==============================================================================
>> --- stable/12/sys/sys/buf_ring.h	Fri Sep  4 04:31:56 2020	(r365325)
>> +++ stable/12/sys/sys/buf_ring.h	Fri Sep  4 11:22:18 2020	(r365326)
>> @@ -310,14 +310,23 @@ buf_ring_peek_clear_sc(struct buf_ring *br)
>>  	if (!mtx_owned(br->br_lock))
>>  		panic("lock not held on single consumer dequeue");
>>  #endif
>> -	/*
>> -	 * I believe it is safe to not have a memory barrier
>> -	 * here because we control cons and tail is worst case
>> -	 * a lagging indicator so we worst case we might
>> -	 * return NULL immediately after a buffer has been enqueued
>> -	 */
>> +
>>  	if (br->br_cons_head == br->br_prod_tail)
>>  		return (NULL);
>> +
>> +#if defined(__arm__) || defined(__aarch64__)
>> +	/*
>> +	 * The barrier is required there on ARM and ARM64 to ensure, that
>> +	 * br->br_ring[br->br_cons_head] will not be fetched before the above
>> +	 * condition is checked.
>> +	 * Without the barrier, it is possible, that buffer will be fetched
>> +	 * before the enqueue will put mbuf into br, then, in the meantime, the
>> +	 * enqueue will update the array and the br_prod_tail, and the
>> +	 * conditional check will be true, so we will return previously fetched
>> +	 * (and invalid) buffer.
>> +	 */
>> +	atomic_thread_fence_acq();
>> +#endif
>
> Putting the semantics of the change aside, why did you add the fence (it is
> a fence, not a barrier as stated in the comment) only to arm*? If it is
> needed, it is needed for all arches.

Agreed. The code looks fine, though I would have made it an acquire load of
br_prod_tail myself, both to be able to take advantage of load-acquire
instructions where present and to better document what the exact issue is. I
also don't think the comment needs to be quite so extensive (especially since
atomic_load_acq_32 is somewhat self-documenting in terms of one half of the
race); if we had a comment like this for every fence in the kernel we'd never
get anything done.

There's also an ARM-specific fence in buf_ring_dequeue_sc:

> 	/*
> 	 * This is a workaround to allow using buf_ring on ARM and ARM64.
> 	 * ARM64TODO: Fix buf_ring in a generic way.
> 	 * REMARKS: It is suspected that br_cons_head does not require
> 	 *   load_acq operation, but this change was extensively tested
> 	 *   and confirmed it's working. To be reviewed once again in
> 	 *   FreeBSD-12.
> 	 *
> 	 * Preventing following situation:
> 	 *
> 	 * Core(0) - buf_ring_enqueue()                    Core(1) - buf_ring_dequeue_sc()
> 	 * ------------------------------                 -------------------------------
> 	 *
> 	 *                                                cons_head = br->br_cons_head;
> 	 * atomic_cmpset_acq_32(&br->br_prod_head, ...));
> 	 *                                                buf = br->br_ring[cons_head];   <see <1>>
> 	 * br->br_ring[prod_head] = buf;
> 	 * atomic_store_rel_32(&br->br_prod_tail, ...);
> 	 *                                                prod_tail = br->br_prod_tail;
> 	 *                                                if (cons_head == prod_tail)
> 	 *                                                        return (NULL);
> 	 *                                                <condition is false and code uses invalid(old) buf>
> 	 *
> 	 * <1> Load (on core 1) from br->br_ring[cons_head] can be reordered (speculatively read) by the CPU.
> 	 */
> #if defined(__arm__) || defined(__aarch64__)
> 	cons_head = atomic_load_acq_32(&br->br_cons_head);
> #else
> 	cons_head = br->br_cons_head;
> #endif
> 	prod_tail = atomic_load_acq_32(&br->br_prod_tail);

The comment is completely correct that the ARM-specific fence is a waste of
time. It's the single-consumer path, so acquire-ordering the load of
br_cons_head only synchronises with the current thread and is thus pointless.
The important one is the load-acquire of br_prod_tail, which already stops the
reordering in question, as has been discovered (sort of) in the peek case that
led to this commit.

Jess