Date:      Thu, 18 Apr 2019 21:36:19 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>, freebsd-hackers Hackers <freebsd-hackers@freebsd.org>
Cc:        Bruce Evans <brde@optusnet.com.au>, Konstantin Belousov <kib@freebsd.org>
Subject:   powerpc64 or 32-bit power context: FreeBSD lwsync use vs. th->th_generation  handling (and related th-> fields) looks broken to me
Message-ID:  <50CFD7F1-6892-4375-967B-4713517C2520@yahoo.com>

First I review lwsync behavior below, based on a comparison/contrast
paper for the powerpc vs. arm memory models. This sets context for the
later material specific to powerpc64 and 32-bit powerpc FreeBSD.

"For a write before a read, separated by a lwsync, the barrier will =
ensure that the write is
committed before the read is satisfied but lets the read be satisfied =
before the write has
been propagated to any other thread."

(By contrast, sync guarantees that the write has propagated to all
threads before the read in question is satisfied, the read having
been separated from the write by the sync.)

Another wording in case it helps (from the same paper):

"The POWER lwsync does *not* ensure that writes before the barrier have =
propagated to
any other thread before sequent actions, though it does keep writes =
before and after
an lwsync in order as far as [each thread is] concerned". (Original used =
plural form:
"all threads are". I tired to avoid any potential implication of cross =
(hardware)
"thread" ordering constraints for seeing the updates when lwsync is =
used.)
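
To make the write-before-read case concrete, here is a
store-buffering style litmus sketch of my own (not from the paper).
With only lwsync between each cpu's write and read, the outcome
r0==0 && r1==0 remains architecturally allowed, because neither
lwsync forces its cpu's write to propagate before the read is
satisfied:

volatile int x = 0, y = 0;

void
cpu0_code(int *r0)
{
        x = 1;
        __asm __volatile("lwsync" : : : "memory");
        *r0 = y;        /* allowed to observe 0 */
}

void
cpu1_code(int *r1)
{
        y = 1;
        __asm __volatile("lwsync" : : : "memory");
        *r1 = x;        /* allowed to observe 0, even when *r0 == 0 too */
}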


Next I note FreeBSD powerpc64 and 32-bit powerpc details
that happen to involve lwsync, though lwsync is not the
only issue:

atomic_store_rel_int(&th->th_generation, ogen);

and:

gen = atomic_load_acq_int(&th->th_generation);

with:

static __inline void                                            \
atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)       \
{                                                               \
                                                                \
        powerpc_lwsync();                                       \
        *p = v;                                                 \
}

and:

static __inline u_##TYPE                                        \
atomic_load_acq_##TYPE(volatile u_##TYPE *p)                    \
{                                                               \
        u_##TYPE v;                                             \
                                                                \
        v = *p;                                                 \
        powerpc_lwsync();                                       \
        return (v);                                             \
}                                                               \

also:

static __inline void
atomic_thread_fence_acq(void)
{

        powerpc_lwsync();
}
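
For the int type used here, those effectively expand to the following
(a sketch of the macro expansion, showing where the lwsync lands
relative to the memory access):

static __inline void
atomic_store_rel_int(volatile u_int *p, u_int v)
{

        powerpc_lwsync();       /* barrier placed before the store */
        *p = v;
}

static __inline u_int
atomic_load_acq_int(volatile u_int *p)
{
        u_int v;

        v = *p;
        powerpc_lwsync();       /* barrier placed after the load */
        return (v);
}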



First I list a simpler-than-full-context example to
try to make things clearer . . .

Here is a sequence, listed in an overall time
order, omitting other activity despite the distinct
cpus (N!=M):


(Presume th->th_generation==ogen-1 initially, then:)

cpu N: atomic_store_rel_int(&th->th_generation, ogen);
       (same th value as for cpu M below)

cpu M: gen = atomic_load_acq_int(&th->th_generation);


For the above sequence:

There is no barrier between the store and the later
load at all. This is important below.
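
Substituting the definitions above makes this visible (a sketch):

cpu N: powerpc_lwsync();
       (barrier before the store)
cpu N: th->th_generation = ogen;

cpu M: gen = th->th_generation;
cpu M: powerpc_lwsync();
       (barrier after the load)

Each lwsync is local to its own cpu: nothing orders cpu N's store
against cpu M's load itself.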


So, if I have that much right . . .

Now for more actual "load side" context:
(Presume, for simplicity, that there is only one
timehands instance instead of 2 or more timehands. So
th does not vary below and is the same on both cpus
in the later example sequence of activity.)

        do {
                th = timehands;
                gen = atomic_load_acq_int(&th->th_generation);
                *bt = th->th_offset;
                bintime_addx(bt, th->th_scale * tc_delta(th));
                atomic_thread_fence_acq();
        } while (gen == 0 || gen != th->th_generation);
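
On powerpc, given the definitions above, that loop effectively
becomes (a sketch; the lwsync placement is the point):

        do {
                th = timehands;
                gen = th->th_generation;
                powerpc_lwsync();       /* from atomic_load_acq_int */
                *bt = th->th_offset;
                bintime_addx(bt, th->th_scale * tc_delta(th));
                powerpc_lwsync();       /* from atomic_thread_fence_acq */
        } while (gen == 0 || gen != th->th_generation);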

For ease of reference, I again show
a specific sequence in time. I only show the
&th->th_generation activity from cpu N, again
for simplicity.

(Presume timehands->th_generation==ogen-1 initially
and that M!=N:)

cpu M: th = timehands;
       (Could be after the "cpu N" lines.)

cpu N: atomic_store_rel_int(&th->th_generation, ogen);
       (same th value as for cpu M)

cpu M: gen = atomic_load_acq_int(&th->th_generation);
cpu M: *bt = th->th_offset;
cpu M: bintime_addx(bt, th->th_scale * tc_delta(th));
cpu M: atomic_thread_fence_acq();
cpu M: gen != th->th_generation
       (evaluated to false or to true)

So here:

A) gen ends up with: gen==ogen-1 || gen==ogen
   (either is allowed because of the lack of
   any barrier between the store and the
   involved load).

B) When gen==ogen: there was no barrier
   before the assignment to gen to guarantee
   other th-> field-value staging relationships.

C) When gen==ogen: gen!=th->th_generation being
   false does not guarantee that the *bt=. . .
   and bintime_addx(. . .) activities were based
   on a coherent set of th-> field-values.
   (See the writer-side sketch below.)
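
For reference, my understanding of the writer side is a seqlock-like
sequence, roughly the following (a simplified sketch, not the literal
tc_windup() code):

th->th_generation = 0;          /* mark the hands as being updated */
/* ... update th->th_offset, th->th_scale, and the rest ... */
atomic_store_rel_int(&th->th_generation, ogen);
/* on powerpc: powerpc_lwsync(), then the store of ogen */

The question in (C) is whether cpu M seeing ogen guarantees that it
also sees the field updates cpu N staged before its lwsync.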


If I'm correct about (C) then the likes of the
binuptime and sbinuptime implementations appear
to be broken on powerpc64 and 32-bit powerpc
unless there are extra guarantees always present.

So have I found at least a powerpc64/32-bit-powerpc
FreeBSD implementation problem?


Note: While I'm still testing, I've seen problems
on the two 970MP-based 2-socket/2-cores-each G5
PowerMac11,2's that I've so far not seen on three
2-socket/1-core-each PowerMacs: two 7455 G4
PowerMac3,6's and one 970 G5 PowerMac7,2.
The two PowerMac11,2's are far more tested at
this point. But proving that any test failure is
specifically because of (C) is problematic.


Note: arm apparently has no equivalent of lwsync,
only of sync (a.k.a. hwsync and sync 0). If I
understand correctly, PowerPC/Power has the weakest
memory model of the modern tier-1/tier-2
architectures, so these targets might be broken in
their memory-model handling even when everything
else is working.



===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



