Date: Thu, 18 Apr 2019 21:36:19 -0700
From: Mark Millard <marklmi@yahoo.com>
To: FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>,
    freebsd-hackers Hackers <freebsd-hackers@freebsd.org>
Cc: Bruce Evans <brde@optusnet.com.au>,
    Konstantin Belousov <kib@freebsd.org>
Subject: powerpc64 or 32-bit power context: FreeBSD lwsync use vs.
    th->th_generation handling (and related th-> fields) looks broken to me
Message-ID: <50CFD7F1-6892-4375-967B-4713517C2520@yahoo.com>
First I review lwsync behavior below. It is based on a comparison/contrast
paper for the powerpc vs. arm memory models. It sets context for later
material specific to powerpc64 or 32-bit powerpc FreeBSD.

"For a write before a read, separated by a lwsync, the barrier will ensure
that the write is committed before the read is satisfied, but lets the read
be satisfied before the write has been propagated to any other thread."

(By contrast, sync guarantees that the write has propagated to all threads
before the read in question is satisfied, the read having been separated
from the write by the sync.)

Another wording in case it helps (from the same paper):

"The POWER lwsync does *not* ensure that writes before the barrier have
propagated to any other thread before subsequent actions, though it does
keep writes before and after an lwsync in order as far as [each thread is]
concerned."

(The original used the plural form: "all threads are". I tried to avoid any
potential implication of cross-(hardware-)"thread" ordering constraints for
seeing the updates when lwsync is used.)

Next I note FreeBSD powerpc64 and 32-bit powerpc details that happen to
involve lwsync, though lwsync is not the only issue:

	atomic_store_rel_int(&th->th_generation, ogen);

and:

	gen = atomic_load_acq_int(&th->th_generation);

with:

	static __inline void						\
	atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)	\
	{								\
									\
		powerpc_lwsync();					\
		*p = v;							\
	}

and:

	static __inline u_##TYPE					\
	atomic_load_acq_##TYPE(volatile u_##TYPE *p)			\
	{								\
		u_##TYPE v;						\
									\
		v = *p;							\
		powerpc_lwsync();					\
		return (v);						\
	}								\

also:

	static __inline void
	atomic_thread_fence_acq(void)
	{

		powerpc_lwsync();
	}

First I list a simpler-than-full-context example to try to make things
clearer . . .
Here is a sequence, listed in an overall time order, omitting other
activity, despite the distinct cpus (N != M):

(Presume th->th_generation == ogen-1 initially, then:)

cpu N: atomic_store_rel_int(&th->th_generation, ogen);
       (same th value as for cpu M below)

cpu M: gen = atomic_load_acq_int(&th->th_generation);

For the above sequence: there is no barrier between the store and the
later load at all. This is important below.

So, if I have that much right . . .

Now for more actual "load side" context:

(Presume, for simplicity, that there is only one timehands instance
instead of 2 or more timehands. So th does not vary below and is the same
on both cpus in the later example sequence of activity.)

	do {
		th = timehands;
		gen = atomic_load_acq_int(&th->th_generation);
		*bt = th->th_offset;
		bintime_addx(bt, th->th_scale * tc_delta(th));
		atomic_thread_fence_acq();
	} while (gen == 0 || gen != th->th_generation);

For simplicity of referring to things, I again show a specific sequence in
time. I only show the &th->th_generation activity from cpu N, again for
simplicity.

(Presume timehands->th_generation == ogen-1 initially and that M != N:)

cpu M: th = timehands; (Could be after the "cpu N" lines.)

cpu N: atomic_store_rel_int(&th->th_generation, ogen);
       (same th value as for cpu M)

cpu M: gen = atomic_load_acq_int(&th->th_generation);
cpu M: *bt = th->th_offset;
cpu M: bintime_addx(bt, th->th_scale * tc_delta(th));
cpu M: atomic_thread_fence_acq();
cpu M: gen != th->th_generation (evaluated to false or to true)

So here:

A) gen ends up with: gen == ogen-1 || gen == ogen
   (either is allowed because of the lack of any barrier between the
   store and the involved load).

B) When gen == ogen: there was no barrier before the assignment to gen to
   guarantee other th-> field-value staging relationships.

C) When gen == ogen: gen != th->th_generation evaluating to false does not
   guarantee that the *bt = . . . and bintime_addx(. . .)
activities were based on a coherent set of th-> field-values.

If I'm correct about (C), then the likes of the binuptime and sbinuptime
implementations appear to be broken on powerpc64 and 32-bit powerpc,
unless there are extra guarantees always present.

So have I found at least a powerpc64/32-bit-powerpc FreeBSD implementation
problem?

Note: While I'm still testing, I've seen problems on the two 970MP-based
2-socket/2-cores-each G5 PowerMac11,2's that I've so far not seen on three
2-socket/1-core-each PowerMacs: two 7455 G4 PowerMac3,6's and one 970 G5
PowerMac7,2. The two PowerMac11,2's are far more tested at this point. But
proving that any test failure is specifically because of (C) is
problematic.

Note: arm apparently has no equivalent of lwsync, just of sync (a.k.a.
hwsync and sync 0). If I understand correctly, PowerPC/Power has the
weakest memory model of the modern tier-1/tier-2 architectures, and so
these routines might be broken for memory-model handling even when
everything else is working.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)