Date: Mon, 4 Mar 2019 01:40:18 -0800 From: Mark Millard <marklmi@yahoo.com> To: FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>, Mark Millard via freebsd-hackers <freebsd-hackers@freebsd.org> Subject: Re: powerpc64 on PowerMac G5 4-core (system total): a hack that so far seem to avoid the stuck-sleeping issue [self-hosted buildworld/buildkernel completed] Message-ID: <75A8BB07-3273-423E-9436-798395BC8640@yahoo.com> In-Reply-To: <76E8BF75-A8F5-4A48-9B7C-6494F4A9520B@yahoo.com> References: <B898BF60-2872-4FFC-AD72-A32591BC7D20@yahoo.com> <76E8BF75-A8F5-4A48-9B7C-6494F4A9520B@yahoo.com>
next in thread | previous in thread | raw e-mail | index | archive | help
[I did some testing of other figures than testing for < 0x10.] On 2019-Mar-3, at 13:23, Mark Millard <marklmi at yahoo.com> wrote: > [So far the hack has been successful. Details given later > below.] >=20 > On 2019-Mar-2, at 21:20, Mark Millard <marklmi at yahoo.com> wrote: >=20 >> [This note goes in a different direction compared to my >> prior evidence report for overflows and the later activity >> that has been happening for it. This does *not* involve >> the patches associated with that report.] >>=20 >> I view the following as an evidence-gathering hack: >> showing the change in behavior with the code changes, >> not as directly what FreeBSD should do for powerpc64. >> In code for defined(__powerpc64__) && defined(AIM) >> I freely use knowledge of the PowerMac G5 context >> instead of attempting general code. >>=20 >> Also: the code is set up to record some information >> that I've been looking at via ddb. The recording is >> not part of what changes the behavior but I decided >> to show that code too. >>=20 >> It is preliminary, but, so far, the hack has avoided >> buf*daemon* threads and pmac_thermal getting stuck >> sleeping (or, at least, far less frequently). >>=20 >>=20 >> The tbr-value hack: >>=20 >> =46rom what I see the G5 various cores have each tbr running at the >> same rate but have some some offsets as far as the base time >> goes. cpu_mp_unleash does: >>=20 >> ap_awake =3D 1; >>=20 >> /* Provide our current DEC and TB values for APs */ >> ap_timebase =3D mftb() + 10; >> __asm __volatile("msync; isync"); >>=20 >> /* Let APs continue */ >> atomic_store_rel_int(&ap_letgo, 1); >>=20 >> platform_smp_timebase_sync(ap_timebase, 0); >>=20 >> and machdep_ap_bootstrap does: >>=20 >> /* >> * Set timebase as soon as possible to meet an implicit = rendezvous >> * from cpu_mp_unleash(), which sets ap_letgo and then = immediately >> * sets timebase. >> * >> * Note that this is instrinsically racy and is only relevant = on >> * platforms that do not support better mechanisms. >> */ >> platform_smp_timebase_sync(ap_timebase, 1); >>=20 >>=20 >> which attempts to set the tbrs appropriately. >>=20 >> But on small scales of differences the various tbr >> values from different cpus end up not well ordered >> relative to time, synchronizes with, and the like. >> Only large enough differences can well indicate an >> ordering of interest. >>=20 >> Note: tc->tc_get_timecount(tc) only provides the >> least signficant 32 bits of the tbr value. >> th->th_offset_count is also 32 bits and based on >> truncated tbr values. >>=20 >> So I made binuptime avoid finishing when it sees >> a small (<0x10) step backwards for a new >> tc->tc_get_timecount(tc) value vs. the existing >> th->th_offset_count value (values strongly tied >> to powerpc64 tbr values): >>=20 >> void >> binuptime(struct bintime *bt) >> { >> struct timehands *th; >> u_int gen; >>=20 >> struct bintime old_bt=3D *bt; // HACK!!! >> struct timecounter *tc; // HACK!!! >> u_int tim_cnt, tim_offset, tim_diff; // HACK!!! >> uint64_t freq, scale_factor, diff_scaled; // HACK!!! >>=20 >> u_int try_cnt=3D 0ull; // HACK!!! >>=20 >> do { >> do { // HACK!!! >> th =3D timehands; >> tc =3D th->th_counter; >> gen =3D atomic_load_acq_int(&th->th_generation); >> tim_cnt=3D tc->tc_get_timecount(tc); >> tim_offset=3D th->th_offset_count; >> } while (tim_cnt<tim_offset && = tim_offset-tim_cnt<0x10); >> *bt =3D th->th_offset; >> tim_diff=3D (tim_cnt - tim_offset) & = tc->tc_counter_mask; >> scale_factor=3D th->th_scale; >> diff_scaled=3D scale_factor * tim_diff; >> bintime_addx(bt, diff_scaled); >> freq=3D tc->tc_frequency; >> atomic_thread_fence_acq(); >> try_cnt++; >> } while (gen =3D=3D 0 || gen !=3D th->th_generation); >>=20 >> if (*(volatile uint64_t*)0xc000000000000020=3D=3D0u && = (0xffffffffffffffffull/scale_factor)<tim_diff) { // HACK!!! >> *(volatile uint64_t*)0xc000000000000020=3D = bttosbt(old_bt); >> *(volatile uint64_t*)0xc000000000000028=3D = bttosbt(*bt); >> *(volatile uint64_t*)0xc000000000000030=3D freq; >> *(volatile uint64_t*)0xc000000000000038=3D = scale_factor; >> *(volatile uint64_t*)0xc000000000000040=3D tim_offset; >> *(volatile uint64_t*)0xc000000000000048=3D tim_cnt; >> *(volatile uint64_t*)0xc000000000000050=3D tim_diff; >> *(volatile uint64_t*)0xc000000000000058=3D try_cnt; >> *(volatile uint64_t*)0xc000000000000060=3D diff_scaled; >> *(volatile uint64_t*)0xc000000000000068=3D = scale_factor*freq; >> __asm__ ("sync"); >> } else if (*(volatile uint64_t*)0xc0000000000000a0=3D=3D0u && = (0xffffffffffffffffull/scale_factor)<tim_diff) { // HACK!!! >> *(volatile uint64_t*)0xc0000000000000a0=3D = bttosbt(old_bt); >> *(volatile uint64_t*)0xc0000000000000a8=3D = bttosbt(*bt); >> *(volatile uint64_t*)0xc0000000000000b0=3D freq; >> *(volatile uint64_t*)0xc0000000000000b8=3D = scale_factor; >> *(volatile uint64_t*)0xc0000000000000c0=3D tim_offset; >> *(volatile uint64_t*)0xc0000000000000c8=3D tim_cnt; >> *(volatile uint64_t*)0xc0000000000000d0=3D tim_diff; >> *(volatile uint64_t*)0xc0000000000000d8=3D try_cnt; >> *(volatile uint64_t*)0xc0000000000000e0=3D diff_scaled; >> *(volatile uint64_t*)0xc0000000000000e8=3D = scale_factor*freq; >> __asm__ ("sync"); >> } >> } >> #else >> . . . >> #endif >>=20 >> So far as I can tell, the FreeBSD code is not designed to deal >> with small differences in tc->tc_get_timecount(tc) not actually >> indicating a useful < vs. =3D=3D vs. > ordering relation uniquely. >>=20 >> (I make no claim that the hack is a proper way to deal with >> such.) >=20 > I did a somewhat over 7 hours buildworld buildkernel on the > PowerMac G5. Overall the G5 has been up over 13 hours and > none of the buf*daemon* threads have gotten stuck sleeping. > Nor has pmac_thermal gotten stuck. Similarly for vnlru > and syncer: "top -HIStopid" still shows them all as > periodically active. >=20 > Previously for this usefdt=3D1 context (with the modern > VM_MAX_KERNEL_ADDRESS), going more than a few minutes > without at least one of those threads getting stuck > sleeping was rare on the G5 (powerpc64 example). >=20 > So this hack has managed to avoid finding sbinuptime() > in sleepq_timeout being less than the earlier (by call > structure/code sequencing) sbinuptime() in timercb that > lead to the sleepq_timeout callout being called in the > first place. >=20 > So in the sleepq_timeout callout's: >=20 > if (td->td_sleeptimo > sbinuptime() || td->td_sleeptimo =3D=3D = 0) { > /* > * The thread does not want a timeout (yet). > */ > } else . . . >=20 > td->td_sleeptimo > sbinuptime() ends up false now for small > enough original differences. >=20 > This case does not set up another timeout, it just leaves the > thread stuck sleeping, no longer doing periodic activities. >=20 > As stands what I did (presuming an appropriate definition > of "small differences in the problematical direction") should > leave this and other sbinuptime-using code with: >=20 > td->td_sleeptimo <=3D sbinuptime() >=20 > for what were originally "small" tbr value differences in the > problematical direction (in case other places require it in > some way). >=20 > If, instead, just sleepq_timeout's test could allow for > some slop in the ordering, it could be a cheaper hack then > looping in binuptime . >=20 > At this point I've no clue what a correct/efficient FreeBSD > design for allowing the sloppy match across tbr's for different > CPUs would be. Instead of 0x10 in "&& tim_offset-tim_cnt<0x10" I tried the each of following and they all failed: && tim_offset-tim_cnt<0x2 && tim_offset-tim_cnt<0x4 && tim_offset-tim_cnt<0x8 && tim_offset-tim_cnt<0xc 0x2, 0x4, and 0x8 failed for the first boot attempt, almost mediately having stuck-in-sleep threads. 0xc seemed to be working for the first boot (including a buildworld buildkernel that did not have to rebuild much). But the 2nd boot attempt had a stuck-in-sleep thread by the time I logged in. By contrast, for: && tim_offset-tim_cnt<0x10 I've not it fail so far, after many reboots, a full buildworld buildkernel, and running over 24 hours (that included the somewhat over 7 hours for build world buildkernel). But it might be that some boots would need a bigger figure. =3D=3D=3D Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?75A8BB07-3273-423E-9436-798395BC8640>