Date:      Tue, 14 Oct 2014 09:53:28 -0700
From:      Justin Hibbits <chmeeedalf@gmail.com>
To:        Mark Millard <markmi@dsl-only.net>
Cc:        FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed]
Message-ID:  <CAHSQbTCKi_MBhERh6d=kX2y-=%2B2OzqpGM%2BN=ZEShi-kX2r8NPQ@mail.gmail.com>
In-Reply-To: <A2AB9066-259B-4B7D-BDDC-D03AE5827E13@dsl-only.net>
References:  <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> <C614025F-6455-4929-8468-462E76079274@dsl-only.net> <A2AB9066-259B-4B7D-BDDC-D03AE5827E13@dsl-only.net>

Interesting.  Perhaps, instead of using %r1 and relying purely on the
stack, we could use yet another (non-volatile) register to hold the MSR.
Once we reload the MSR we can get back the saved registers, because
the stack will be valid again.

Nathan, thoughts?

- Justin

On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard <markmi@dsl-only.net> wrote:
> Additional notes from additional experiments... (So far from one G5.)
>
> I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently).
>
> An interesting property resulted: every time %r1 had changed from having the before-value (stack pointer value), %r1 instead ended up with a value equal to what openfirmware put in %r3.
>
> And more than that: For builds with the same ofwstk position the %r3 value involved was fixed for the failures. For example, when 0x30400=ofwstk+0xfe0 (%r1 before) was reported, %r3 and %r1 ended up as 0xd23450 for the failures. When 0x31400=ofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd24450 instead. Yep: offset by the same amount as ofwstk.
>
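
A quick way to double-check the "offset by the same amount" observation is to subtract the two sets of numbers. This is only a sketch using the four values quoted above; the variable and function names are made up for illustration.

```python
# Values copied from the report above; the two builds differ only in
# where ofwstk landed.
before_a, after_a = 0x30400, 0xd23450   # %r1 before the call, %r1/%r3 after
before_b, after_b = 0x31400, 0xd24450

def shift(first: int, second: int) -> int:
    """Return how far a value moved between the two builds."""
    return second - first

shift_before = shift(before_a, before_b)   # movement of ofwstk+0xfe0
shift_after = shift(after_a, after_b)      # movement of the bad %r1/%r3 value

print(hex(shift_before), hex(shift_after))
```

Both differences come out as 0x1000, so the value left in %r1 and %r3 tracks the position of ofwstk rather than being an unrelated constant.
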
> And I got one example where the openfirmware %r1-value-change failure was instead much later in the boot, well after pmap_bootstrapped went true: It was just after the message lines...
>
> vgapci0: Boot video device ...
> pcib1: <IBM CPC9X5 Hypertransport tunnel> ...
>
> with back trace (from OF_peer down):
>
> .OF_peer+0x8c
> .cpcht_attach+0x884
> .device_attach+0x3ac
> .device_probe_and_attach+0x3c
> .bus_generic_new_pass+0x12c
> .bus_generic_new_pass+0x114
> .bus_generic_new_pass+0x114 (yep: listed twice)
> .bus_set_pass+0xc0
> .root_bus_configure+0x14
> .mi_startup+0x10c
> btext+0xbc
>
> %r1 before: 0xc30400 ofwstk+0xfe0
> %r1 after:  0xd23450
> %r3 after:  0xd23450
> FreeBSD msr to restore: 0x9000000000001032
> ofmsr[0]  to restore:   0x1000000000003030
>
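
To make the two MSR values above easier to compare, here is a small decoding sketch. The masks are the usual 64-bit PowerPC MSR bit assignments (the same values FreeBSD's PSL_* constants use, as far as I know); the table is deliberately partial and the helper names are made up for illustration.

```python
# Partial table of 64-bit PowerPC MSR bits (assumed standard assignments).
MSR_BITS = {
    "SF": 0x8000000000000000,  # 64-bit mode
    "HV": 0x1000000000000000,  # hypervisor state
    "EE": 0x8000,              # external interrupts enabled
    "PR": 0x4000,              # problem (user) state
    "FP": 0x2000,              # floating point available
    "ME": 0x1000,              # machine check enabled
    "IR": 0x20,                # instruction address translation
    "DR": 0x10,                # data address translation
    "RI": 0x2,                 # recoverable interrupt
}

def decode_msr(value: int) -> list:
    """Return the names of the known bits set in an MSR value."""
    return [name for name, mask in MSR_BITS.items() if value & mask]

def unknown_bits(value: int) -> int:
    """Return any set bits that the partial table does not cover."""
    known = 0
    for mask in MSR_BITS.values():
        known |= mask
    return value & ~known

freebsd_msr = 0x9000000000001032   # "FreeBSD msr to restore"
of_msr = 0x1000000000003030        # "ofmsr[0] to restore"

print(decode_msr(freebsd_msr))
print(decode_msr(of_msr))
```

Under those assumptions the FreeBSD value decodes to SF, HV, ME, IR, DR, RI and the Open Firmware value to HV, FP, ME, IR, DR. The most visible difference is SF: the firmware value has 64-bit mode switched off.
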
> The same after-openfirmware %r1 and %r3 values that had been showing up for the before-copyright examples of ofwcall failures.
>
> And note that it again was a peer request. All the ofwcall-tied boot-failures have been for peer requests as far as I remember.
>
> I later did some experiments where I had it report but not stop when the after-value was different from the before-value for %r1. When this happened for these types of tests it seemed to be an isolated example: later calls normally have the stack pointer value still in %r1 after openfirmware returns. In more detail: At most one report was made for such a boot; the rest of the boot went fine. (Of course to get that far my hacked ofwcall code avoids using the after-openfirmware %r1 value to extract the 3 saved values to be restored from the bottom of ofwstk.)
>
>
>
> I was not successful at using "capture on" in DDB for this early-boot context. (It hangs things after the first report.) So I've been limited to one screen's report and only when I have it stop at the end of the report (so it does not scroll away). (No input to DDB available that early.) Otherwise the information just scrolls by rather quickly for reading any detail. Still it was useful to see that other reports were not produced after the first (when there was a first). (I can not claim multiple are impossible. It just appears at least infrequent.)
>
> I have not yet investigated making analogous powerpc/GENERIC code and builds.
>
> Nor have I dealt with having it report more detail about the peer requests that fail.
>
> Nor have I seen examples of what "not failing/%r1-unchanged" looks like overall.
>
> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next, or why different peer requests in the sequence can be where the problem happens.
>
> ===
> Mark Millard
> markmi@dsl-only.net
>
> On Oct 13, 2014, at 3:39 AM, Mark Millard <markmi@dsl-only.net> wrote:
>
> While I do not yet have "show register" or other information displayed when %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have now had two cases happen for the same, unmodified boot SSD in the same PowerMac G5:
>
> A) Boots without failure or finding any changes to %r1 for before vs. after openfirmware calls.
>
> B) I had it stop the boot after the code finds that %r1 had instead changed. The usual before-copyright-notice sort of timing for where it stopped, after pmap_bootstrapped became true. (I need "show register" or other such to have more detail.)
>
>
> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next. But both do.
>
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
> On Oct 12, 2014, at 11:34 PM, Mark Millard <markmi at dsl-only.net> wrote:
>
> Fixing stupid typos that reverse what I should have said: removing the !'s in front of pmap_bootstrapped (from a copy/paste sequence error)...
>
> Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the G5's except for when pmap_bootstrapped in (variants of) powerpc64/GENERIC64. (Only covers when I had enough debug context in place to know that much. Similarly for other notes.) These ofwcall related failures are the vast majority of the boot failures that I've seen.
>
> ...
>
> The only other ofwcall failure that I've seen happened only once and was where prior ofwcall's with pmap_bootstrapped had already happened (as reported by the ofwcall history list in my debug/DDB hacks). But this was before the %r1 before and after code was in place: that is a recent addition to my investigation.
>
>
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
> On Oct 12, 2014, at 11:20 PM, Mark Millard <markmi at dsl-only.net> wrote:
>
> Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the G5's except for when !pmap_bootstrapped in (variants of) powerpc64/GENERIC64. (Only covers when I had enough debug context in place to know that much. Similarly for other notes.) These ofwcall related failures are the vast majority of the boot failures that I've seen.
>
> A nice thing about what I've found is that I can now figure out how to use a comparison of the before and after stack pointers and to force DDB's involvement if and only if they are not equal.
>
> That would also report %r1 differences that happen not to produce failures (if there are such). (There has to be some explanation for why sometimes it works and sometimes it does not, say, unstable initializations, race conditions, or something meeting both criteria.)
>
> Which in turn makes the general technique appropriate to powerpc/GENERIC contexts as well. (Coding details may vary.)
>
> I can not promise how quickly I'll get to any specific part of this. But I should gradually progress on it.
>
>
> I should have mentioned some things about the kind of evidence I have vs. do not (yet) have:
>
>
> A) The property defining the only context where I have observed the %r1 issue is as noted above.
>
> In all but one of the ofwcall failure cases it was the first ofwcall in that !pmap_bootstrapped context that had the problem.
>
> The only other ofwcall failure that I've seen happened only once and was where prior ofwcall's with !pmap_bootstrapped had already happened (as reported by the ofwcall history list in my debug/DDB hacks). But this was before the %r1 before and after code was in place: that is a recent addition to my investigation.
>
>
> B) While I've not been building debug code variants for powerpc/GENERIC I've never seen the powerpc/GENERIC code fail to boot the G5's. And I have spent some sessions doing reboot after reboot to see if I'd get some failures (in addition to some other more normal uses).
>
>
> C) So far I've only been looking at "show registers" when it gets a boot-time exception that DDB processes with the automatic script: the crashes. I do not (yet) have any observations of what things look like during such points for successful boots. (I'm figuring out ways to get and see the evidence spanning early boot time as I go.) And so I've only been looking with such special debug code where I knew I could reproduce the failures (3 PowerMac G5's when using variants of powerpc64/GENERIC64).
>
> In fact if the hack that I put in place completely masks the problem then I currently would not ever observe any problem-specific information from the successful boots. Thus the before/after comparison would seem to be next for my investigation.
>
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
> On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn <nwhitehorn at freebsd.org> wrote:
>
> Interesting. If OF is changing the value of r1, there must be some problem with the ABI thunk the 64-bit kernel uses or a problem with trap handlers. This is obviously not systematic if loader and the kernel up to that point have no problems. Does a 32-bit kernel have the same problems on your hardware? That would test whether it is the ABI translation.
> -Nathan
>
> On 10/12/14 17:53, Mark Millard wrote:
>> NOTE: I make no claim that any of the below hacks for ofwcall are appropriate code for FreeBSD's general context. I only claim that it seems to make the specific PowerMac G5 problem go away, gives solid evidence for at least some of what is going on (justifying the investigative and testing hacks) and so gives evidence for an appropriate, more general FreeBSD solution.
>>
>>
>> The big issue is: The PowerMac G5 openfirmware does not always preserve the %r1 value (the stack pointer contents) that it is initially given, at least when the early "before copyright" crash problem is happening but possibly other times as well.
>>
>> I had the following investigative code in ofwcall, snapshotting the value of %r1 before and after openfirmware's code is used:
>>
>>       lis     %r4,openfirmware_entry@ha
>>       ld      %r4,openfirmware_entry@l(%r4)
>> ...
>>       mr      %r17,%r1 /* ADDED HACK TO RECORD %r1 before... */
>>       /* Finally, branch to OF */
>>       mtctr   %r4
>>       bctrl
>>       mr      %r18,%r1 /* ADDED HACK TO RECORD %r1 after... */
>>
>> then the DDB show registers from the crash that I'd hacked in would show these values instead of the zeros they otherwise always display, in addition to what the show registers has always shown for r1.
>>
>> The results were like the following example for every such crash:
>>
>> r17 = 0xC31400 ofwstk+0xfe0
>> r18 = 0xd24450
>> r1  = 0xd24450
>>
>> Because of that %r1 value the later code such as:
>>
>>       /* Reload stack pointer and MSR from the OFW stack */
>>       ld      %r6,24(%r1)
>>       ld      %r2,16(%r1)
>>       ld      %r1,8(%r1)
>>
>> gets garbage-in/garbage-out results, including %r6 being values like 0xbc0568 instead of the saved msr value that is to be restored later: 0x9000000000001032.
>>
>> So one PowerMac G5 specific hack involved in my working-boots context is to force the original %r1 value to be used (based on %r17 being a before-call copy, similar to the above):
>>
>>       ld      %r6,24(%r17)
>>       ld      %r2,16(%r17)
>>       ld      %r1,8(%r17)
>>
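
The difference between the two reload sequences can be modelled with a toy memory map. This is only a sketch: the two addresses, the saved MSR and the 0xbc0568 garbage value are taken from the report above, while SAVED_R2, SAVED_R1 and the function name are made-up placeholders, and reads of unmapped addresses are simply modelled as 0.

```python
# Addresses and the saved MSR come from the report above; SAVED_R2 and
# SAVED_R1 are hypothetical, since the message does not list them.
FRAME = 0xC31400          # %r1 before the call (ofwstk+0xfe0), copied to %r17
CLOBBERED = 0xD24450      # %r1 as Open Firmware returned it
SAVED_MSR = 0x9000000000001032
SAVED_R2 = 0x1111         # hypothetical TOC pointer
SAVED_R1 = 0x2222         # hypothetical kernel stack pointer

memory = {
    FRAME + 24: SAVED_MSR,
    FRAME + 16: SAVED_R2,
    FRAME + 8: SAVED_R1,
    CLOBBERED + 24: 0xBC0568,   # the kind of garbage the report saw in %r6
}

def reload_saved_state(mem, base):
    """Model the three ld instructions: offsets 24, 16 and 8 from base."""
    return {
        "r6": mem.get(base + 24, 0),   # value later handed to mtmsrd
        "r2": mem.get(base + 16, 0),
        "r1": mem.get(base + 8, 0),
    }

via_r1 = reload_saved_state(memory, CLOBBERED)   # original code path
via_r17 = reload_saved_state(memory, FRAME)      # hacked code path

print(hex(via_r1["r6"]), hex(via_r17["r6"]))
```

In this model only the load through the before-call copy gets back the saved MSR; the load through the returned %r1 picks up whatever happens to sit at that address.
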
>> But the exception report from DDB has had problems in part because sprg0 still has the openfirmware value at the time even though the exception is after openfirmware returned (the wrong value results in the register for GET_CPUINFO(<register>)). So I hacked in a before-exception restore of FreeBSD's sprg0 inside ofwcall to make the exception handler code have that much FreeBSD context available at the exception (if it occurs, anyway). This was really just to help with information gathering, although I've not tested only having the %r17 changes.
>>
>> So overall PowerMac G5 specific hacking the ofwcall code to have instead (based on what was reported above):
>>
>> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S
>> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S
>> ===================================================================
>> --- /usr/src/sys/powerpc/ofw/ofwcall64.S      (revision 272558)
>> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S      (working copy)
>> @@ -52,6 +52,12 @@
>> GLOBAL(rtas_entry)
>>       .llong  0                       /* RTAS entry point */
>> + /* HACK: part of having sprg0 in place for trap */
>> +ofwsprg0save:
>> +     .space  8 /* sizeof(register_t) */
>> +GLOBAL(ofw_sprg0_save)
>> +     .llong  0
>> +
>> /*
>> * Open Firmware Real-mode Entry Point. This is a huge pain.
>> */
>> @@ -97,6 +103,10 @@
>>       lis     %r4,openfirmware_entry@ha
>>       ld      %r4,openfirmware_entry@l(%r4)
>> +     /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
>> +     lis     %r14,ofw_sprg0_save@ha
>> +     ld      %r14,ofw_sprg0_save@l(%r14)
>> +
>>       /*
>>        * Set the MSR to the OF value. This has the side effect of disabling
>>        * exceptions, which is important for the next few steps.
>> @@ -123,14 +133,27 @@
>>       stw     %r5,4(%r1)
>>       stw     %r5,0(%r1)
>> +     /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
>> +     lis     %r6,ofwsprg0save@ha
>> +     std     %r14,ofwsprg0save@l(%r6)
>> +
>> +     /* HACK: part of IGNORING the later %r1 value from openfirmware */
>> +     mr      %r17,%r1
>> +
>>       /* Finally, branch to OF */
>>       mtctr   %r4
>>       bctrl
>> +     /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
>> +     lis     %r6,ofwsprg0save@ha
>> +     ld      %r6,ofwsprg0save@l(%r6)
>> +     mtsprg0 %r6
>> +
>>       /* Reload stack pointer and MSR from the OFW stack */
>> -     ld      %r6,24(%r1)
>> -     ld      %r2,16(%r1)
>> -     ld      %r1,8(%r1)
>> +     /* HACKED to ignore the %r1 value that results from openfirmware's call */
>> +     ld      %r6,24(%r17)
>> +     ld      %r2,16(%r17)
>> +     ld      %r1,8(%r17)
>>       /* Now set the real MSR */
>>       mtmsrd  %r6
>>
>> This results in no crashes happening so far in my testing, not even the 16 GByte RAM machine that crashed so much.
>>
>> NOTE: ofw_machdep.c was changed to use "extern register_t ofw_sprg0_save;" to match the above.
>>
>> I still have ps3 disabled in GENERIC64 so that I can also have the sc options in GENERIC64. And the DDB and GDB options are still present as well.
>>
>> And I still have my hack to force a DDB script that does show registers and shows the ofwcall history information that I hacked in, even for the very early crashes before input is possible. Not that I'm now getting such executions of the script. (A before possible-crash backtrace is also shown by the added code. That still shows up.)
>>
>> I'll probably next switch to reverting the DDB related code changes and to removing the DDB/GDB options and see how that goes.
>>
>>
>> ===
>> Mark Millard
>> markmi at dsl-only.net
>>
>>
>
>
>
>
>


