Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 14 Oct 2014 10:18:05 -0700
From:      Nathan Whitehorn <nwhitehorn@freebsd.org>
To:        Justin Hibbits <chmeeedalf@gmail.com>, Mark Millard <markmi@dsl-only.net>
Cc:        FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed]
Message-ID:  <543D5ACD.20901@freebsd.org>
In-Reply-To: <CAHSQbTCKi_MBhERh6d=kX2y-=%2B2OzqpGM%2BN=ZEShi-kX2r8NPQ@mail.gmail.com>
References:  <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net>	<543B3828.8070806@freebsd.org>	<9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>	<379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net>	<C614025F-6455-4929-8468-462E76079274@dsl-only.net>	<A2AB9066-259B-4B7D-BDDC-D03AE5827E13@dsl-only.net> <CAHSQbTCKi_MBhERh6d=kX2y-=%2B2OzqpGM%2BN=ZEShi-kX2r8NPQ@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
r1 *must be* preserved by the standard and for anything to work. It's 
being corrupted somehow (Mark's comment about r3 is illuminating), and 
if r1 is being corrupted, you can't rely on anything. I suspect it might 
be an exception handling issue since it's non-deterministic, but it's 
hard to tell. It could also be triggered by the way we've set up the OF 
stack frame. It would be good to check if that makes sense.
-Nathan

On 10/14/14 09:53, Justin Hibbits wrote:
> Interesting.  Perhaps, instead of using %r1, and relying purely on the
> stack we use yet another (non-volatile) register to hold the MSR.
> Once we reload the MSR we can get back the saved registers, because
> the stack will be valid again.
>
> Nathan, thoughts?
>
> - Justin
>
> On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard <markmi@dsl-only.net> wrote:
>> Additional notes from additional experiments... (So far from one G5.)
>>
>> I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently).
>>
>> An interesting property resulted: every time %r1 had changed from having the before-value (stack pointer value) %r1 instead ended up with a value equal to what openfirmware put in %r3.
>>
>> And more then that: For builds with the same ofwstk position the %r3 value involved was fixed for the failures, for example when 0x30400=ofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as 0xd23450 for the failures. When 0x31400=ofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd24450 instead. Yep: offset by the same amount as ofwstk.
>>
>> And I got one example where the openfirmware %r1-value-change failure was instead much later in the boot, well after pmap_bootstrapped went true: It was just after the message lines...
>>
>> vgapci0: Boot video device ...
>> pcib1: <IBM CPC9X5 Hypertransport tunnel> ...
>>
>> with back trace (from OF_peer down):
>>
>> .OF_peer+0x8c
>> .cpcht_attach+0x884
>> .device_attach+0x3ac
>> .device_probe_and_attach+0x3c
>> .bus_generic_new_pass+0x12c
>> .bus_generic_new_pass+0x114
>> .bus_generic_new_pass+0x114 (yep: listed twice)
>> .bus_set_pass+0xc0
>> .root_bus_configure+0x14
>> .mi_startup+0x10c
>> btext+0xbc
>>
>> %r1 before: 0xc30400 ofwstk+0xfe0
>> %r1 after:  0xd23450
>> %r3 after:  0xd23450
>> FreeBSD msr to restore: 0x9000000000001032
>> ofmsr[0]  to restore:   0x1000000000003030
>>
>> The same after-openfirmware %r1 and %r3 values that had been showing up for the before-copyright examples of ofwcall failures.
>>
>> And note that it again was a peer request. All the ofwcall-tied boot-failures have been for peer requests as far as I remember.
>>
>> I later did some experiments where I had it report but not stop when the after-value was different from the before-value for %r1. When this happened for these types of tests it seem to be an isolated example: later calls normally have the stack pointer value still in %r1 after openfirmware returns. In more detail: At most one report was made for such a boot, the rest of the boot went fine. (Of course to get that far my hacked ofwcall code avoids using the after-openfirmware %r1 value to extract the 3 saved values to be restored from the bottom of ofwstk.)
>>
>>
>>
>> I was not successful at using "capture on" in DDB for this early-boot context. (It hangs things after the first report.) So I've been limited to one screen's report and only when I have it stop at the end of the report (so it does not scroll away). (No input to DDB available that early.) Otherwise the information just scrolls by rather quickly for reading any detail. Still it was useful to see that other reports were not produced after the first (when there was a first). (I can not claim multiple are impossible. It just appears at least infrequent.)
>>
>> I have not yet investigated making analogous powerpc/GENERIC code and builds.
>>
>> Nor have I dealt with having it report more detail about the peer requests that fail.
>>
>> Nor have I seen examples of what "not failing/%r1-unchanged" looks like overall.
>>
>> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next --or that difference peer requests in the sequence can be where the problem happens.
>>
>> ===
>> Mark Millard
>> markmi@dsl-only.net
>>
>> On Oct 13, 2014, at 3:39 AM, Mark Millard <markmi@dsl-only.net> wrote:
>>
>> While I do not yet have "show register" or other information displayed when %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have now had two cases happen for the same, unmodified boot SSD in the same PowerMac G5:
>>
>> A) Boots without failure or finding any changes to %r1 for before vs. after openfirmware calls.
>>
>> B) I had it stop the boot after the code finds that %r1 had instead changed. The usual before-copyright-notice sort of timing for where it stopped, after pmap_bootstrapped became true. (I need "show register" or other such to have more detail.)
>>
>>
>> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next. Both both do.
>>
>>
>>
>> ===
>> Mark Millard
>> markmi at dsl-only.net
>>
>> On Oct 12, 2014, at 11:34 PM, Mark Millard <markmi at dsl-only.net> wrote:
>>
>> Fixing stupid typos that reverse what I should have said: removing the !'s in front of pmap_bootstrapped (from a copy/paste sequence error)...
>>
>> Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the G5's except for when pmap_bootstrapped in (variants of) powerpc64/GENERIC64. (Only covers when I had enough debug context in place to know that much. Similarly for other notes.) These ofwcall related failures are the vast majority of the boot failures that I've seen.
>>
>> ...
>>
>> The only other ofwcall failure that I've seen happened only once and was where prior ofwcall's with pmap_bootstrapped had already happened (as reported by the ofwcall history list in my debug/DDB hacks). But this was before the %r1 before and after code was in place: that is a recent addition to my investigation.
>>
>>
>>
>>
>> ===
>> Mark Millard
>> markmi at dsl-only.net
>>
>> On Oct 12, 2014, at 11:20 PM, Mark Millard <markmi at dsl-only.net> wrote:
>>
>> Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the G5's except for when !pmap_bootstrapped in (variants of) powerpc64/GENERIC64. (Only covers when I had enough debug context in place to know that much. Similarly for other notes.) These ofwcall related failures are the vast majority of the boot failures that I've seen.
>>
>> A nice thing about what I've found is that I can now figure out how to use a comparison of the before and after stack pointers and to force DDB's involvement if and only if they are not equal.
>>
>> That would also report %r1 differences that happen to not to produce failures (if there are such). (There has to be some explanation for why sometimes it works and sometimes it does not, say, unstable initializations, race conditions, or something meeting both criteria.)
>>
>> Which in turn makes the general technique appropriate to powerpc/GENERIC contexts as well. (Coding details may vary.)
>>
>> I can not promise how quickly I'll get to any specific part of this. But I should gradually progress on it.
>>
>>
>> I should have mentioned some things about the kind of evidence I have vs. do not (yet) have:
>>
>>
>> A) The property defining the only context where I have observed the %r1 issue is as noted above.
>>
>> In all but one of the ofwcall failure cases it was the first ofwcall in that !pmap_bootstrapped context that had the problem.
>>
>> The only other ofwcall failure that I've seen happened only once and was where prior ofwcall's with !pmap_bootstrapped had already happened (as reported by the ofwcall history list in my debug/DDB hacks). But this was before the %r1 before and after code was in place: that is a recent addition to my investigation.
>>
>>
>> B) While I've not been building debug code variants for powerpc/GENERIC I've never seen the powerpc/GENERIC code fail to boot the G5's. And I have spent some sessions doing reboot after reboot to see if I'd get some failures (in addition to some other more normal uses).
>>
>>
>> C) So far I've only been looking at "show registers" when it gets a boot-time exception that a DDB processes with the automatic script: the crashes. I do not (yet) have any observations of what things look like during such points for successful boots. (I'm figuring out ways to get and see the evidence spanning early boot time as I go.) And so I've only been looking with such special debug code where I knew I could reproduce the failures (3 PowerMac G5's when using variants of powerpc64/GENERIC64.)
>>
>> In fact if the hack that I put in place completely masks the problem then I currently would not ever observe any problem-specific information from the successful boots. Thus the before/after comparison would seem to be next for my investigation.
>>
>>
>>
>> ===
>> Mark Millard
>> markmi at dsl-only.net
>>
>> On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn <nwhitehorn at freebsd.org> wrote:
>>
>> Interesting. If OF is changing the value of r1, there must be some problem with the ABI thunk the 64-bit kernel uses or a problem with trap handlers. This is obviously not systematic if loader and the kernel up to that point have no problems. Does a 32-bit kernel have the same problems on your hardware? That would test whether it is the ABI translation.
>> -Nathan
>>
>> On 10/12/14 17:53, Mark Millard wrote:
>>> NOTE: I make no claim that any of the below hacks for ofwcall are appropriate code for FreeBSD's general context. I only claim that it seems to make the specific PowerMac G5 problem go away, gives solid evidence for at least some of what is going on (justifying the investigative and testing hacks) and so gives evidence for an appropriate, more general FreeBSD solution.
>>>
>>>
>>> The big issue is: The PowerMac G5 openfirmware does not always preserve the %r1 value (the stack pointer contents) that it is initially given, at least when the early "before copyright" crash problem is happening but possibly other times as well.
>>>
>>> I had the following investigative code in ofwcall, snapshotting the value of %r1 before and after openfirmware's code is used:
>>>
>>>        lis     %r4,openfirmware_entry@ha
>>>        ld      %r4,openfirmware_entry@l(%r4)
>>> ...
>>>        mr   %r17,%r1 /* ADDED HACK TO RECORD %r1 before...
>>>        /* Finally, branch to OF */
>>>        mtctr   %r4
>>>        bctrl
>>>        mr   %r18,%r1 /* ADDED HACK TO RECORD %r1 after...
>>>
>>> then the DDB show registers from the crash that I'd hacked in would show these values instead of the zeros they otherwise always display, in addition to what the show registers has always shown for r1.
>>>
>>> The results were like the following example for every such crash:
>>>
>>> r17 = 0xC31400 ofwstk+0xfe0
>>> r18 = 0xd24450
>>> r1  = 0xd24450
>>>
>>> Because of that %r1 value the later code such as:
>>>
>>>        /* Reload stack pointer and MSR from the OFW stack */
>>>        ld      %r6,24(%r1)
>>>        ld      %r2,16(%r1)
>>>        ld      %r1,8(%r1)
>>>
>>> gets garbage-in/garbage-out results, including %r6 being values like 0xbc0568 instead of the value saved msr to later be restored: 0x9000000000001032.
>>>
>>> So one PowerMac G5 specific hack involved in my working-boots context is to force the original %r1 value to be used (based on %r17 being a before-call copy, similar to the above):
>>>
>>>        ld      %r6,24(%r17)
>>>        ld      %r2,16(%r17)
>>>        ld      %r1,8(%r17)
>>>
>>> But the exception report from DDB has had problems in part because sprg0 still has the openfirmware value at the time even though the exception is after openfirmware returned (the wrong value results in the register for GET_CPUINFO(<register>). So I hacked in a before-exception restore of FreeBSD's sprg0 inside ofwcall to make the exception handler code have that much FreeBSD context available at the exception (if it occurs, anyway). This was really just to help with information gathering, although I've not tested only having the %r17 changes.
>>>
>>> So overall PowerMac G5 specific hacking the ofwcall code to have instead (based on what was reported above):
>>>
>>> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S
>>> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S
>>> ===================================================================
>>> --- /usr/src/sys/powerpc/ofw/ofwcall64.S      (revision 272558)
>>> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S      (working copy)
>>> @@ -52,6 +52,12 @@
>>> GLOBAL(rtas_entry)
>>>        .llong  0                       /* RTAS entry point */
>>> + /* HACK: part of having sprg0 in place for trap */
>>> +ofwsprg0save:
>>> +     .space  8 /* sizeof(register_t) */
>>> +GLOBAL(ofw_sprg0_save)
>>> +     .llong  0
>>> +
>>> /*
>>> * Open Firmware Real-mode Entry Point. This is a huge pain.
>>> */
>>> @@ -97,6 +103,10 @@
>>>        lis     %r4,openfirmware_entry@ha
>>>        ld      %r4,openfirmware_entry@l(%r4)
>>> +     /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
>>> +     lis     %r14,ofw_sprg0_save@ha
>>> +     ld      %r14,ofw_sprg0_save@l(%r14)
>>> +
>>>        /*
>>>         * Set the MSR to the OF value. This has the side effect of disabling
>>>         * exceptions, which is important for the next few steps.
>>> @@ -123,14 +133,27 @@
>>>        stw     %r5,4(%r1)
>>>        stw     %r5,0(%r1)
>>> +     /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
>>> +     lis     %r6,ofwsprg0save@ha
>>> +     std     %r14,ofwsprg0save@l(%r6)
>>> +
>>> +     /* HACK: part of IGNORING the later %r1 value from openfirmware */
>>> +     mr      %r17,%r1
>>> +
>>>        /* Finally, branch to OF */
>>>        mtctr   %r4
>>>        bctrl
>>> +     /* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
>>> +     lis     %r6,ofwsprg0save@ha
>>> +     ld      %r6,ofwsprg0save@l(%r6)
>>> +     mtsprg0 %r6
>>> +
>>>        /* Reload stack pointer and MSR from the OFW stack */
>>> -     ld      %r6,24(%r1)
>>> -     ld      %r2,16(%r1)
>>> -     ld      %r1,8(%r1)
>>> +     /* HACKED to ignore the %r1 value that results from openfirmware's call */
>>> +     ld      %r6,24(%r17)
>>> +     ld      %r2,16(%r17)
>>> +     ld      %r1,8(%r17)
>>>        /* Now set the real MSR */
>>>        mtmsrd  %r6
>>>
>>> This results in no crashes happening so far in my testing, not even the 16 GByte RAM machine that crashed so much.
>>>
>>> NOTE: owf_machdep.c was changed to use "extern register_t ofw_sprg0_save;" to match the above.
>>>
>>> I still have ps3 disabled in GENERIC64 so that I can also have the sc options in GENERIC64. And the DDB and GDB options are still present as well.
>>>
>>> And I still have my hack to force a DDB script that does show registers and shows the ofwcall history information that I hacked in, even for the very early crashes before input is possible. Not that I'm now getting such executions of the script. (A before possible-crash backtrace is also shown by the added code. That still shows up.)
>>>
>>> I'll probably next switch to reverting the DDB related code changes and to removing the DDB/GDB options and see how that goes.
>>>
>>>
>>> ===
>>> Mark Millard
>>> markmi at dsl-only.net
>>>
>>>
>>
>>
>>
>>




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?543D5ACD.20901>