Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 18 Oct 2014 02:36:07 -0700
From:      Mark Millard <markmi@dsl-only.net>
To:        Nathan Whitehorn <nwhitehorn@freebsd.org>
Cc:        Justin Hibbits <chmeeedalf@gmail.com>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [%r3 corrupted too]
Message-ID:  <0CEC8978-E208-4F57-8481-DD9C321EF673@dsl-only.net>
In-Reply-To: <49920E63-CB4A-429C-AB3A-984075AE183D@dsl-only.net>
References:  <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net>	<543B3828.8070806@freebsd.org>	<9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>	<379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net>	<C614025F-6455-4929-8468-462E76079274@dsl-only.net>	<A2AB9066-259B-4B7D-BDDC-D03AE5827E13@dsl-only.net> <CAHSQbTCKi_MBhERh6d=kX2y-=%2B2OzqpGM%2BN=ZEShi-kX2r8NPQ@mail.gmail.com> <543D5ACD.20901@freebsd.org> <3D4A76B3-431A-4C94-8747-70369A8A1764@dsl-only.net> <0F85ACBD-F6D6-4ABA-B8FA-00C586A086DE@dsl-only.net> <EE3EC252-DC9E-4F95-977C-FAF9F364CA92@dsl-only.net> <49920E63-CB4A-429C-AB3A-984075AE183D@dsl-only.net>

next in thread | previous in thread | raw e-mail | index | archive | help
I did an experiment recording interrupt numbers in with the ofwcall =
history that I'd hacked in and have been using. I did this addition to =
the history in powerpc_interrupt. I managed to see a data storage =
interrupt recorded in the history from a forced failure somewhat after =
the normal ofwcall failure place (as evidence that the mechanism was =
good enough for early boot time, which is all I needed or was after).

But when the %r1!=3D%r3 detection happens in ofwcall after openfirmware =
and the information was reported no interrupt was listed in the history. =
I suppose openfirmware could temporarily save, replace, and later =
restore some interrupt handlers. That would mean I'd not catch the =
interrupt activity for such times. But it does not appear that =
powerpc_interrupt is involved if any interrupts are involved in the =
ofwcall failures. It would need to be something that avoids =
powerpc_interrupt. Otherwise I should have seen the report =
("!<hex-number>!") in the history that it listed.

=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Oct 17, 2014, at 5:25 AM, Mark Millard <markmi at dsl-only.net> =
wrote:

I noticed for the normal PowerMac G5 failing place that %r2 was 0xd23050 =
(generally in my recent builds: even when it works) and that the failing =
%r1 and %r3 were then 0xd23450. In other words: %r2+0x400. An accident =
or is that what the code is doing when it fails?

So I cleared the %r2 value in ofwcall (leaving the TOC value in the =
ofwstk place and using that to put the TOC back but having %r2=3Dzero =
between).

Failure then produced %0x400 for %r1 and %r3: again %r2+0x400. And %r2 =
stayed at zero until the ofwcall code put the TOC value back.

In other words: the failing values are from adding/oring 0x400 to %r2, =
whatever its value.

Trying %r2=3D0x400 instead of zero resulted in %r1 and %r3 having 0x800: =
So %r2+0x400 instead of %r2 | 0x400.



After this I adjusted ofwcall to clear %r0, %r2, %r4-%r12 before the =
openfirmware code executes and to leave them alone after. (Thus I =
changed the register usage.) This allows (probably) seeing what other =
volatile registers may have changed, especially when past different =
value reports are considered/compared with the results for this context. =
%r19-%r28 also set to zero. %r13-%r18,%r29 deliberately record things to =
report on failure (breakpoint use). (Prior tests showed %r14-%r19 as =
zero after so this tests if they are left alone vs. set to zero.) (The =
%r13-%r18 values also are generally put to some normal use as well.) =
%r30-%r31 are also put to various uses (not so simple to tell if =
preserved).

Results beyond %r1/%r3's corruption: %r4 gets the lr value used by =
openfirmware in returning to ofwcall: %r4 no longer zero. Past activity =
shows %r0 is actually set to zero if it is initially non-zero instead, =
which not visible for how I did this test. The other registers that were =
intended to be easily tested were preserved, including %r2.

[This only tests/classifies the failing openfirmware call. So the normal =
test here is of the first ofwcall once pmap_bootstrapped is true. It =
happens to be a peer request. Other requests could be different in the =
details for all I know. I've not hit any other ofwcall failure points so =
far.]



The above notes should apply even if the "Count of times ..." string is =
an accident of %r2's normal value at the time of the failure (0xd23050) =
--rather than a deliberate attempted addressing of the string's =
reference in the .got. [.got: global offset table used, for example, for =
position independent code access to things in memory.]


=3D=3D=3D
Mark Millard
markmi at dsl-only.net



More information on the odd %r1 and %r3 value...

The current and recent kernels that I've built get 0xd23450 for the =
corrupted values in %r1 and %r3 after openfirmware returns.

So I decided to look up what that might be...

objdump -h /boot/kernel/kernel shows  (.got: "global object table" or =
some such?) ...

Sections:
Idx Name          Size      VMA               LMA               File off =
 Algn
...
35 .got          0002f5c0  0000000000cfb248  0000000000cfb248  00bfb248  =
2**3
                CONTENTS, ALLOC, LOAD, DATA
36 .dynamic      000000d0  0000000000d2a808  0000000000d2a808  00c2a808  =
2**3
                CONTENTS, ALLOC, LOAD, DATA
...

and objdump -s -j .got /boot/kernel/kernel shows...

d23438 00000000 00bbfd48 00000000 00bbfd60  .......H.......`
d23448 00000000 00bbfd90 00000000 00bbfdb0  ................
d23458 00000000 00bbfdf0 00000000 00e17dd0  ..............}.

Then for 0xbbfdb0 from the above: objdump -h /boot/kernel/kernel =
shows...

6 .rodata.str1.8 000834a8  0000000000b4ddf8  0000000000b4ddf8  00a4ddf8  =
2**3
                CONTENTS, ALLOC, LOAD, READONLY, DATA
7 set_sysinit_set 00002538  0000000000bd12a0  0000000000bd12a0  00ad12a0 =
 2**3
                CONTENTS, ALLOC, LOAD, READONLY, DATA

and objdump -s -j .rodata.str1.8 /boot/kernel/kernel shows...

bbfda8 6f756e74 00000000 436f756e 74206f66  ount....Count of
bbfdb8 2074696d 65732074 68726f74 746c696e   times throttlin
bbfdc8 67206261 73656420 6f6e2072 65717565  g based on reque
bbfdd8 73742073 70616365 20686173 206f6363  st space has occ
bbfde8 75727265 64000000 25733a20 6d617374  urred...%s: mast

So 0xd23450 appears to possibly be a indirect reference to the string =
"Count of times throttling based on request space has occurred" or =
similar indirect content based on some offset from 0xd23450 indirectly =
getting to something else through the .got section. That string that I =
quoted is from /usr/src/sys/rpc/svc.c:

SVCPOOL*
svcpool_create(const char *name, struct sysctl_oid_list *sysctl_base)
{
...

              SYSCTL_ADD_INT(&pool->sp_sysctl, sysctl_base, OID_AUTO,
                  "request_space_throttle_count", CTLFLAG_RD,
                  &pool->sp_space_throttle_count, 0,
                  "Count of times throttling based on request space has =
occurred");
      }

      return pool;
}

(I have not done this lookup sequence across various FreeBSD updates and =
rebuilds that also get 0xd23450 in %r1 and %r3. Nor with FreeBSD builds =
that get some other corruption value. I do not know that the indirect =
lookup would have always gotten to that same string.)



=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 11:30 PM, Mark Millard <markmi at dsl-only.net> =
wrote:

I added including after-ofwcall %r1 and %r3 values to my ofwcall history =
buffer that I have ddb report when there is a problem.

This makes it apparent that %r3 has also been corrupted when %r1 has =
been.

I say that because the usual/normal %r3 value is 0 in what the code =
records and reports and I gather from the FreeBSD source code that the =
error indicator is -1. But all along I've been reporting %r3 values for =
the crashes that look more like 0xd18868 or other such. Never a 0 or -1 =
(0xfff...). And the %r3 crash values even move around when the ofwstk =
changes place from build to build.

(This "usual"/"error-check" mix suggests %r3 from openfirmware is a =
multi-bit representation of a Boolean value, with one's complemented =
alternative values and zero as one of the two bit patterns --when %r3 is =
not corrupted.)



I also got an example of a somewhat later than normal ofwcall failure: =
about 23 ofwcall's later than normal. It was not a peer request:

...
OF_finddevice+0x90
powermac_smp_get_bsp+0x20
platform_smp_get_bsp+0x78
cpu_mp_start+0x24
mp_startup+0x7c
mi_startup+0x10c
btext_0xbc

So pmap_bootstrapped had been true for a while by this point. Available =
memory had been displayed as of when this example stopped to report the =
%r1 change.




=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 3:18 PM, Mark Millard <markmi at dsl-only.net> =
wrote:

For openfirmware: is %r3 on return any more then a failed vs. not flag =
with a particular failed-value? Is there any way to validate that %r3 =
values for non-failure look reasonable vs. not looking reasonable? (For =
all I know %r3 could also be corrupt.)

I do not have any documentation for the PowerMac G5 openfirmware API =
that is in use or the associated ABI as far as I remember. I do not know =
if it strictly followed Darwin's/Mac OS X's ABI on PowerMac G5's vs. if =
there was some conversion going back and forth (as there is for FreeBSD, =
at least for powerpc64). For openfirmware I derive properties from what =
I see in FreeBSD's code (which has to be more explicit then when a =
compiler's code generation happens to match at least large parts of an =
ABI directly).

As I vaguely-remember Apple did not use the TOC for Darwin's/Mac OS X's =
ABI but FreeBSD does. If true I do not know what other differences that =
there might be (even ignoring the 32 bit vs. 64 bit issues for the =
kernels). But the point would be an existence proof of at least one =
difference. My understanding is that %r1 was as in FreeBSD.

I vaguely seem to remember that for Darwin/Mac OS X some register was =
volatile in leaf functions but non-volatile otherwise, or at least when =
nested functions were involved. And that brings to mind that the =
condition code sets in cr might have had a mix of volatile and =
non-volatile status despite being in one register? Did Darwin/Mac OS X =
have something special for register usage for Thread-Specific Storage? =
Position Independent Code? Indirect Calls? Frame Pointers?

I may have some Darwin/Mac OS X information around but I doubt that it =
is complete, especially for the 64-bit ABI or for privileged contexts. =
For the 32-bit ABI (non-priviledged) I likely have the information about =
the above possible ABI properties.

I assume that openfirmware avoids the FPU and other such --but I do not =
know. But it is privileged code.

Are there any known sources of at least some of the information for the =
the PowerMac G5 openfirmware ABI(s)? What are good references for the =
FreeBSD PowerPC ABI(s) (32 bit and 64 bit, privileged vs. not)?

[I cut off some of the older history.]

=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 10:18 AM, Nathan Whitehorn <nwhitehorn at =
freebsd.org> wrote:

r1 *must be* preserved by the standard and for anything to work. It's =
being corrupted somehow (Mark's comment about r3 is illuminating), and =
if r1 is being corrupted, you can't rely on anything. I suspect it might =
be an exception handling issue since it's non-deterministic, but it's =
hard to tell. It could also be triggered by the way we've set up the OF =
stack frame. It would be good to check if that makes sense.
-Nathan

On 10/14/14 09:53, Justin Hibbits wrote:
> Interesting.  Perhaps, instead of using %r1, and relying purely on the
> stack we use yet another (non-volatile) register to hold the MSR.
> Once we reload the MSR we can get back the saved registers, because
> the stack will be valid again.
>=20
> Nathan, thoughts?
>=20
> - Justin
>=20
> On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard <markmi at dsl-only.net> =
wrote:
>> Additional notes from additional experiments... (So far from one G5.)
>>=20
>> I got back trace, show registers, and my openfirmware-history list =
going for failure reporting based on explicit before vs. after tests of =
%r1 values. (Explicit breakpoint call for unequal, being careful to =
save/restore %r3 around the call.) I filled several registers with =
potentially interesting values that would otherwise have had zero as a =
value (%r15-%r19, although %r15 is redundant with %r6 currently).
>>=20
>> An interesting property resulted: every time %r1 had changed from =
having the before-value (stack pointer value) %r1 instead ended up with =
a value equal to what openfirmware put in %r3.
>>=20
>> And more then that: For builds with the same ofwstk position the %r3 =
value involved was fixed for the failures, for example when =
0x30400=3Dofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as =
0xd23450 for the failures. When 0x31400=3Dofwstk+0xfe0: %r3 and %r1 =
ended up for failure as 0xd24450 instead. Yep: offset by the same amount =
as ofwstk.
>>=20
>> And I got one example where the openfirmware %r1-value-change failure =
was instead much later in the boot, well after pmap_bootstrapped went =
true: It was just after the message lines...
>>=20
>> vgapci0: Boot video device ...
>> pcib1: <IBM CPC9X5 Hypertransport tunnel> ...
>>=20
>> with back trace (from OF_peer down):
>>=20
>> .OF_peer+0x8c
>> .cpcht_attach+0x884
>> .device_attach+0x3ac
>> .device_probe_and_attach+0x3c
>> .bus_generic_new_pass+0x12c
>> .bus_generic_new_pass+0x114
>> .bus_generic_new_pass+0x114 (yep: listed twice)
>> .bus_set_pass+0xc0
>> .root_bus_configure+0x14
>> .mi_startup+0x10c
>> btext+0xbc
>>=20
>> %r1 before: 0xc30400 ofwstk+0xfe0
>> %r1 after:  0xd23450
>> %r3 after:  0xd23450
>> FreeBSD msr to restore: 0x9000000000001032
>> ofmsr[0]  to restore:   0x1000000000003030
>>=20
>> The same after-openfirmware %r1 and %r3 values that had been showing =
up for the before-copyright examples of ofwcall failures.
>>=20
>> And note that it again was a peer request. All the ofwcall-tied =
boot-failures have been for peer requests as far as I remember.
>>=20
>> I later did some experiments where I had it report but not stop when =
the after-value was different from the before-value for %r1. When this =
happened for these types of tests it seem to be an isolated example: =
later calls normally have the stack pointer value still in %r1 after =
openfirmware returns. In more detail: At most one report was made for =
such a boot, the rest of the boot went fine. (Of course to get that far =
my hacked ofwcall code avoids using the after-openfirmware %r1 value to =
extract the 3 saved values to be restored from the bottom of ofwstk.)
>>=20
>>=20
>>=20
>> I was not successful at using "capture on" in DDB for this early-boot =
context. (It hangs things after the first report.) So I've been limited =
to one screen's report and only when I have it stop at the end of the =
report (so it does not scroll away). (No input to DDB available that =
early.) Otherwise the information just scrolls by rather quickly for =
reading any detail. Still it was useful to see that other reports were =
not produced after the first (when there was a first). (I can not claim =
multiple are impossible. It just appears at least infrequent.)
>>=20
>> I have not yet investigated making analogous powerpc/GENERIC code and =
builds.
>>=20
>> Nor have I dealt with having it report more detail about the peer =
requests that fail.
>>=20
>> Nor have I seen examples of what "not failing/%r1-unchanged" looks =
like overall.
>>=20
>> I still have no examples of unstable/incomplete initialization(s) or =
race condition(s) to explain why both ways can and do occur from one =
attempt to the next --or that difference peer requests in the sequence =
can be where the problem happens.








Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?0CEC8978-E208-4F57-8481-DD9C321EF673>