Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 14 Oct 2014 09:14:10 -0700
From:      Mark Millard <markmi@dsl-only.net>
To:        Nathan Whitehorn <nwhitehorn@freebsd.org>
Cc:        Justin Hibbits <chmeeedalf@gmail.com>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed]
Message-ID:  <A2AB9066-259B-4B7D-BDDC-D03AE5827E13@dsl-only.net>
In-Reply-To: <C614025F-6455-4929-8468-462E76079274@dsl-only.net>
References:  <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> <C614025F-6455-4929-8468-462E76079274@dsl-only.net>

next in thread | previous in thread | raw e-mail | index | archive | help
Additional notes from additional experiments... (So far from one G5.)

I got back trace, show registers, and my openfirmware-history list going =
for failure reporting based on explicit before vs. after tests of %r1 =
values. (Explicit breakpoint call for unequal, being careful to =
save/restore %r3 around the call.) I filled several registers with =
potentially interesting values that would otherwise have had zero as a =
value (%r15-%r19, although %r15 is redundant with %r6 currently).

An interesting property resulted: every time %r1 had changed from having =
the before-value (stack pointer value) %r1 instead ended up with a value =
equal to what openfirmware put in %r3.

And more then that: For builds with the same ofwstk position the %r3 =
value involved was fixed for the failures, for example when =
0x30400=3Dofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as =
0xd23450 for the failures. When 0x31400=3Dofwstk+0xfe0: %r3 and %r1 =
ended up for failure as 0xd24450 instead. Yep: offset by the same amount =
as ofwstk.

And I got one example where the openfirmware %r1-value-change failure =
was instead much later in the boot, well after pmap_bootstrapped went =
true: It was just after the message lines...

vgapci0: Boot video device ...
pcib1: <IBM CPC9X5 Hypertransport tunnel> ...=20

with back trace (from OF_peer down):

.OF_peer+0x8c
.cpcht_attach+0x884
.device_attach+0x3ac
.device_probe_and_attach+0x3c
.bus_generic_new_pass+0x12c
.bus_generic_new_pass+0x114
.bus_generic_new_pass+0x114 (yep: listed twice)
.bus_set_pass+0xc0
.root_bus_configure+0x14
.mi_startup+0x10c
btext+0xbc

%r1 before: 0xc30400 ofwstk+0xfe0
%r1 after:  0xd23450
%r3 after:  0xd23450
FreeBSD msr to restore: 0x9000000000001032
ofmsr[0]  to restore:   0x1000000000003030

The same after-openfirmware %r1 and %r3 values that had been showing up =
for the before-copyright examples of ofwcall failures.

And note that it again was a peer request. All the ofwcall-tied =
boot-failures have been for peer requests as far as I remember.

I later did some experiments where I had it report but not stop when the =
after-value was different from the before-value for %r1. When this =
happened for these types of tests it seem to be an isolated example: =
later calls normally have the stack pointer value still in %r1 after =
openfirmware returns. In more detail: At most one report was made for =
such a boot, the rest of the boot went fine. (Of course to get that far =
my hacked ofwcall code avoids using the after-openfirmware %r1 value to =
extract the 3 saved values to be restored from the bottom of ofwstk.)



I was not successful at using "capture on" in DDB for this early-boot =
context. (It hangs things after the first report.) So I've been limited =
to one screen's report and only when I have it stop at the end of the =
report (so it does not scroll away). (No input to DDB available that =
early.) Otherwise the information just scrolls by rather quickly for =
reading any detail. Still it was useful to see that other reports were =
not produced after the first (when there was a first). (I can not claim =
multiple are impossible. It just appears at least infrequent.)

I have not yet investigated making analogous powerpc/GENERIC code and =
builds.

Nor have I dealt with having it report more detail about the peer =
requests that fail.

Nor have I seen examples of what "not failing/%r1-unchanged" looks like =
overall.

I still have no examples of unstable/incomplete initialization(s) or =
race condition(s) to explain why both ways can and do occur from one =
attempt to the next --or that difference peer requests in the sequence =
can be where the problem happens.

=3D=3D=3D
Mark Millard
markmi@dsl-only.net

On Oct 13, 2014, at 3:39 AM, Mark Millard <markmi@dsl-only.net> wrote:

While I do not yet have "show register" or other information displayed =
when %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have =
now had two cases happen for the same, unmodified boot SSD in the same =
PowerMac G5:

A) Boots without failure or finding any changes to %r1 for before vs. =
after openfirmware calls.

B) I had it stop the boot after the code finds that %r1 had instead =
changed. The usual before-copyright-notice sort of timing for where it =
stopped, after pmap_bootstrapped became true. (I need "show register" or =
other such to have more detail.)


I still have no examples of unstable/incomplete initialization(s) or =
race condition(s) to explain why both ways can and do occur from one =
attempt to the next. Both both do.



=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 11:34 PM, Mark Millard <markmi at dsl-only.net> =
wrote:

Fixing stupid typos that reverse what I should have said: removing the =
!'s in front of pmap_bootstrapped (from a copy/paste sequence error)...

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on =
the G5's except for when pmap_bootstrapped in (variants of) =
powerpc64/GENERIC64. (Only covers when I had enough debug context in =
place to know that much. Similarly for other notes.) These ofwcall =
related failures are the vast majority of the boot failures that I've =
seen.

...

The only other ofwcall failure that I've seen happened only once and was =
where prior ofwcall's with pmap_bootstrapped had already happened (as =
reported by the ofwcall history list in my debug/DDB hacks). But this =
was before the %r1 before and after code was in place: that is a recent =
addition to my investigation.




=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 11:20 PM, Mark Millard <markmi at dsl-only.net> =
wrote:

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on =
the G5's except for when !pmap_bootstrapped in (variants of) =
powerpc64/GENERIC64. (Only covers when I had enough debug context in =
place to know that much. Similarly for other notes.) These ofwcall =
related failures are the vast majority of the boot failures that I've =
seen.

A nice thing about what I've found is that I can now figure out how to =
use a comparison of the before and after stack pointers and to force =
DDB's involvement if and only if they are not equal.

That would also report %r1 differences that happen to not to produce =
failures (if there are such). (There has to be some explanation for why =
sometimes it works and sometimes it does not, say, unstable =
initializations, race conditions, or something meeting both criteria.)

Which in turn makes the general technique appropriate to powerpc/GENERIC =
contexts as well. (Coding details may vary.)

I can not promise how quickly I'll get to any specific part of this. But =
I should gradually progress on it.


I should have mentioned some things about the kind of evidence I have =
vs. do not (yet) have:


A) The property defining the only context where I have observed the %r1 =
issue is as noted above.

In all but one of the ofwcall failure cases it was the first ofwcall in =
that !pmap_bootstrapped context that had the problem.

The only other ofwcall failure that I've seen happened only once and was =
where prior ofwcall's with !pmap_bootstrapped had already happened (as =
reported by the ofwcall history list in my debug/DDB hacks). But this =
was before the %r1 before and after code was in place: that is a recent =
addition to my investigation.


B) While I've not been building debug code variants for powerpc/GENERIC =
I've never seen the powerpc/GENERIC code fail to boot the G5's. And I =
have spent some sessions doing reboot after reboot to see if I'd get =
some failures (in addition to some other more normal uses).


C) So far I've only been looking at "show registers" when it gets a =
boot-time exception that a DDB processes with the automatic script: the =
crashes. I do not (yet) have any observations of what things look like =
during such points for successful boots. (I'm figuring out ways to get =
and see the evidence spanning early boot time as I go.) And so I've only =
been looking with such special debug code where I knew I could reproduce =
the failures (3 PowerMac G5's when using variants of =
powerpc64/GENERIC64.)

In fact if the hack that I put in place completely masks the problem =
then I currently would not ever observe any problem-specific information =
from the successful boots. Thus the before/after comparison would seem =
to be next for my investigation.



=3D=3D=3D
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn <nwhitehorn at =
freebsd.org> wrote:

Interesting. If OF is changing the value of r1, there must be some =
problem with the ABI thunk the 64-bit kernel uses or a problem with trap =
handlers. This is obviously not systematic if loader and the kernel up =
to that point have no problems. Does a 32-bit kernel have the same =
problems on your hardware? That would test whether it is the ABI =
translation.
-Nathan

On 10/12/14 17:53, Mark Millard wrote:
> NOTE: I make no claim that any of the below hacks for ofwcall are =
appropriate code for FreeBSD's general context. I only claim that it =
seems to make the specific PowerMac G5 problem go away, gives solid =
evidence for at least some of what is going on (justifying the =
investigative and testing hacks) and so gives evidence for an =
appropriate, more general FreeBSD solution.
>=20
>=20
> The big issue is: The PowerMac G5 openfirmware does not always =
preserve the %r1 value (the stack pointer contents) that it is initially =
given, at least when the early "before copyright" crash problem is =
happening but possibly other times as well.
>=20
> I had the following investigative code in ofwcall, snapshotting the =
value of %r1 before and after openfirmware's code is used:
>=20
> 	lis	%r4,openfirmware_entry@ha
> 	ld	%r4,openfirmware_entry@l(%r4)
> ...
> 	mr   %r17,%r1 /* ADDED HACK TO RECORD %r1 before...
> 	/* Finally, branch to OF */
> 	mtctr	%r4
> 	bctrl
> 	mr   %r18,%r1 /* ADDED HACK TO RECORD %r1 after...
>=20
> then the DDB show registers from the crash that I'd hacked in would =
show these values instead of the zeros they otherwise always display, in =
addition to what the show registers has always shown for r1.
>=20
> The results were like the following example for every such crash:
>=20
> r17 =3D 0xC31400 ofwstk+0xfe0
> r18 =3D 0xd24450
> r1  =3D 0xd24450
>=20
> Because of that %r1 value the later code such as:
>=20
> 	/* Reload stack pointer and MSR from the OFW stack */
> 	ld	%r6,24(%r1)
> 	ld	%r2,16(%r1)
> 	ld	%r1,8(%r1)
>=20
> gets garbage-in/garbage-out results, including %r6 being values like =
0xbc0568 instead of the value saved msr to later be restored: =
0x9000000000001032.
>=20
> So one PowerMac G5 specific hack involved in my working-boots context =
is to force the original %r1 value to be used (based on %r17 being a =
before-call copy, similar to the above):
>=20
> 	ld	%r6,24(%r17)
> 	ld	%r2,16(%r17)
> 	ld	%r1,8(%r17)
>=20
> But the exception report from DDB has had problems in part because =
sprg0 still has the openfirmware value at the time even though the =
exception is after openfirmware returned (the wrong value results in the =
register for GET_CPUINFO(<register>). So I hacked in a before-exception =
restore of FreeBSD's sprg0 inside ofwcall to make the exception handler =
code have that much FreeBSD context available at the exception (if it =
occurs, anyway). This was really just to help with information =
gathering, although I've not tested only having the %r17 changes.
>=20
> So overall PowerMac G5 specific hacking the ofwcall code to have =
instead (based on what was reported above):
>=20
> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S
> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> --- /usr/src/sys/powerpc/ofw/ofwcall64.S	(revision 272558)
> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S	(working copy)
> @@ -52,6 +52,12 @@
> GLOBAL(rtas_entry)
> 	.llong	0			/* RTAS entry point */
> + /* HACK: part of having sprg0 in place for trap */
> +ofwsprg0save:
> +	.space	8 /* sizeof(register_t) */
> +GLOBAL(ofw_sprg0_save)
> +	.llong	0
> +
> /*
> * Open Firmware Real-mode Entry Point. This is a huge pain.
> */
> @@ -97,6 +103,10 @@
> 	lis	%r4,openfirmware_entry@ha
> 	ld	%r4,openfirmware_entry@l(%r4)
> +	/* HACK: part of having FreeBSD's sprg0 in place for the =
exception problem */
> +	lis	%r14,ofw_sprg0_save@ha
> +	ld	%r14,ofw_sprg0_save@l(%r14)
> +
> 	/*
> 	 * Set the MSR to the OF value. This has the side effect of =
disabling
> 	 * exceptions, which is important for the next few steps.
> @@ -123,14 +133,27 @@
> 	stw	%r5,4(%r1)
> 	stw	%r5,0(%r1)
> +	/* HACK: part of having FreeBSD's sprg0 in place for the =
exception problem */
> +	lis	%r6,ofwsprg0save@ha
> +	std	%r14,ofwsprg0save@l(%r6)
> +
> +	/* HACK: part of IGNORING the later %r1 value from openfirmware =
*/
> +	mr	%r17,%r1
> +
> 	/* Finally, branch to OF */
> 	mtctr	%r4
> 	bctrl
> +	/* HACK: part of having FreeBSD's sprg0 in place for the =
exception problem */
> +	lis	%r6,ofwsprg0save@ha
> +	ld	%r6,ofwsprg0save@l(%r6)
> +	mtsprg0	%r6
> +
> 	/* Reload stack pointer and MSR from the OFW stack */
> -	ld	%r6,24(%r1)
> -	ld	%r2,16(%r1)
> -	ld	%r1,8(%r1)
> +	/* HACKED to ignore the %r1 value that results from =
openfirmware's call */
> +	ld	%r6,24(%r17)
> +	ld	%r2,16(%r17)
> +	ld	%r1,8(%r17)
> 	/* Now set the real MSR */
> 	mtmsrd	%r6
>=20
> This results in no crashes happening so far in my testing, not even =
the 16 GByte RAM machine that crashed so much.
>=20
> NOTE: owf_machdep.c was changed to use "extern register_t =
ofw_sprg0_save;" to match the above.
>=20
> I still have ps3 disabled in GENERIC64 so that I can also have the sc =
options in GENERIC64. And the DDB and GDB options are still present as =
well.
>=20
> And I still have my hack to force a DDB script that does show =
registers and shows the ofwcall history information that I hacked in, =
even for the very early crashes before input is possible. Not that I'm =
now getting such executions of the script. (A before possible-crash =
backtrace is also shown by the added code. That still shows up.)
>=20
> I'll probably next switch to reverting the DDB related code changes =
and to removing the DDB/GDB options and see how that goes.
>=20
>=20
> =3D=3D=3D
> Mark Millard
> markmi at dsl-only.net
>=20
>=20








Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?A2AB9066-259B-4B7D-BDDC-D03AE5827E13>