Date:      Sun, 12 Oct 2014 23:34:56 -0700
From:      Mark Millard <markmi@dsl-only.net>
To:        Nathan Whitehorn <nwhitehorn@freebsd.org>
Cc:        Justin Hibbits <chmeeedalf@gmail.com>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed]
Message-ID:  <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net>
In-Reply-To: <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>
References:  <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>

Fixing stupid typos that reverse what I should have said: removing the
!'s in front of pmap_bootstrapped (from a copy/paste sequence error)...

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on
the G5's except when pmap_bootstrapped is set in (variants of)
powerpc64/GENERIC64. (This only covers boots where I had enough debug
context in place to know that much. Similarly for other notes.) These
ofwcall related failures are the vast majority of the boot failures that
I've seen.

...

The only other ofwcall failure that I've seen happened only once and was
where prior ofwcall's with pmap_bootstrapped had already happened (as
reported by the ofwcall history list in my debug/DDB hacks). But this
was before the %r1 before and after code was in place: that is a recent
addition to my investigation.




===
Mark Millard
markmi@dsl-only.net

On Oct 12, 2014, at 11:20 PM, Mark Millard <markmi@dsl-only.net> wrote:

Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on
the G5's except for when !pmap_bootstrapped in (variants of)
powerpc64/GENERIC64. (Only covers when I had enough debug context in
place to know that much. Similarly for other notes.) These ofwcall
related failures are the vast majority of the boot failures that I've
seen.

A nice thing about what I've found is that I can now figure out how to
use a comparison of the before and after stack pointers and to force
DDB's involvement if and only if they are not equal.

That would also report %r1 differences that happen not to produce
failures (if there are such). (There has to be some explanation for why
sometimes it works and sometimes it does not, say, unstable
initializations, race conditions, or something meeting both criteria.)

Which in turn makes the general technique appropriate to powerpc/GENERIC
contexts as well. (Coding details may vary.)
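
The before/after comparison described above can be sketched in C as a
plain user-space model. This is only an illustration under stated
assumptions: fw_entry_t, fw_preserves_sp, fw_clobbers_sp and
sp_changed_across_call are hypothetical names, not part of ofwcall64.S,
and a real check would sit in the assembly around the bctrl and hand off
to DDB rather than return a flag. The two addresses are the r17/r18
values from the crash report quoted further down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the firmware entry point: it is handed the
 * stack pointer value in use at the call and returns the value that is
 * found in the stack pointer after the call comes back.
 */
typedef uintptr_t (*fw_entry_t)(uintptr_t sp);

/* Models a well-behaved call: the stack pointer comes back unchanged. */
uintptr_t fw_preserves_sp(uintptr_t sp)
{
	return sp;
}

/* Models the observed failure: a different value comes back. */
uintptr_t fw_clobbers_sp(uintptr_t sp)
{
	(void)sp;
	return (uintptr_t)0xd24450;
}

/*
 * Snapshot the stack pointer before the call, make the call, snapshot it
 * again, and report whether the two differ. Returns 1 on a mismatch
 * (the case where a debugger should be entered) and 0 otherwise.
 */
int sp_changed_across_call(fw_entry_t entry, uintptr_t sp_before,
    uintptr_t *sp_after)
{
	assert(entry != NULL && sp_after != NULL);
	*sp_after = entry(sp_before);
	return *sp_after != sp_before;
}
```

With fw_clobbers_sp and a starting value of 0xC31400 the function
returns 1 and leaves 0xd24450 in *sp_after; with fw_preserves_sp it
returns 0. Only the first case would need DDB's involvement.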

I cannot promise how quickly I'll get to any specific part of this. But
I should gradually progress on it.


I should have mentioned some things about the kind of evidence I have
vs. do not (yet) have:


A) The property defining the only context where I have observed the %r1
issue is as noted above.

In all but one of the ofwcall failure cases it was the first ofwcall in
that !pmap_bootstrapped context that had the problem.

The only other ofwcall failure that I've seen happened only once and was
where prior ofwcall's with !pmap_bootstrapped had already happened (as
reported by the ofwcall history list in my debug/DDB hacks). But this
was before the %r1 before and after code was in place: that is a recent
addition to my investigation.


B) While I've not been building debug code variants for powerpc/GENERIC
I've never seen the powerpc/GENERIC code fail to boot the G5's. And I
have spent some sessions doing reboot after reboot to see if I'd get
some failures (in addition to some other more normal uses).


C) So far I've only been looking at "show registers" when it gets a
boot-time exception that DDB processes with the automatic script: the
crashes. I do not (yet) have any observations of what things look like
during such points for successful boots. (I'm figuring out ways to get
and see the evidence spanning early boot time as I go.) And so I've only
been looking with such special debug code where I knew I could reproduce
the failures (3 PowerMac G5's when using variants of
powerpc64/GENERIC64).

In fact if the hack that I put in place completely masks the problem
then I currently would not ever observe any problem-specific information
from the successful boots. Thus the before/after comparison would seem
to be next for my investigation.
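
One way to get evidence from successful boots as well is to record every
call rather than only the ones that crash. The following C sketch models
such a record as a small ring buffer. It is an illustration under
assumptions, not the actual debug/DDB hack: struct ofw_call_record,
struct ofw_call_history, ofw_hist_record, ofw_hist_mismatches and the
slot count of 8 are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFW_HIST_SLOTS 8	/* hypothetical size; oldest entries get overwritten */

/* One entry per firmware call, kept whether or not the call misbehaved. */
struct ofw_call_record {
	uintptr_t sp_before;		/* stack pointer just before the call */
	uintptr_t sp_after;		/* stack pointer just after the call */
	int pmap_bootstrapped;		/* state of the flag at the time of the call */
};

struct ofw_call_history {
	struct ofw_call_record slots[OFW_HIST_SLOTS];
	size_t count;			/* total number of calls recorded so far */
};

/* Append one record, wrapping around once the buffer is full. */
void ofw_hist_record(struct ofw_call_history *h, uintptr_t sp_before,
    uintptr_t sp_after, int pmap_bootstrapped)
{
	struct ofw_call_record *r;

	assert(h != NULL);
	r = &h->slots[h->count % OFW_HIST_SLOTS];
	r->sp_before = sp_before;
	r->sp_after = sp_after;
	r->pmap_bootstrapped = pmap_bootstrapped;
	h->count++;
}

/* Count the retained records whose before and after values differ. */
size_t ofw_hist_mismatches(const struct ofw_call_history *h)
{
	size_t n, i, bad = 0;

	assert(h != NULL);
	n = h->count < OFW_HIST_SLOTS ? h->count : OFW_HIST_SLOTS;
	for (i = 0; i < n; i++) {
		if (h->slots[i].sp_before != h->slots[i].sp_after)
			bad++;
	}
	return bad;
}
```

A history like this could then be dumped at the end of a successful boot,
so that a masked %r1 change would still show up as a nonzero mismatch
count.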



===
Mark Millard
markmi at dsl-only.net

On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn <nwhitehorn@freebsd.org> wrote:

Interesting. If OF is changing the value of r1, there must be some
problem with the ABI thunk the 64-bit kernel uses or a problem with trap
handlers. This is obviously not systematic if loader and the kernel up
to that point have no problems. Does a 32-bit kernel have the same
problems on your hardware? That would test whether it is the ABI
translation.
-Nathan

On 10/12/14 17:53, Mark Millard wrote:
> NOTE: I make no claim that any of the below hacks for ofwcall are
> appropriate code for FreeBSD's general context. I only claim that it
> seems to make the specific PowerMac G5 problem go away, gives solid
> evidence for at least some of what is going on (justifying the
> investigative and testing hacks) and so gives evidence for an
> appropriate, more general FreeBSD solution.
>
>
> The big issue is: The PowerMac G5 openfirmware does not always
> preserve the %r1 value (the stack pointer contents) that it is
> initially given, at least when the early "before copyright" crash
> problem is happening but possibly other times as well.
>
> I had the following investigative code in ofwcall, snapshotting the
> value of %r1 before and after openfirmware's code is used:
>
> 	lis	%r4,openfirmware_entry@ha
> 	ld	%r4,openfirmware_entry@l(%r4)
> ...
> 	mr   %r17,%r1 /* ADDED HACK TO RECORD %r1 before... */
> 	/* Finally, branch to OF */
> 	mtctr	%r4
> 	bctrl
> 	mr   %r18,%r1 /* ADDED HACK TO RECORD %r1 after... */
>
> then the DDB show registers from the crash that I'd hacked in would
> show these values instead of the zeros they otherwise always display,
> in addition to what the show registers has always shown for r1.
>
> The results were like the following example for every such crash:
>
> r17 = 0xC31400 ofwstk+0xfe0
> r18 = 0xd24450
> r1  = 0xd24450
>
> Because of that %r1 value the later code such as:
>
> 	/* Reload stack pointer and MSR from the OFW stack */
> 	ld	%r6,24(%r1)
> 	ld	%r2,16(%r1)
> 	ld	%r1,8(%r1)
>
> gets garbage-in/garbage-out results, including %r6 being values like
> 0xbc0568 instead of the saved MSR value that is later to be restored:
> 0x9000000000001032.
>
> So one PowerMac G5 specific hack involved in my working-boots context
> is to force the original %r1 value to be used (based on %r17 being a
> before-call copy, similar to the above):
>
> 	ld	%r6,24(%r17)
> 	ld	%r2,16(%r17)
> 	ld	%r1,8(%r17)
>
> But the exception report from DDB has had problems in part because
> sprg0 still has the openfirmware value at the time even though the
> exception is after openfirmware returned (the wrong value results in
> the register for GET_CPUINFO(<register>)). So I hacked in a
> before-exception restore of FreeBSD's sprg0 inside ofwcall to make the
> exception handler code have that much FreeBSD context available at the
> exception (if it occurs, anyway). This was really just to help with
> information gathering, although I've not tested only having the %r17
> changes.
>
> So overall the PowerMac G5 specific hacking changes the ofwcall code
> to the following instead (based on what was reported above):
>
> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S
> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S
> ===================================================================
> --- /usr/src/sys/powerpc/ofw/ofwcall64.S	(revision 272558)
> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S	(working copy)
> @@ -52,6 +52,12 @@
> GLOBAL(rtas_entry)
> 	.llong	0			/* RTAS entry point */
> + /* HACK: part of having sprg0 in place for trap */
> +ofwsprg0save:
> +	.space	8 /* sizeof(register_t) */
> +GLOBAL(ofw_sprg0_save)
> +	.llong	0
> +
> /*
>  * Open Firmware Real-mode Entry Point. This is a huge pain.
>  */
> @@ -97,6 +103,10 @@
> 	lis	%r4,openfirmware_entry@ha
> 	ld	%r4,openfirmware_entry@l(%r4)
> +	/* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
> +	lis	%r14,ofw_sprg0_save@ha
> +	ld	%r14,ofw_sprg0_save@l(%r14)
> +
> 	/*
> 	 * Set the MSR to the OF value. This has the side effect of disabling
> 	 * exceptions, which is important for the next few steps.
> @@ -123,14 +133,27 @@
> 	stw	%r5,4(%r1)
> 	stw	%r5,0(%r1)
> +	/* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
> +	lis	%r6,ofwsprg0save@ha
> +	std	%r14,ofwsprg0save@l(%r6)
> +
> +	/* HACK: part of IGNORING the later %r1 value from openfirmware */
> +	mr	%r17,%r1
> +
> 	/* Finally, branch to OF */
> 	mtctr	%r4
> 	bctrl
> +	/* HACK: part of having FreeBSD's sprg0 in place for the exception problem */
> +	lis	%r6,ofwsprg0save@ha
> +	ld	%r6,ofwsprg0save@l(%r6)
> +	mtsprg0	%r6
> +
> 	/* Reload stack pointer and MSR from the OFW stack */
> -	ld	%r6,24(%r1)
> -	ld	%r2,16(%r1)
> -	ld	%r1,8(%r1)
> +	/* HACKED to ignore the %r1 value that results from openfirmware's call */
> +	ld	%r6,24(%r17)
> +	ld	%r2,16(%r17)
> +	ld	%r1,8(%r17)
>   	/* Now set the real MSR */
> 	mtmsrd	%r6
>
> This results in no crashes happening so far in my testing, not even on
> the 16 GByte RAM machine that crashed so much.
>
> NOTE: ofw_machdep.c was changed to use "extern register_t
> ofw_sprg0_save;" to match the above.
>
> I still have ps3 disabled in GENERIC64 so that I can also have the sc
> options in GENERIC64. And the DDB and GDB options are still present as
> well.
>
> And I still have my hack to force a DDB script that does show
> registers and shows the ofwcall history information that I hacked in,
> even for the very early crashes before input is possible. Not that I'm
> now getting such executions of the script. (A before possible-crash
> backtrace is also shown by the added code. That still shows up.)
>
> I'll probably next switch to reverting the DDB related code changes
> and to removing the DDB/GDB options and see how that goes.
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
>





