Date: Sun, 12 Oct 2014 23:34:56 -0700 From: Mark Millard <markmi@dsl-only.net> To: Nathan Whitehorn <nwhitehorn@freebsd.org> Cc: Justin Hibbits <chmeeedalf@gmail.com>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org> Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed] Message-ID: <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> In-Reply-To: <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net>
next in thread | previous in thread | raw e-mail | index | archive | help
Fixing stupid typos that reverse what I should have said: removing the = !'s in front of pmap_bootstrapped (from a copy/paste sequence error)... Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on = the G5's except for when pmap_bootstrapped in (variants of) = powerpc64/GENERIC64. (Only covers when I had enough debug context in = place to know that much. Similarly for other notes.) These ofwcall = related failures are the vast majority of the boot failures that I've = seen. ... The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with pmap_bootstrapped had already happened (as = reported by the ofwcall history list in my debug/DDB hacks). But this = was before the %r1 before and after code was in place: that is a recent = addition to my investigation. =3D=3D=3D Mark Millard markmi@dsl-only.net On Oct 12, 2014, at 11:20 PM, Mark Millard <markmi@dsl-only.net> wrote: Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on = the G5's except for when !pmap_bootstrapped in (variants of) = powerpc64/GENERIC64. (Only covers when I had enough debug context in = place to know that much. Similarly for other notes.) These ofwcall = related failures are the vast majority of the boot failures that I've = seen. A nice thing about what I've found is that I can now figure out how to = use a comparison of the before and after stack pointers and to force = DDB's involvement if and only if they are not equal. That would also report %r1 differences that happen to not to produce = failures (if there are such). (There has to be some explanation for why = sometimes it works and sometimes it does not, say, unstable = initializations, race conditions, or something meeting both criteria.) Which in turn makes the general technique appropriate to powerpc/GENERIC = contexts as well. (Coding details may vary.) I can not promise how quickly I'll get to any specific part of this. But = I should gradually progress on it. I should have mentioned some things about the kind of evidence I have = vs. do not (yet) have: A) The property defining the only context where I have observed the %r1 = issue is as noted above. In all but one of the ofwcall failure cases it was the first ofwcall in = that !pmap_bootstrapped context that had the problem. The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with !pmap_bootstrapped had already happened (as = reported by the ofwcall history list in my debug/DDB hacks). But this = was before the %r1 before and after code was in place: that is a recent = addition to my investigation. B) While I've not been building debug code variants for powerpc/GENERIC = I've never seen the powerpc/GENERIC code fail to boot the G5's. And I = have spent some sessions doing reboot after reboot to see if I'd get = some failures (in addition to some other more normal uses). C) So far I've only been looking at "show registers" when it gets a = boot-time exception that a DDB processes with the automatic script: the = crashes. I do not (yet) have any observations of what things look like = during such points for successful boots. (I'm figuring out ways to get = and see the evidence spanning early boot time as I go.) And so I've only = been looking with such special debug code where I knew I could reproduce = the failures (3 PowerMac G5's when using variants of = powerpc64/GENERIC64.) In fact if the hack that I put in place completely masks the problem = then I currently would not ever observe any problem-specific information = from the successful boots. Thus the before/after comparison would seem = to be next for my investigation. =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn <nwhitehorn@freebsd.org> = wrote: Interesting. If OF is changing the value of r1, there must be some = problem with the ABI thunk the 64-bit kernel uses or a problem with trap = handlers. This is obviously not systematic if loader and the kernel up = to that point have no problems. Does a 32-bit kernel have the same = problems on your hardware? That would test whether it is the ABI = translation. -Nathan On 10/12/14 17:53, Mark Millard wrote: > NOTE: I make no claim that any of the below hacks for ofwcall are = appropriate code for FreeBSD's general context. I only claim that it = seems to make the specific PowerMac G5 problem go away, gives solid = evidence for at least some of what is going on (justifying the = investigative and testing hacks) and so gives evidence for an = appropriate, more general FreeBSD solution. >=20 >=20 > The big issue is: The PowerMac G5 openfirmware does not always = preserve the %r1 value (the stack pointer contents) that it is initially = given, at least when the early "before copyright" crash problem is = happening but possibly other times as well. >=20 > I had the following investigative code in ofwcall, snapshotting the = value of %r1 before and after openfirmware's code is used: >=20 > lis %r4,openfirmware_entry@ha > ld %r4,openfirmware_entry@l(%r4) > ... > mr %r17,%r1 /* ADDED HACK TO RECORD %r1 before... > /* Finally, branch to OF */ > mtctr %r4 > bctrl > mr %r18,%r1 /* ADDED HACK TO RECORD %r1 after... >=20 > then the DDB show registers from the crash that I'd hacked in would = show these values instead of the zeros they otherwise always display, in = addition to what the show registers has always shown for r1. >=20 > The results were like the following example for every such crash: >=20 > r17 =3D 0xC31400 ofwstk+0xfe0 > r18 =3D 0xd24450 > r1 =3D 0xd24450 >=20 > Because of that %r1 value the later code such as: >=20 > /* Reload stack pointer and MSR from the OFW stack */ > ld %r6,24(%r1) > ld %r2,16(%r1) > ld %r1,8(%r1) >=20 > gets garbage-in/garbage-out results, including %r6 being values like = 0xbc0568 instead of the value saved msr to later be restored: = 0x9000000000001032. >=20 > So one PowerMac G5 specific hack involved in my working-boots context = is to force the original %r1 value to be used (based on %r17 being a = before-call copy, similar to the above): >=20 > ld %r6,24(%r17) > ld %r2,16(%r17) > ld %r1,8(%r17) >=20 > But the exception report from DDB has had problems in part because = sprg0 still has the openfirmware value at the time even though the = exception is after openfirmware returned (the wrong value results in the = register for GET_CPUINFO(<register>). So I hacked in a before-exception = restore of FreeBSD's sprg0 inside ofwcall to make the exception handler = code have that much FreeBSD context available at the exception (if it = occurs, anyway). This was really just to help with information = gathering, although I've not tested only having the %r17 changes. >=20 > So overall PowerMac G5 specific hacking the ofwcall code to have = instead (based on what was reported above): >=20 > root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S > Index: /usr/src/sys/powerpc/ofw/ofwcall64.S > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- /usr/src/sys/powerpc/ofw/ofwcall64.S (revision 272558) > +++ /usr/src/sys/powerpc/ofw/ofwcall64.S (working copy) > @@ -52,6 +52,12 @@ > GLOBAL(rtas_entry) > .llong 0 /* RTAS entry point */ > + /* HACK: part of having sprg0 in place for trap */ > +ofwsprg0save: > + .space 8 /* sizeof(register_t) */ > +GLOBAL(ofw_sprg0_save) > + .llong 0 > + > /* > * Open Firmware Real-mode Entry Point. This is a huge pain. > */ > @@ -97,6 +103,10 @@ > lis %r4,openfirmware_entry@ha > ld %r4,openfirmware_entry@l(%r4) > + /* HACK: part of having FreeBSD's sprg0 in place for the = exception problem */ > + lis %r14,ofw_sprg0_save@ha > + ld %r14,ofw_sprg0_save@l(%r14) > + > /* > * Set the MSR to the OF value. This has the side effect of = disabling > * exceptions, which is important for the next few steps. > @@ -123,14 +133,27 @@ > stw %r5,4(%r1) > stw %r5,0(%r1) > + /* HACK: part of having FreeBSD's sprg0 in place for the = exception problem */ > + lis %r6,ofwsprg0save@ha > + std %r14,ofwsprg0save@l(%r6) > + > + /* HACK: part of IGNORING the later %r1 value from openfirmware = */ > + mr %r17,%r1 > + > /* Finally, branch to OF */ > mtctr %r4 > bctrl > + /* HACK: part of having FreeBSD's sprg0 in place for the = exception problem */ > + lis %r6,ofwsprg0save@ha > + ld %r6,ofwsprg0save@l(%r6) > + mtsprg0 %r6 > + > /* Reload stack pointer and MSR from the OFW stack */ > - ld %r6,24(%r1) > - ld %r2,16(%r1) > - ld %r1,8(%r1) > + /* HACKED to ignore the %r1 value that results from = openfirmware's call */ > + ld %r6,24(%r17) > + ld %r2,16(%r17) > + ld %r1,8(%r17) > /* Now set the real MSR */ > mtmsrd %r6 >=20 > This results in no crashes happening so far in my testing, not even = the 16 GByte RAM machine that crashed so much. >=20 > NOTE: owf_machdep.c was changed to use "extern register_t = ofw_sprg0_save;" to match the above. >=20 > I still have ps3 disabled in GENERIC64 so that I can also have the sc = options in GENERIC64. And the DDB and GDB options are still present as = well. >=20 > And I still have my hack to force a DDB script that does show = registers and shows the ofwcall history information that I hacked in, = even for the very early crashes before input is possible. Not that I'm = now getting such executions of the script. (A before possible-crash = backtrace is also shown by the added code. That still shows up.) >=20 > I'll probably next switch to reverting the DDB related code changes = and to removing the DDB/GDB options and see how that goes. >=20 >=20 > =3D=3D=3D > Mark Millard > markmi at dsl-only.net >=20 >=20
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?379AA7FC-98C9-48B9-92BB-60E134817AF1>