From owner-freebsd-ppc@FreeBSD.ORG Tue Oct 14 16:14:21 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 8320880 for ; Tue, 14 Oct 2014 16:14:21 +0000 (UTC) Received: from asp.reflexion.net (outbound-241.asp.reflexion.net [69.84.129.241]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 14FDB144 for ; Tue, 14 Oct 2014 16:14:20 +0000 (UTC) Received: (qmail 30133 invoked from network); 14 Oct 2014 16:14:12 -0000 Received: from unknown (HELO mail-cs-02.app.dca.reflexion.local) (10.81.19.2) by 0 (rfx-qmail) with SMTP; 14 Oct 2014 16:14:12 -0000 Received: by mail-cs-02.app.dca.reflexion.local (Reflexion email security v7.30.7) with SMTP; Tue, 14 Oct 2014 12:14:12 -0400 (EDT) Received: (qmail 11084 invoked from network); 14 Oct 2014 16:14:12 -0000 Received: from unknown (HELO iron2.pdx.net) (69.64.224.71) by 0 (rfx-qmail) with (DHE-RSA-AES256-SHA encrypted) SMTP; 14 Oct 2014 16:14:12 -0000 X-No-Relay: not in my network X-No-Relay: not in my network X-No-Relay: not in my network Received: from [192.168.1.8] (c-98-246-178-138.hsd1.or.comcast.net [98.246.178.138]) by iron2.pdx.net (Postfix) with ESMTPSA id DFF7A1C4052; Tue, 14 Oct 2014 09:14:08 -0700 (PDT) Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed] From: Mark Millard In-Reply-To: Date: Tue, 14 Oct 2014 09:14:10 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> To: Nathan Whitehorn X-Mailer: Apple Mail (2.1878.6) Cc: Justin Hibbits , FreeBSD PowerPC ML X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Oct 2014 16:14:21 -0000 Additional notes from additional experiments... (So far from one G5.) I got back trace, show registers, and my openfirmware-history list going = for failure reporting based on explicit before vs. after tests of %r1 = values. (Explicit breakpoint call for unequal, being careful to = save/restore %r3 around the call.) I filled several registers with = potentially interesting values that would otherwise have had zero as a = value (%r15-%r19, although %r15 is redundant with %r6 currently). An interesting property resulted: every time %r1 had changed from having = the before-value (stack pointer value) %r1 instead ended up with a value = equal to what openfirmware put in %r3. And more then that: For builds with the same ofwstk position the %r3 = value involved was fixed for the failures, for example when = 0x30400=3Dofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as = 0xd23450 for the failures. When 0x31400=3Dofwstk+0xfe0: %r3 and %r1 = ended up for failure as 0xd24450 instead. Yep: offset by the same amount = as ofwstk. And I got one example where the openfirmware %r1-value-change failure = was instead much later in the boot, well after pmap_bootstrapped went = true: It was just after the message lines... vgapci0: Boot video device ... pcib1: ...=20 with back trace (from OF_peer down): .OF_peer+0x8c .cpcht_attach+0x884 .device_attach+0x3ac .device_probe_and_attach+0x3c .bus_generic_new_pass+0x12c .bus_generic_new_pass+0x114 .bus_generic_new_pass+0x114 (yep: listed twice) .bus_set_pass+0xc0 .root_bus_configure+0x14 .mi_startup+0x10c btext+0xbc %r1 before: 0xc30400 ofwstk+0xfe0 %r1 after: 0xd23450 %r3 after: 0xd23450 FreeBSD msr to restore: 0x9000000000001032 ofmsr[0] to restore: 0x1000000000003030 The same after-openfirmware %r1 and %r3 values that had been showing up = for the before-copyright examples of ofwcall failures. And note that it again was a peer request. All the ofwcall-tied = boot-failures have been for peer requests as far as I remember. I later did some experiments where I had it report but not stop when the = after-value was different from the before-value for %r1. When this = happened for these types of tests it seem to be an isolated example: = later calls normally have the stack pointer value still in %r1 after = openfirmware returns. In more detail: At most one report was made for = such a boot, the rest of the boot went fine. (Of course to get that far = my hacked ofwcall code avoids using the after-openfirmware %r1 value to = extract the 3 saved values to be restored from the bottom of ofwstk.) I was not successful at using "capture on" in DDB for this early-boot = context. (It hangs things after the first report.) So I've been limited = to one screen's report and only when I have it stop at the end of the = report (so it does not scroll away). (No input to DDB available that = early.) Otherwise the information just scrolls by rather quickly for = reading any detail. Still it was useful to see that other reports were = not produced after the first (when there was a first). (I can not claim = multiple are impossible. It just appears at least infrequent.) I have not yet investigated making analogous powerpc/GENERIC code and = builds. Nor have I dealt with having it report more detail about the peer = requests that fail. Nor have I seen examples of what "not failing/%r1-unchanged" looks like = overall. I still have no examples of unstable/incomplete initialization(s) or = race condition(s) to explain why both ways can and do occur from one = attempt to the next --or that difference peer requests in the sequence = can be where the problem happens. =3D=3D=3D Mark Millard markmi@dsl-only.net On Oct 13, 2014, at 3:39 AM, Mark Millard wrote: While I do not yet have "show register" or other information displayed = when %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have = now had two cases happen for the same, unmodified boot SSD in the same = PowerMac G5: A) Boots without failure or finding any changes to %r1 for before vs. = after openfirmware calls. B) I had it stop the boot after the code finds that %r1 had instead = changed. The usual before-copyright-notice sort of timing for where it = stopped, after pmap_bootstrapped became true. (I need "show register" or = other such to have more detail.) I still have no examples of unstable/incomplete initialization(s) or = race condition(s) to explain why both ways can and do occur from one = attempt to the next. Both both do. =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 12, 2014, at 11:34 PM, Mark Millard = wrote: Fixing stupid typos that reverse what I should have said: removing the = !'s in front of pmap_bootstrapped (from a copy/paste sequence error)... Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on = the G5's except for when pmap_bootstrapped in (variants of) = powerpc64/GENERIC64. (Only covers when I had enough debug context in = place to know that much. Similarly for other notes.) These ofwcall = related failures are the vast majority of the boot failures that I've = seen. ... The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with pmap_bootstrapped had already happened (as = reported by the ofwcall history list in my debug/DDB hacks). But this = was before the %r1 before and after code was in place: that is a recent = addition to my investigation. =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 12, 2014, at 11:20 PM, Mark Millard = wrote: Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on = the G5's except for when !pmap_bootstrapped in (variants of) = powerpc64/GENERIC64. (Only covers when I had enough debug context in = place to know that much. Similarly for other notes.) These ofwcall = related failures are the vast majority of the boot failures that I've = seen. A nice thing about what I've found is that I can now figure out how to = use a comparison of the before and after stack pointers and to force = DDB's involvement if and only if they are not equal. That would also report %r1 differences that happen to not to produce = failures (if there are such). (There has to be some explanation for why = sometimes it works and sometimes it does not, say, unstable = initializations, race conditions, or something meeting both criteria.) Which in turn makes the general technique appropriate to powerpc/GENERIC = contexts as well. (Coding details may vary.) I can not promise how quickly I'll get to any specific part of this. But = I should gradually progress on it. I should have mentioned some things about the kind of evidence I have = vs. do not (yet) have: A) The property defining the only context where I have observed the %r1 = issue is as noted above. In all but one of the ofwcall failure cases it was the first ofwcall in = that !pmap_bootstrapped context that had the problem. The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with !pmap_bootstrapped had already happened (as = reported by the ofwcall history list in my debug/DDB hacks). But this = was before the %r1 before and after code was in place: that is a recent = addition to my investigation. B) While I've not been building debug code variants for powerpc/GENERIC = I've never seen the powerpc/GENERIC code fail to boot the G5's. And I = have spent some sessions doing reboot after reboot to see if I'd get = some failures (in addition to some other more normal uses). C) So far I've only been looking at "show registers" when it gets a = boot-time exception that a DDB processes with the automatic script: the = crashes. I do not (yet) have any observations of what things look like = during such points for successful boots. (I'm figuring out ways to get = and see the evidence spanning early boot time as I go.) And so I've only = been looking with such special debug code where I knew I could reproduce = the failures (3 PowerMac G5's when using variants of = powerpc64/GENERIC64.) In fact if the hack that I put in place completely masks the problem = then I currently would not ever observe any problem-specific information = from the successful boots. Thus the before/after comparison would seem = to be next for my investigation. =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn wrote: Interesting. If OF is changing the value of r1, there must be some = problem with the ABI thunk the 64-bit kernel uses or a problem with trap = handlers. This is obviously not systematic if loader and the kernel up = to that point have no problems. Does a 32-bit kernel have the same = problems on your hardware? That would test whether it is the ABI = translation. -Nathan On 10/12/14 17:53, Mark Millard wrote: > NOTE: I make no claim that any of the below hacks for ofwcall are = appropriate code for FreeBSD's general context. I only claim that it = seems to make the specific PowerMac G5 problem go away, gives solid = evidence for at least some of what is going on (justifying the = investigative and testing hacks) and so gives evidence for an = appropriate, more general FreeBSD solution. >=20 >=20 > The big issue is: The PowerMac G5 openfirmware does not always = preserve the %r1 value (the stack pointer contents) that it is initially = given, at least when the early "before copyright" crash problem is = happening but possibly other times as well. >=20 > I had the following investigative code in ofwcall, snapshotting the = value of %r1 before and after openfirmware's code is used: >=20 > lis %r4,openfirmware_entry@ha > ld %r4,openfirmware_entry@l(%r4) > ... > mr %r17,%r1 /* ADDED HACK TO RECORD %r1 before... > /* Finally, branch to OF */ > mtctr %r4 > bctrl > mr %r18,%r1 /* ADDED HACK TO RECORD %r1 after... >=20 > then the DDB show registers from the crash that I'd hacked in would = show these values instead of the zeros they otherwise always display, in = addition to what the show registers has always shown for r1. >=20 > The results were like the following example for every such crash: >=20 > r17 =3D 0xC31400 ofwstk+0xfe0 > r18 =3D 0xd24450 > r1 =3D 0xd24450 >=20 > Because of that %r1 value the later code such as: >=20 > /* Reload stack pointer and MSR from the OFW stack */ > ld %r6,24(%r1) > ld %r2,16(%r1) > ld %r1,8(%r1) >=20 > gets garbage-in/garbage-out results, including %r6 being values like = 0xbc0568 instead of the value saved msr to later be restored: = 0x9000000000001032. >=20 > So one PowerMac G5 specific hack involved in my working-boots context = is to force the original %r1 value to be used (based on %r17 being a = before-call copy, similar to the above): >=20 > ld %r6,24(%r17) > ld %r2,16(%r17) > ld %r1,8(%r17) >=20 > But the exception report from DDB has had problems in part because = sprg0 still has the openfirmware value at the time even though the = exception is after openfirmware returned (the wrong value results in the = register for GET_CPUINFO(). So I hacked in a before-exception = restore of FreeBSD's sprg0 inside ofwcall to make the exception handler = code have that much FreeBSD context available at the exception (if it = occurs, anyway). This was really just to help with information = gathering, although I've not tested only having the %r17 changes. >=20 > So overall PowerMac G5 specific hacking the ofwcall code to have = instead (based on what was reported above): >=20 > root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S > Index: /usr/src/sys/powerpc/ofw/ofwcall64.S > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- /usr/src/sys/powerpc/ofw/ofwcall64.S (revision 272558) > +++ /usr/src/sys/powerpc/ofw/ofwcall64.S (working copy) > @@ -52,6 +52,12 @@ > GLOBAL(rtas_entry) > .llong 0 /* RTAS entry point */ > + /* HACK: part of having sprg0 in place for trap */ > +ofwsprg0save: > + .space 8 /* sizeof(register_t) */ > +GLOBAL(ofw_sprg0_save) > + .llong 0 > + > /* > * Open Firmware Real-mode Entry Point. This is a huge pain. > */ > @@ -97,6 +103,10 @@ > lis %r4,openfirmware_entry@ha > ld %r4,openfirmware_entry@l(%r4) > + /* HACK: part of having FreeBSD's sprg0 in place for the = exception problem */ > + lis %r14,ofw_sprg0_save@ha > + ld %r14,ofw_sprg0_save@l(%r14) > + > /* > * Set the MSR to the OF value. This has the side effect of = disabling > * exceptions, which is important for the next few steps. > @@ -123,14 +133,27 @@ > stw %r5,4(%r1) > stw %r5,0(%r1) > + /* HACK: part of having FreeBSD's sprg0 in place for the = exception problem */ > + lis %r6,ofwsprg0save@ha > + std %r14,ofwsprg0save@l(%r6) > + > + /* HACK: part of IGNORING the later %r1 value from openfirmware = */ > + mr %r17,%r1 > + > /* Finally, branch to OF */ > mtctr %r4 > bctrl > + /* HACK: part of having FreeBSD's sprg0 in place for the = exception problem */ > + lis %r6,ofwsprg0save@ha > + ld %r6,ofwsprg0save@l(%r6) > + mtsprg0 %r6 > + > /* Reload stack pointer and MSR from the OFW stack */ > - ld %r6,24(%r1) > - ld %r2,16(%r1) > - ld %r1,8(%r1) > + /* HACKED to ignore the %r1 value that results from = openfirmware's call */ > + ld %r6,24(%r17) > + ld %r2,16(%r17) > + ld %r1,8(%r17) > /* Now set the real MSR */ > mtmsrd %r6 >=20 > This results in no crashes happening so far in my testing, not even = the 16 GByte RAM machine that crashed so much. >=20 > NOTE: owf_machdep.c was changed to use "extern register_t = ofw_sprg0_save;" to match the above. >=20 > I still have ps3 disabled in GENERIC64 so that I can also have the sc = options in GENERIC64. And the DDB and GDB options are still present as = well. >=20 > And I still have my hack to force a DDB script that does show = registers and shows the ofwcall history information that I hacked in, = even for the very early crashes before input is possible. Not that I'm = now getting such executions of the script. (A before possible-crash = backtrace is also shown by the added code. That still shows up.) >=20 > I'll probably next switch to reverting the DDB related code changes = and to removing the DDB/GDB options and see how that goes. >=20 >=20 > =3D=3D=3D > Mark Millard > markmi at dsl-only.net >=20 >=20