From owner-freebsd-ppc@FreeBSD.ORG Wed Oct 15 06:30:34 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4114A1A8 for ; Wed, 15 Oct 2014 06:30:34 +0000 (UTC) Received: from asp.reflexion.net (outbound-241.asp.reflexion.net [69.84.129.241]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id CDA6136D for ; Wed, 15 Oct 2014 06:30:32 +0000 (UTC) Received: (qmail 16671 invoked from network); 15 Oct 2014 06:30:31 -0000 Received: from unknown (HELO mail-cs-01.app.dca.reflexion.local) (10.81.19.1) by 0 (rfx-qmail) with SMTP; 15 Oct 2014 06:30:31 -0000 Received: by mail-cs-01.app.dca.reflexion.local (Reflexion email security v7.30.7) with SMTP; Wed, 15 Oct 2014 02:30:31 -0400 (EDT) Received: (qmail 29696 invoked from network); 15 Oct 2014 06:30:30 -0000 Received: from unknown (HELO iron2.pdx.net) (69.64.224.71) by 0 (rfx-qmail) with (DHE-RSA-AES256-SHA encrypted) SMTP; 15 Oct 2014 06:30:30 -0000 X-No-Relay: not in my network X-No-Relay: not in my network X-No-Relay: not in my network Received: from [192.168.1.8] (c-98-246-178-138.hsd1.or.comcast.net [98.246.178.138]) by iron2.pdx.net (Postfix) with ESMTPSA id 534A31C4017; Tue, 14 Oct 2014 23:30:23 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [%r3 corrupted too] From: Mark Millard In-Reply-To: <3D4A76B3-431A-4C94-8747-70369A8A1764@dsl-only.net> Date: Tue, 14 Oct 2014 23:30:28 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <0F85ACBD-F6D6-4ABA-B8FA-00C586A086DE@dsl-only.net> References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> <543D5ACD.20901@freebsd.org> <3D4A76B3-431A-4C94-8747-70369A8A1764@dsl-only.net> To: Nathan Whitehorn X-Mailer: Apple Mail (2.1878.6) Cc: Justin Hibbits , FreeBSD PowerPC ML X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 15 Oct 2014 06:30:34 -0000 I added including after-ofwcall %r1 and %r3 values to my ofwcall history = buffer that I have ddb report when there is a problem. This makes it apparent that %r3 has also been corrupted when %r1 has = been. I say that because the usual/normal %r3 value is 0 in what the code = records and reports and I gather from the FreeBSD source code that the = error indicator is -1. But all along I've been reporting %r3 values for = the crashes that look more like 0xd18868 or other such. Never a 0 or -1 = (0xfff...). And the %r3 crash values even move around when the ofwstk = changes place from build to build. (This "usual"/"error-check" mix suggests %r3 from openfirmware is a = multi-bit representation of a Boolean value, with one's complemented = alternative values and zero as one of the two bit patterns --when %r3 is = not corrupted.) I also got an example of a somewhat later than normal ofwcall failure: = about 23 ofwcall's later than normal. It was not a peer request: ... OF_finddevice+0x90 powermac_smp_get_bsp+0x20 platform_smp_get_bsp+0x78 cpu_mp_start+0x24 mp_startup+0x7c mi_startup+0x10c btext_0xbc So pmap_bootstrapped had been true for a while by this point. Available = memory had been displayed as of when this example stopped to report the = %r1 change. =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 14, 2014, at 3:18 PM, Mark Millard = wrote: For openfirmware: is %r3 on return any more then a failed vs. not flag = with a particular failed-value? Is there any way to validate that %r3 = values for non-failure look reasonable vs. not looking reasonable? (For = all I know %r3 could also be corrupt.) I do not have any documentation for the PowerMac G5 openfirmware API = that is in use or the associated ABI as far as I remember. I do not know = if it strictly followed Darwin's/Mac OS X's ABI on PowerMac G5's vs. if = there was some conversion going back and forth (as there is for FreeBSD, = at least for powerpc64). For openfirmware I derive properties from what = I see in FreeBSD's code (which has to be more explicit then when a = compiler's code generation happens to match at least large parts of an = ABI directly). As I vaguely-remember Apple did not use the TOC for Darwin's/Mac OS X's = ABI but FreeBSD does. If true I do not know what other differences that = there might be (even ignoring the 32 bit vs. 64 bit issues for the = kernels). But the point would be an existence proof of at least one = difference. My understanding is that %r1 was as in FreeBSD. I vaguely seem to remember that for Darwin/Mac OS X some register was = volatile in leaf functions but non-volatile otherwise, or at least when = nested functions were involved. And that brings to mind that the = condition code sets in cr might have had a mix of volatile and = non-volatile status despite being in one register? Did Darwin/Mac OS X = have something special for register usage for Thread-Specific Storage? = Position Independent Code? Indirect Calls? Frame Pointers? I may have some Darwin/Mac OS X information around but I doubt that it = is complete, especially for the 64-bit ABI or for privileged contexts. = For the 32-bit ABI (non-priviledged) I likely have the information about = the above possible ABI properties. I assume that openfirmware avoids the FPU and other such --but I do not = know. But it is privileged code. Are there any known sources of at least some of the information for the = the PowerMac G5 openfirmware ABI(s)? What are good references for the = FreeBSD PowerPC ABI(s) (32 bit and 64 bit, privileged vs. not)? [I cut off some of the older history.] =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 14, 2014, at 10:18 AM, Nathan Whitehorn wrote: r1 *must be* preserved by the standard and for anything to work. It's = being corrupted somehow (Mark's comment about r3 is illuminating), and = if r1 is being corrupted, you can't rely on anything. I suspect it might = be an exception handling issue since it's non-deterministic, but it's = hard to tell. It could also be triggered by the way we've set up the OF = stack frame. It would be good to check if that makes sense. -Nathan On 10/14/14 09:53, Justin Hibbits wrote: > Interesting. Perhaps, instead of using %r1, and relying purely on the > stack we use yet another (non-volatile) register to hold the MSR. > Once we reload the MSR we can get back the saved registers, because > the stack will be valid again. >=20 > Nathan, thoughts? >=20 > - Justin >=20 > On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard = wrote: >> Additional notes from additional experiments... (So far from one G5.) >>=20 >> I got back trace, show registers, and my openfirmware-history list = going for failure reporting based on explicit before vs. after tests of = %r1 values. (Explicit breakpoint call for unequal, being careful to = save/restore %r3 around the call.) I filled several registers with = potentially interesting values that would otherwise have had zero as a = value (%r15-%r19, although %r15 is redundant with %r6 currently). >>=20 >> An interesting property resulted: every time %r1 had changed from = having the before-value (stack pointer value) %r1 instead ended up with = a value equal to what openfirmware put in %r3. >>=20 >> And more then that: For builds with the same ofwstk position the %r3 = value involved was fixed for the failures, for example when = 0x30400=3Dofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as = 0xd23450 for the failures. When 0x31400=3Dofwstk+0xfe0: %r3 and %r1 = ended up for failure as 0xd24450 instead. Yep: offset by the same amount = as ofwstk. >>=20 >> And I got one example where the openfirmware %r1-value-change failure = was instead much later in the boot, well after pmap_bootstrapped went = true: It was just after the message lines... >>=20 >> vgapci0: Boot video device ... >> pcib1: ... >>=20 >> with back trace (from OF_peer down): >>=20 >> .OF_peer+0x8c >> .cpcht_attach+0x884 >> .device_attach+0x3ac >> .device_probe_and_attach+0x3c >> .bus_generic_new_pass+0x12c >> .bus_generic_new_pass+0x114 >> .bus_generic_new_pass+0x114 (yep: listed twice) >> .bus_set_pass+0xc0 >> .root_bus_configure+0x14 >> .mi_startup+0x10c >> btext+0xbc >>=20 >> %r1 before: 0xc30400 ofwstk+0xfe0 >> %r1 after: 0xd23450 >> %r3 after: 0xd23450 >> FreeBSD msr to restore: 0x9000000000001032 >> ofmsr[0] to restore: 0x1000000000003030 >>=20 >> The same after-openfirmware %r1 and %r3 values that had been showing = up for the before-copyright examples of ofwcall failures. >>=20 >> And note that it again was a peer request. All the ofwcall-tied = boot-failures have been for peer requests as far as I remember. >>=20 >> I later did some experiments where I had it report but not stop when = the after-value was different from the before-value for %r1. When this = happened for these types of tests it seem to be an isolated example: = later calls normally have the stack pointer value still in %r1 after = openfirmware returns. In more detail: At most one report was made for = such a boot, the rest of the boot went fine. (Of course to get that far = my hacked ofwcall code avoids using the after-openfirmware %r1 value to = extract the 3 saved values to be restored from the bottom of ofwstk.) >>=20 >>=20 >>=20 >> I was not successful at using "capture on" in DDB for this early-boot = context. (It hangs things after the first report.) So I've been limited = to one screen's report and only when I have it stop at the end of the = report (so it does not scroll away). (No input to DDB available that = early.) Otherwise the information just scrolls by rather quickly for = reading any detail. Still it was useful to see that other reports were = not produced after the first (when there was a first). (I can not claim = multiple are impossible. It just appears at least infrequent.) >>=20 >> I have not yet investigated making analogous powerpc/GENERIC code and = builds. >>=20 >> Nor have I dealt with having it report more detail about the peer = requests that fail. >>=20 >> Nor have I seen examples of what "not failing/%r1-unchanged" looks = like overall. >>=20 >> I still have no examples of unstable/incomplete initialization(s) or = race condition(s) to explain why both ways can and do occur from one = attempt to the next --or that difference peer requests in the sequence = can be where the problem happens.