From owner-freebsd-ppc@FreeBSD.ORG Sat Oct 18 09:36:12 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id CE7B53FF for ; Sat, 18 Oct 2014 09:36:12 +0000 (UTC) Received: from asp.reflexion.net (outbound-241.asp.reflexion.net [69.84.129.241]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6C299953 for ; Sat, 18 Oct 2014 09:36:11 +0000 (UTC) Received: (qmail 18560 invoked from network); 18 Oct 2014 09:36:09 -0000 Received: from unknown (HELO mail-cs-01.app.dca.reflexion.local) (10.81.19.1) by 0 (rfx-qmail) with SMTP; 18 Oct 2014 09:36:09 -0000 Received: by mail-cs-01.app.dca.reflexion.local (Reflexion email security v7.30.7) with SMTP; Sat, 18 Oct 2014 05:36:09 -0400 (EDT) Received: (qmail 6950 invoked from network); 18 Oct 2014 09:36:09 -0000 Received: from unknown (HELO iron2.pdx.net) (69.64.224.71) by 0 (rfx-qmail) with (DHE-RSA-AES256-SHA encrypted) SMTP; 18 Oct 2014 09:36:09 -0000 X-No-Relay: not in my network X-No-Relay: not in my network X-No-Relay: not in my network Received: from [192.168.1.8] (c-98-246-178-138.hsd1.or.comcast.net [98.246.178.138]) by iron2.pdx.net (Postfix) with ESMTPSA id 4158F1C4053; Sat, 18 Oct 2014 02:36:07 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [%r3 corrupted too] From: Mark Millard In-Reply-To: <49920E63-CB4A-429C-AB3A-984075AE183D@dsl-only.net> Date: Sat, 18 Oct 2014 02:36:07 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <0CEC8978-E208-4F57-8481-DD9C321EF673@dsl-only.net> References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> <543D5ACD.20901@freebsd.org> <3D4A76B3-431A-4C94-8747-70369A8A1764@dsl-only.net> <0F85ACBD-F6D6-4ABA-B8FA-00C586A086DE@dsl-only.net> <49920E63-CB4A-429C-AB3A-984075AE183D@dsl-only.net> To: Nathan Whitehorn X-Mailer: Apple Mail (2.1878.6) Cc: Justin Hibbits , FreeBSD PowerPC ML X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 18 Oct 2014 09:36:12 -0000 I did an experiment recording interrupt numbers in with the ofwcall = history that I'd hacked in and have been using. I did this addition to = the history in powerpc_interrupt. I managed to see a data storage = interrupt recorded in the history from a forced failure somewhat after = the normal ofwcall failure place (as evidence that the mechanism was = good enough for early boot time, which is all I needed or was after). But when the %r1!=3D%r3 detection happens in ofwcall after openfirmware = and the information was reported no interrupt was listed in the history. = I suppose openfirmware could temporarily save, replace, and later = restore some interrupt handlers. That would mean I'd not catch the = interrupt activity for such times. But it does not appear that = powerpc_interrupt is involved if any interrupts are involved in the = ofwcall failures. It would need to be something that avoids = powerpc_interrupt. Otherwise I should have seen the report = ("!!") in the history that it listed. =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 17, 2014, at 5:25 AM, Mark Millard = wrote: I noticed for the normal PowerMac G5 failing place that %r2 was 0xd23050 = (generally in my recent builds: even when it works) and that the failing = %r1 and %r3 were then 0xd23450. In other words: %r2+0x400. An accident = or is that what the code is doing when it fails? So I cleared the %r2 value in ofwcall (leaving the TOC value in the = ofwstk place and using that to put the TOC back but having %r2=3Dzero = between). Failure then produced %0x400 for %r1 and %r3: again %r2+0x400. And %r2 = stayed at zero until the ofwcall code put the TOC value back. In other words: the failing values are from adding/oring 0x400 to %r2, = whatever its value. Trying %r2=3D0x400 instead of zero resulted in %r1 and %r3 having 0x800: = So %r2+0x400 instead of %r2 | 0x400. After this I adjusted ofwcall to clear %r0, %r2, %r4-%r12 before the = openfirmware code executes and to leave them alone after. (Thus I = changed the register usage.) This allows (probably) seeing what other = volatile registers may have changed, especially when past different = value reports are considered/compared with the results for this context. = %r19-%r28 also set to zero. %r13-%r18,%r29 deliberately record things to = report on failure (breakpoint use). (Prior tests showed %r14-%r19 as = zero after so this tests if they are left alone vs. set to zero.) (The = %r13-%r18 values also are generally put to some normal use as well.) = %r30-%r31 are also put to various uses (not so simple to tell if = preserved). Results beyond %r1/%r3's corruption: %r4 gets the lr value used by = openfirmware in returning to ofwcall: %r4 no longer zero. Past activity = shows %r0 is actually set to zero if it is initially non-zero instead, = which not visible for how I did this test. The other registers that were = intended to be easily tested were preserved, including %r2. [This only tests/classifies the failing openfirmware call. So the normal = test here is of the first ofwcall once pmap_bootstrapped is true. It = happens to be a peer request. Other requests could be different in the = details for all I know. I've not hit any other ofwcall failure points so = far.] The above notes should apply even if the "Count of times ..." string is = an accident of %r2's normal value at the time of the failure (0xd23050) = --rather than a deliberate attempted addressing of the string's = reference in the .got. [.got: global offset table used, for example, for = position independent code access to things in memory.] =3D=3D=3D Mark Millard markmi at dsl-only.net More information on the odd %r1 and %r3 value... The current and recent kernels that I've built get 0xd23450 for the = corrupted values in %r1 and %r3 after openfirmware returns. So I decided to look up what that might be... objdump -h /boot/kernel/kernel shows (.got: "global object table" or = some such?) ... Sections: Idx Name Size VMA LMA File off = Algn ... 35 .got 0002f5c0 0000000000cfb248 0000000000cfb248 00bfb248 = 2**3 CONTENTS, ALLOC, LOAD, DATA 36 .dynamic 000000d0 0000000000d2a808 0000000000d2a808 00c2a808 = 2**3 CONTENTS, ALLOC, LOAD, DATA ... and objdump -s -j .got /boot/kernel/kernel shows... d23438 00000000 00bbfd48 00000000 00bbfd60 .......H.......` d23448 00000000 00bbfd90 00000000 00bbfdb0 ................ d23458 00000000 00bbfdf0 00000000 00e17dd0 ..............}. Then for 0xbbfdb0 from the above: objdump -h /boot/kernel/kernel = shows... 6 .rodata.str1.8 000834a8 0000000000b4ddf8 0000000000b4ddf8 00a4ddf8 = 2**3 CONTENTS, ALLOC, LOAD, READONLY, DATA 7 set_sysinit_set 00002538 0000000000bd12a0 0000000000bd12a0 00ad12a0 = 2**3 CONTENTS, ALLOC, LOAD, READONLY, DATA and objdump -s -j .rodata.str1.8 /boot/kernel/kernel shows... bbfda8 6f756e74 00000000 436f756e 74206f66 ount....Count of bbfdb8 2074696d 65732074 68726f74 746c696e times throttlin bbfdc8 67206261 73656420 6f6e2072 65717565 g based on reque bbfdd8 73742073 70616365 20686173 206f6363 st space has occ bbfde8 75727265 64000000 25733a20 6d617374 urred...%s: mast So 0xd23450 appears to possibly be a indirect reference to the string = "Count of times throttling based on request space has occurred" or = similar indirect content based on some offset from 0xd23450 indirectly = getting to something else through the .got section. That string that I = quoted is from /usr/src/sys/rpc/svc.c: SVCPOOL* svcpool_create(const char *name, struct sysctl_oid_list *sysctl_base) { ... SYSCTL_ADD_INT(&pool->sp_sysctl, sysctl_base, OID_AUTO, "request_space_throttle_count", CTLFLAG_RD, &pool->sp_space_throttle_count, 0, "Count of times throttling based on request space has = occurred"); } return pool; } (I have not done this lookup sequence across various FreeBSD updates and = rebuilds that also get 0xd23450 in %r1 and %r3. Nor with FreeBSD builds = that get some other corruption value. I do not know that the indirect = lookup would have always gotten to that same string.) =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 14, 2014, at 11:30 PM, Mark Millard = wrote: I added including after-ofwcall %r1 and %r3 values to my ofwcall history = buffer that I have ddb report when there is a problem. This makes it apparent that %r3 has also been corrupted when %r1 has = been. I say that because the usual/normal %r3 value is 0 in what the code = records and reports and I gather from the FreeBSD source code that the = error indicator is -1. But all along I've been reporting %r3 values for = the crashes that look more like 0xd18868 or other such. Never a 0 or -1 = (0xfff...). And the %r3 crash values even move around when the ofwstk = changes place from build to build. (This "usual"/"error-check" mix suggests %r3 from openfirmware is a = multi-bit representation of a Boolean value, with one's complemented = alternative values and zero as one of the two bit patterns --when %r3 is = not corrupted.) I also got an example of a somewhat later than normal ofwcall failure: = about 23 ofwcall's later than normal. It was not a peer request: ... OF_finddevice+0x90 powermac_smp_get_bsp+0x20 platform_smp_get_bsp+0x78 cpu_mp_start+0x24 mp_startup+0x7c mi_startup+0x10c btext_0xbc So pmap_bootstrapped had been true for a while by this point. Available = memory had been displayed as of when this example stopped to report the = %r1 change. =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 14, 2014, at 3:18 PM, Mark Millard = wrote: For openfirmware: is %r3 on return any more then a failed vs. not flag = with a particular failed-value? Is there any way to validate that %r3 = values for non-failure look reasonable vs. not looking reasonable? (For = all I know %r3 could also be corrupt.) I do not have any documentation for the PowerMac G5 openfirmware API = that is in use or the associated ABI as far as I remember. I do not know = if it strictly followed Darwin's/Mac OS X's ABI on PowerMac G5's vs. if = there was some conversion going back and forth (as there is for FreeBSD, = at least for powerpc64). For openfirmware I derive properties from what = I see in FreeBSD's code (which has to be more explicit then when a = compiler's code generation happens to match at least large parts of an = ABI directly). As I vaguely-remember Apple did not use the TOC for Darwin's/Mac OS X's = ABI but FreeBSD does. If true I do not know what other differences that = there might be (even ignoring the 32 bit vs. 64 bit issues for the = kernels). But the point would be an existence proof of at least one = difference. My understanding is that %r1 was as in FreeBSD. I vaguely seem to remember that for Darwin/Mac OS X some register was = volatile in leaf functions but non-volatile otherwise, or at least when = nested functions were involved. And that brings to mind that the = condition code sets in cr might have had a mix of volatile and = non-volatile status despite being in one register? Did Darwin/Mac OS X = have something special for register usage for Thread-Specific Storage? = Position Independent Code? Indirect Calls? Frame Pointers? I may have some Darwin/Mac OS X information around but I doubt that it = is complete, especially for the 64-bit ABI or for privileged contexts. = For the 32-bit ABI (non-priviledged) I likely have the information about = the above possible ABI properties. I assume that openfirmware avoids the FPU and other such --but I do not = know. But it is privileged code. Are there any known sources of at least some of the information for the = the PowerMac G5 openfirmware ABI(s)? What are good references for the = FreeBSD PowerPC ABI(s) (32 bit and 64 bit, privileged vs. not)? [I cut off some of the older history.] =3D=3D=3D Mark Millard markmi at dsl-only.net On Oct 14, 2014, at 10:18 AM, Nathan Whitehorn wrote: r1 *must be* preserved by the standard and for anything to work. It's = being corrupted somehow (Mark's comment about r3 is illuminating), and = if r1 is being corrupted, you can't rely on anything. I suspect it might = be an exception handling issue since it's non-deterministic, but it's = hard to tell. It could also be triggered by the way we've set up the OF = stack frame. It would be good to check if that makes sense. -Nathan On 10/14/14 09:53, Justin Hibbits wrote: > Interesting. Perhaps, instead of using %r1, and relying purely on the > stack we use yet another (non-volatile) register to hold the MSR. > Once we reload the MSR we can get back the saved registers, because > the stack will be valid again. >=20 > Nathan, thoughts? >=20 > - Justin >=20 > On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard = wrote: >> Additional notes from additional experiments... (So far from one G5.) >>=20 >> I got back trace, show registers, and my openfirmware-history list = going for failure reporting based on explicit before vs. after tests of = %r1 values. (Explicit breakpoint call for unequal, being careful to = save/restore %r3 around the call.) I filled several registers with = potentially interesting values that would otherwise have had zero as a = value (%r15-%r19, although %r15 is redundant with %r6 currently). >>=20 >> An interesting property resulted: every time %r1 had changed from = having the before-value (stack pointer value) %r1 instead ended up with = a value equal to what openfirmware put in %r3. >>=20 >> And more then that: For builds with the same ofwstk position the %r3 = value involved was fixed for the failures, for example when = 0x30400=3Dofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as = 0xd23450 for the failures. When 0x31400=3Dofwstk+0xfe0: %r3 and %r1 = ended up for failure as 0xd24450 instead. Yep: offset by the same amount = as ofwstk. >>=20 >> And I got one example where the openfirmware %r1-value-change failure = was instead much later in the boot, well after pmap_bootstrapped went = true: It was just after the message lines... >>=20 >> vgapci0: Boot video device ... >> pcib1: ... >>=20 >> with back trace (from OF_peer down): >>=20 >> .OF_peer+0x8c >> .cpcht_attach+0x884 >> .device_attach+0x3ac >> .device_probe_and_attach+0x3c >> .bus_generic_new_pass+0x12c >> .bus_generic_new_pass+0x114 >> .bus_generic_new_pass+0x114 (yep: listed twice) >> .bus_set_pass+0xc0 >> .root_bus_configure+0x14 >> .mi_startup+0x10c >> btext+0xbc >>=20 >> %r1 before: 0xc30400 ofwstk+0xfe0 >> %r1 after: 0xd23450 >> %r3 after: 0xd23450 >> FreeBSD msr to restore: 0x9000000000001032 >> ofmsr[0] to restore: 0x1000000000003030 >>=20 >> The same after-openfirmware %r1 and %r3 values that had been showing = up for the before-copyright examples of ofwcall failures. >>=20 >> And note that it again was a peer request. All the ofwcall-tied = boot-failures have been for peer requests as far as I remember. >>=20 >> I later did some experiments where I had it report but not stop when = the after-value was different from the before-value for %r1. When this = happened for these types of tests it seem to be an isolated example: = later calls normally have the stack pointer value still in %r1 after = openfirmware returns. In more detail: At most one report was made for = such a boot, the rest of the boot went fine. (Of course to get that far = my hacked ofwcall code avoids using the after-openfirmware %r1 value to = extract the 3 saved values to be restored from the bottom of ofwstk.) >>=20 >>=20 >>=20 >> I was not successful at using "capture on" in DDB for this early-boot = context. (It hangs things after the first report.) So I've been limited = to one screen's report and only when I have it stop at the end of the = report (so it does not scroll away). (No input to DDB available that = early.) Otherwise the information just scrolls by rather quickly for = reading any detail. Still it was useful to see that other reports were = not produced after the first (when there was a first). (I can not claim = multiple are impossible. It just appears at least infrequent.) >>=20 >> I have not yet investigated making analogous powerpc/GENERIC code and = builds. >>=20 >> Nor have I dealt with having it report more detail about the peer = requests that fail. >>=20 >> Nor have I seen examples of what "not failing/%r1-unchanged" looks = like overall. >>=20 >> I still have no examples of unstable/incomplete initialization(s) or = race condition(s) to explain why both ways can and do occur from one = attempt to the next --or that difference peer requests in the sequence = can be where the problem happens.