From owner-freebsd-ppc@FreeBSD.ORG Tue Oct 14 16:53:33 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id DA949AC6; Tue, 14 Oct 2014 16:53:32 +0000 (UTC) Received: from mail-la0-x22f.google.com (mail-la0-x22f.google.com [IPv6:2a00:1450:4010:c03::22f]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 1BC887BD; Tue, 14 Oct 2014 16:53:31 +0000 (UTC) Received: by mail-la0-f47.google.com with SMTP id pv20so8990263lab.34 for ; Tue, 14 Oct 2014 09:53:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=jrymTU4r2dgpsF+1CBg91zSfc6OluZuMyMpwCwm62GE=; b=m+WbN1lCtLDd/q+T72gAYtM/Er2QC8qYxN2WCJWYjpkV+6CjCvvxqSvuxu8crxSt5i /69g8IV0NUMMR92RZSDicHTqP4gCRAe1G/EDAWGImxdyxZTLMvEOt86f7kR9DsSN9HH/ EvFX97HJW7wDxI8+jwsnmoTmP4OokhoJFdLSEfKB2viJMNwF+QbuKzcGF2WZFkz5Cc+v Uxj5k+iIVhUuf8Ep6Eu1zUY6JTyMMT+ohVo32BRwbDMzet/aAKZU26vJcQ37qMTCOS79 A8TfkwpoumqPh415puwCkrfFF+nU+8cWsPyZGe0zKyU6xW4Xt5LkV+r5s1Ll+jdgIBZS MRxw== MIME-Version: 1.0 X-Received: by 10.112.146.5 with SMTP id sy5mr4281508lbb.97.1413305608356; Tue, 14 Oct 2014 09:53:28 -0700 (PDT) Received: by 10.25.0.75 with HTTP; Tue, 14 Oct 2014 09:53:28 -0700 (PDT) In-Reply-To: References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> Date: Tue, 14 Oct 2014 09:53:28 -0700 Message-ID: Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed] From: Justin Hibbits To: Mark Millard Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Cc: FreeBSD PowerPC ML X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 14 Oct 2014 16:53:33 -0000 Interesting. Perhaps, instead of using %r1, and relying purely on the stack we use yet another (non-volatile) register to hold the MSR. Once we reload the MSR we can get back the saved registers, because the stack will be valid again. Nathan, thoughts? - Justin On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard wrote: > Additional notes from additional experiments... (So far from one G5.) > > I got back trace, show registers, and my openfirmware-history list going = for failure reporting based on explicit before vs. after tests of %r1 value= s. (Explicit breakpoint call for unequal, being careful to save/restore %r3= around the call.) I filled several registers with potentially interesting = values that would otherwise have had zero as a value (%r15-%r19, although %= r15 is redundant with %r6 currently). > > An interesting property resulted: every time %r1 had changed from having = the before-value (stack pointer value) %r1 instead ended up with a value eq= ual to what openfirmware put in %r3. > > And more then that: For builds with the same ofwstk position the %r3 valu= e involved was fixed for the failures, for example when 0x30400=3Dofwstk+0x= fe0 (%r1 before) was reported %r3 and %r1 end up as 0xd23450 for the failur= es. When 0x31400=3Dofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd244= 50 instead. Yep: offset by the same amount as ofwstk. > > And I got one example where the openfirmware %r1-value-change failure was= instead much later in the boot, well after pmap_bootstrapped went true: It= was just after the message lines... > > vgapci0: Boot video device ... > pcib1: ... > > with back trace (from OF_peer down): > > .OF_peer+0x8c > .cpcht_attach+0x884 > .device_attach+0x3ac > .device_probe_and_attach+0x3c > .bus_generic_new_pass+0x12c > .bus_generic_new_pass+0x114 > .bus_generic_new_pass+0x114 (yep: listed twice) > .bus_set_pass+0xc0 > .root_bus_configure+0x14 > .mi_startup+0x10c > btext+0xbc > > %r1 before: 0xc30400 ofwstk+0xfe0 > %r1 after: 0xd23450 > %r3 after: 0xd23450 > FreeBSD msr to restore: 0x9000000000001032 > ofmsr[0] to restore: 0x1000000000003030 > > The same after-openfirmware %r1 and %r3 values that had been showing up f= or the before-copyright examples of ofwcall failures. > > And note that it again was a peer request. All the ofwcall-tied boot-fail= ures have been for peer requests as far as I remember. > > I later did some experiments where I had it report but not stop when the = after-value was different from the before-value for %r1. When this happened= for these types of tests it seem to be an isolated example: later calls no= rmally have the stack pointer value still in %r1 after openfirmware returns= . In more detail: At most one report was made for such a boot, the rest of = the boot went fine. (Of course to get that far my hacked ofwcall code avoid= s using the after-openfirmware %r1 value to extract the 3 saved values to b= e restored from the bottom of ofwstk.) > > > > I was not successful at using "capture on" in DDB for this early-boot con= text. (It hangs things after the first report.) So I've been limited to one= screen's report and only when I have it stop at the end of the report (so = it does not scroll away). (No input to DDB available that early.) Otherwise= the information just scrolls by rather quickly for reading any detail. Sti= ll it was useful to see that other reports were not produced after the firs= t (when there was a first). (I can not claim multiple are impossible. It ju= st appears at least infrequent.) > > I have not yet investigated making analogous powerpc/GENERIC code and bui= lds. > > Nor have I dealt with having it report more detail about the peer request= s that fail. > > Nor have I seen examples of what "not failing/%r1-unchanged" looks like o= verall. > > I still have no examples of unstable/incomplete initialization(s) or race= condition(s) to explain why both ways can and do occur from one attempt to= the next --or that difference peer requests in the sequence can be where t= he problem happens. > > =3D=3D=3D > Mark Millard > markmi@dsl-only.net > > On Oct 13, 2014, at 3:39 AM, Mark Millard wrote: > > While I do not yet have "show register" or other information displayed wh= en %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have now had= two cases happen for the same, unmodified boot SSD in the same PowerMac G5= : > > A) Boots without failure or finding any changes to %r1 for before vs. aft= er openfirmware calls. > > B) I had it stop the boot after the code finds that %r1 had instead chang= ed. The usual before-copyright-notice sort of timing for where it stopped, = after pmap_bootstrapped became true. (I need "show register" or other such = to have more detail.) > > > I still have no examples of unstable/incomplete initialization(s) or race= condition(s) to explain why both ways can and do occur from one attempt to= the next. Both both do. > > > > =3D=3D=3D > Mark Millard > markmi at dsl-only.net > > On Oct 12, 2014, at 11:34 PM, Mark Millard wrote= : > > Fixing stupid typos that reverse what I should have said: removing the !'= s in front of pmap_bootstrapped (from a copy/paste sequence error)... > > Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the= G5's except for when pmap_bootstrapped in (variants of) powerpc64/GENERIC6= 4. (Only covers when I had enough debug context in place to know that much.= Similarly for other notes.) These ofwcall related failures are the vast ma= jority of the boot failures that I've seen. > > ... > > The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with pmap_bootstrapped had already happened (as repor= ted by the ofwcall history list in my debug/DDB hacks). But this was before= the %r1 before and after code was in place: that is a recent addition to m= y investigation. > > > > > =3D=3D=3D > Mark Millard > markmi at dsl-only.net > > On Oct 12, 2014, at 11:20 PM, Mark Millard wrote= : > > Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the= G5's except for when !pmap_bootstrapped in (variants of) powerpc64/GENERIC= 64. (Only covers when I had enough debug context in place to know that much= . Similarly for other notes.) These ofwcall related failures are the vast m= ajority of the boot failures that I've seen. > > A nice thing about what I've found is that I can now figure out how to us= e a comparison of the before and after stack pointers and to force DDB's in= volvement if and only if they are not equal. > > That would also report %r1 differences that happen to not to produce fail= ures (if there are such). (There has to be some explanation for why sometim= es it works and sometimes it does not, say, unstable initializations, race = conditions, or something meeting both criteria.) > > Which in turn makes the general technique appropriate to powerpc/GENERIC = contexts as well. (Coding details may vary.) > > I can not promise how quickly I'll get to any specific part of this. But = I should gradually progress on it. > > > I should have mentioned some things about the kind of evidence I have vs.= do not (yet) have: > > > A) The property defining the only context where I have observed the %r1 i= ssue is as noted above. > > In all but one of the ofwcall failure cases it was the first ofwcall in t= hat !pmap_bootstrapped context that had the problem. > > The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with !pmap_bootstrapped had already happened (as repo= rted by the ofwcall history list in my debug/DDB hacks). But this was befor= e the %r1 before and after code was in place: that is a recent addition to = my investigation. > > > B) While I've not been building debug code variants for powerpc/GENERIC I= 've never seen the powerpc/GENERIC code fail to boot the G5's. And I have s= pent some sessions doing reboot after reboot to see if I'd get some failure= s (in addition to some other more normal uses). > > > C) So far I've only been looking at "show registers" when it gets a boot-= time exception that a DDB processes with the automatic script: the crashes.= I do not (yet) have any observations of what things look like during such = points for successful boots. (I'm figuring out ways to get and see the evid= ence spanning early boot time as I go.) And so I've only been looking with = such special debug code where I knew I could reproduce the failures (3 Powe= rMac G5's when using variants of powerpc64/GENERIC64.) > > In fact if the hack that I put in place completely masks the problem then= I currently would not ever observe any problem-specific information from t= he successful boots. Thus the before/after comparison would seem to be next= for my investigation. > > > > =3D=3D=3D > Mark Millard > markmi at dsl-only.net > > On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn = wrote: > > Interesting. If OF is changing the value of r1, there must be some proble= m with the ABI thunk the 64-bit kernel uses or a problem with trap handlers= . This is obviously not systematic if loader and the kernel up to that poin= t have no problems. Does a 32-bit kernel have the same problems on your har= dware? That would test whether it is the ABI translation. > -Nathan > > On 10/12/14 17:53, Mark Millard wrote: >> NOTE: I make no claim that any of the below hacks for ofwcall are approp= riate code for FreeBSD's general context. I only claim that it seems to mak= e the specific PowerMac G5 problem go away, gives solid evidence for at lea= st some of what is going on (justifying the investigative and testing hacks= ) and so gives evidence for an appropriate, more general FreeBSD solution. >> >> >> The big issue is: The PowerMac G5 openfirmware does not always preserve = the %r1 value (the stack pointer contents) that it is initially given, at l= east when the early "before copyright" crash problem is happening but possi= bly other times as well. >> >> I had the following investigative code in ofwcall, snapshotting the valu= e of %r1 before and after openfirmware's code is used: >> >> lis %r4,openfirmware_entry@ha >> ld %r4,openfirmware_entry@l(%r4) >> ... >> mr %r17,%r1 /* ADDED HACK TO RECORD %r1 before... >> /* Finally, branch to OF */ >> mtctr %r4 >> bctrl >> mr %r18,%r1 /* ADDED HACK TO RECORD %r1 after... >> >> then the DDB show registers from the crash that I'd hacked in would show= these values instead of the zeros they otherwise always display, in additi= on to what the show registers has always shown for r1. >> >> The results were like the following example for every such crash: >> >> r17 =3D 0xC31400 ofwstk+0xfe0 >> r18 =3D 0xd24450 >> r1 =3D 0xd24450 >> >> Because of that %r1 value the later code such as: >> >> /* Reload stack pointer and MSR from the OFW stack */ >> ld %r6,24(%r1) >> ld %r2,16(%r1) >> ld %r1,8(%r1) >> >> gets garbage-in/garbage-out results, including %r6 being values like 0xb= c0568 instead of the value saved msr to later be restored: 0x90000000000010= 32. >> >> So one PowerMac G5 specific hack involved in my working-boots context is= to force the original %r1 value to be used (based on %r17 being a before-c= all copy, similar to the above): >> >> ld %r6,24(%r17) >> ld %r2,16(%r17) >> ld %r1,8(%r17) >> >> But the exception report from DDB has had problems in part because sprg0= still has the openfirmware value at the time even though the exception is = after openfirmware returned (the wrong value results in the register for GE= T_CPUINFO(). So I hacked in a before-exception restore of FreeBSD= 's sprg0 inside ofwcall to make the exception handler code have that much F= reeBSD context available at the exception (if it occurs, anyway). This was = really just to help with information gathering, although I've not tested on= ly having the %r17 changes. >> >> So overall PowerMac G5 specific hacking the ofwcall code to have instead= (based on what was reported above): >> >> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S >> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> --- /usr/src/sys/powerpc/ofw/ofwcall64.S (revision 272558) >> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S (working copy) >> @@ -52,6 +52,12 @@ >> GLOBAL(rtas_entry) >> .llong 0 /* RTAS entry point */ >> + /* HACK: part of having sprg0 in place for trap */ >> +ofwsprg0save: >> + .space 8 /* sizeof(register_t) */ >> +GLOBAL(ofw_sprg0_save) >> + .llong 0 >> + >> /* >> * Open Firmware Real-mode Entry Point. This is a huge pain. >> */ >> @@ -97,6 +103,10 @@ >> lis %r4,openfirmware_entry@ha >> ld %r4,openfirmware_entry@l(%r4) >> + /* HACK: part of having FreeBSD's sprg0 in place for the exception= problem */ >> + lis %r14,ofw_sprg0_save@ha >> + ld %r14,ofw_sprg0_save@l(%r14) >> + >> /* >> * Set the MSR to the OF value. This has the side effect of disabl= ing >> * exceptions, which is important for the next few steps. >> @@ -123,14 +133,27 @@ >> stw %r5,4(%r1) >> stw %r5,0(%r1) >> + /* HACK: part of having FreeBSD's sprg0 in place for the exception= problem */ >> + lis %r6,ofwsprg0save@ha >> + std %r14,ofwsprg0save@l(%r6) >> + >> + /* HACK: part of IGNORING the later %r1 value from openfirmware */ >> + mr %r17,%r1 >> + >> /* Finally, branch to OF */ >> mtctr %r4 >> bctrl >> + /* HACK: part of having FreeBSD's sprg0 in place for the exception= problem */ >> + lis %r6,ofwsprg0save@ha >> + ld %r6,ofwsprg0save@l(%r6) >> + mtsprg0 %r6 >> + >> /* Reload stack pointer and MSR from the OFW stack */ >> - ld %r6,24(%r1) >> - ld %r2,16(%r1) >> - ld %r1,8(%r1) >> + /* HACKED to ignore the %r1 value that results from openfirmware's= call */ >> + ld %r6,24(%r17) >> + ld %r2,16(%r17) >> + ld %r1,8(%r17) >> /* Now set the real MSR */ >> mtmsrd %r6 >> >> This results in no crashes happening so far in my testing, not even the = 16 GByte RAM machine that crashed so much. >> >> NOTE: owf_machdep.c was changed to use "extern register_t ofw_sprg0_save= ;" to match the above. >> >> I still have ps3 disabled in GENERIC64 so that I can also have the sc op= tions in GENERIC64. And the DDB and GDB options are still present as well. >> >> And I still have my hack to force a DDB script that does show registers = and shows the ofwcall history information that I hacked in, even for the ve= ry early crashes before input is possible. Not that I'm now getting such ex= ecutions of the script. (A before possible-crash backtrace is also shown by= the added code. That still shows up.) >> >> I'll probably next switch to reverting the DDB related code changes and = to removing the DDB/GDB options and see how that goes. >> >> >> =3D=3D=3D >> Mark Millard >> markmi at dsl-only.net >> >> > > > > >