Date: Tue, 14 Oct 2014 09:53:28 -0700 From: Justin Hibbits <chmeeedalf@gmail.com> To: Mark Millard <markmi@dsl-only.net> Cc: FreeBSD PowerPC ML <freebsd-ppc@freebsd.org> Subject: Re: My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed] Message-ID: <CAHSQbTCKi_MBhERh6d=kX2y-=%2B2OzqpGM%2BN=ZEShi-kX2r8NPQ@mail.gmail.com> In-Reply-To: <A2AB9066-259B-4B7D-BDDC-D03AE5827E13@dsl-only.net> References: <76F704FD-BB74-4439-8318-DB4C167B420F@dsl-only.net> <543B3828.8070806@freebsd.org> <9D9B0372-8D8F-4153-85B5-40066206EF67@dsl-only.net> <379AA7FC-98C9-48B9-92BB-60E134817AF1@dsl-only.net> <C614025F-6455-4929-8468-462E76079274@dsl-only.net> <A2AB9066-259B-4B7D-BDDC-D03AE5827E13@dsl-only.net>
next in thread | previous in thread | raw e-mail | index | archive | help
Interesting. Perhaps, instead of using %r1, and relying purely on the stack we use yet another (non-volatile) register to hold the MSR. Once we reload the MSR we can get back the saved registers, because the stack will be valid again. Nathan, thoughts? - Justin On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard <markmi@dsl-only.net> wrote: > Additional notes from additional experiments... (So far from one G5.) > > I got back trace, show registers, and my openfirmware-history list going = for failure reporting based on explicit before vs. after tests of %r1 value= s. (Explicit breakpoint call for unequal, being careful to save/restore %r3= around the call.) I filled several registers with potentially interesting = values that would otherwise have had zero as a value (%r15-%r19, although %= r15 is redundant with %r6 currently). > > An interesting property resulted: every time %r1 had changed from having = the before-value (stack pointer value) %r1 instead ended up with a value eq= ual to what openfirmware put in %r3. > > And more then that: For builds with the same ofwstk position the %r3 valu= e involved was fixed for the failures, for example when 0x30400=3Dofwstk+0x= fe0 (%r1 before) was reported %r3 and %r1 end up as 0xd23450 for the failur= es. When 0x31400=3Dofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd244= 50 instead. Yep: offset by the same amount as ofwstk. > > And I got one example where the openfirmware %r1-value-change failure was= instead much later in the boot, well after pmap_bootstrapped went true: It= was just after the message lines... > > vgapci0: Boot video device ... > pcib1: <IBM CPC9X5 Hypertransport tunnel> ... > > with back trace (from OF_peer down): > > .OF_peer+0x8c > .cpcht_attach+0x884 > .device_attach+0x3ac > .device_probe_and_attach+0x3c > .bus_generic_new_pass+0x12c > .bus_generic_new_pass+0x114 > .bus_generic_new_pass+0x114 (yep: listed twice) > .bus_set_pass+0xc0 > .root_bus_configure+0x14 > .mi_startup+0x10c > btext+0xbc > > %r1 before: 0xc30400 ofwstk+0xfe0 > %r1 after: 0xd23450 > %r3 after: 0xd23450 > FreeBSD msr to restore: 0x9000000000001032 > ofmsr[0] to restore: 0x1000000000003030 > > The same after-openfirmware %r1 and %r3 values that had been showing up f= or the before-copyright examples of ofwcall failures. > > And note that it again was a peer request. All the ofwcall-tied boot-fail= ures have been for peer requests as far as I remember. > > I later did some experiments where I had it report but not stop when the = after-value was different from the before-value for %r1. When this happened= for these types of tests it seem to be an isolated example: later calls no= rmally have the stack pointer value still in %r1 after openfirmware returns= . In more detail: At most one report was made for such a boot, the rest of = the boot went fine. (Of course to get that far my hacked ofwcall code avoid= s using the after-openfirmware %r1 value to extract the 3 saved values to b= e restored from the bottom of ofwstk.) > > > > I was not successful at using "capture on" in DDB for this early-boot con= text. (It hangs things after the first report.) So I've been limited to one= screen's report and only when I have it stop at the end of the report (so = it does not scroll away). (No input to DDB available that early.) Otherwise= the information just scrolls by rather quickly for reading any detail. Sti= ll it was useful to see that other reports were not produced after the firs= t (when there was a first). (I can not claim multiple are impossible. It ju= st appears at least infrequent.) > > I have not yet investigated making analogous powerpc/GENERIC code and bui= lds. > > Nor have I dealt with having it report more detail about the peer request= s that fail. > > Nor have I seen examples of what "not failing/%r1-unchanged" looks like o= verall. > > I still have no examples of unstable/incomplete initialization(s) or race= condition(s) to explain why both ways can and do occur from one attempt to= the next --or that difference peer requests in the sequence can be where t= he problem happens. > > =3D=3D=3D > Mark Millard > markmi@dsl-only.net > > On Oct 13, 2014, at 3:39 AM, Mark Millard <markmi@dsl-only.net> wrote: > > While I do not yet have "show register" or other information displayed wh= en %r1 is changed by openfirmware... For powerpc64/GENERIC64 I have now had= two cases happen for the same, unmodified boot SSD in the same PowerMac G5= : > > A) Boots without failure or finding any changes to %r1 for before vs. aft= er openfirmware calls. > > B) I had it stop the boot after the code finds that %r1 had instead chang= ed. The usual before-copyright-notice sort of timing for where it stopped, = after pmap_bootstrapped became true. (I need "show register" or other such = to have more detail.) > > > I still have no examples of unstable/incomplete initialization(s) or race= condition(s) to explain why both ways can and do occur from one attempt to= the next. Both both do. > > > > =3D=3D=3D > Mark Millard > markmi at dsl-only.net > > On Oct 12, 2014, at 11:34 PM, Mark Millard <markmi at dsl-only.net> wrote= : > > Fixing stupid typos that reverse what I should have said: removing the !'= s in front of pmap_bootstrapped (from a copy/paste sequence error)... > > Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the= G5's except for when pmap_bootstrapped in (variants of) powerpc64/GENERIC6= 4. (Only covers when I had enough debug context in place to know that much.= Similarly for other notes.) These ofwcall related failures are the vast ma= jority of the boot failures that I've seen. > > ... > > The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with pmap_bootstrapped had already happened (as repor= ted by the ofwcall history list in my debug/DDB hacks). But this was before= the %r1 before and after code was in place: that is a recent addition to m= y investigation. > > > > > =3D=3D=3D > Mark Millard > markmi at dsl-only.net > > On Oct 12, 2014, at 11:20 PM, Mark Millard <markmi at dsl-only.net> wrote= : > > Quick summary: I've never seen FreeBSD boots fail "around" ofwcall on the= G5's except for when !pmap_bootstrapped in (variants of) powerpc64/GENERIC= 64. (Only covers when I had enough debug context in place to know that much= . Similarly for other notes.) These ofwcall related failures are the vast m= ajority of the boot failures that I've seen. > > A nice thing about what I've found is that I can now figure out how to us= e a comparison of the before and after stack pointers and to force DDB's in= volvement if and only if they are not equal. > > That would also report %r1 differences that happen to not to produce fail= ures (if there are such). (There has to be some explanation for why sometim= es it works and sometimes it does not, say, unstable initializations, race = conditions, or something meeting both criteria.) > > Which in turn makes the general technique appropriate to powerpc/GENERIC = contexts as well. (Coding details may vary.) > > I can not promise how quickly I'll get to any specific part of this. But = I should gradually progress on it. > > > I should have mentioned some things about the kind of evidence I have vs.= do not (yet) have: > > > A) The property defining the only context where I have observed the %r1 i= ssue is as noted above. > > In all but one of the ofwcall failure cases it was the first ofwcall in t= hat !pmap_bootstrapped context that had the problem. > > The only other ofwcall failure that I've seen happened only once and was = where prior ofwcall's with !pmap_bootstrapped had already happened (as repo= rted by the ofwcall history list in my debug/DDB hacks). But this was befor= e the %r1 before and after code was in place: that is a recent addition to = my investigation. > > > B) While I've not been building debug code variants for powerpc/GENERIC I= 've never seen the powerpc/GENERIC code fail to boot the G5's. And I have s= pent some sessions doing reboot after reboot to see if I'd get some failure= s (in addition to some other more normal uses). > > > C) So far I've only been looking at "show registers" when it gets a boot-= time exception that a DDB processes with the automatic script: the crashes.= I do not (yet) have any observations of what things look like during such = points for successful boots. (I'm figuring out ways to get and see the evid= ence spanning early boot time as I go.) And so I've only been looking with = such special debug code where I knew I could reproduce the failures (3 Powe= rMac G5's when using variants of powerpc64/GENERIC64.) > > In fact if the hack that I put in place completely masks the problem then= I currently would not ever observe any problem-specific information from t= he successful boots. Thus the before/after comparison would seem to be next= for my investigation. > > > > =3D=3D=3D > Mark Millard > markmi at dsl-only.net > > On Oct 12, 2014, at 7:25 PM, Nathan Whitehorn <nwhitehorn at freebsd.org>= wrote: > > Interesting. If OF is changing the value of r1, there must be some proble= m with the ABI thunk the 64-bit kernel uses or a problem with trap handlers= . This is obviously not systematic if loader and the kernel up to that poin= t have no problems. Does a 32-bit kernel have the same problems on your har= dware? That would test whether it is the ABI translation. > -Nathan > > On 10/12/14 17:53, Mark Millard wrote: >> NOTE: I make no claim that any of the below hacks for ofwcall are approp= riate code for FreeBSD's general context. I only claim that it seems to mak= e the specific PowerMac G5 problem go away, gives solid evidence for at lea= st some of what is going on (justifying the investigative and testing hacks= ) and so gives evidence for an appropriate, more general FreeBSD solution. >> >> >> The big issue is: The PowerMac G5 openfirmware does not always preserve = the %r1 value (the stack pointer contents) that it is initially given, at l= east when the early "before copyright" crash problem is happening but possi= bly other times as well. >> >> I had the following investigative code in ofwcall, snapshotting the valu= e of %r1 before and after openfirmware's code is used: >> >> lis %r4,openfirmware_entry@ha >> ld %r4,openfirmware_entry@l(%r4) >> ... >> mr %r17,%r1 /* ADDED HACK TO RECORD %r1 before... >> /* Finally, branch to OF */ >> mtctr %r4 >> bctrl >> mr %r18,%r1 /* ADDED HACK TO RECORD %r1 after... >> >> then the DDB show registers from the crash that I'd hacked in would show= these values instead of the zeros they otherwise always display, in additi= on to what the show registers has always shown for r1. >> >> The results were like the following example for every such crash: >> >> r17 =3D 0xC31400 ofwstk+0xfe0 >> r18 =3D 0xd24450 >> r1 =3D 0xd24450 >> >> Because of that %r1 value the later code such as: >> >> /* Reload stack pointer and MSR from the OFW stack */ >> ld %r6,24(%r1) >> ld %r2,16(%r1) >> ld %r1,8(%r1) >> >> gets garbage-in/garbage-out results, including %r6 being values like 0xb= c0568 instead of the value saved msr to later be restored: 0x90000000000010= 32. >> >> So one PowerMac G5 specific hack involved in my working-boots context is= to force the original %r1 value to be used (based on %r17 being a before-c= all copy, similar to the above): >> >> ld %r6,24(%r17) >> ld %r2,16(%r17) >> ld %r1,8(%r17) >> >> But the exception report from DDB has had problems in part because sprg0= still has the openfirmware value at the time even though the exception is = after openfirmware returned (the wrong value results in the register for GE= T_CPUINFO(<register>). So I hacked in a before-exception restore of FreeBSD= 's sprg0 inside ofwcall to make the exception handler code have that much F= reeBSD context available at the exception (if it occurs, anyway). This was = really just to help with information gathering, although I've not tested on= ly having the %r17 changes. >> >> So overall PowerMac G5 specific hacking the ofwcall code to have instead= (based on what was reported above): >> >> root@FBSDG5M1:~ # svnlite diff /usr/src/sys/powerpc/ofw/ofwcall64.S >> Index: /usr/src/sys/powerpc/ofw/ofwcall64.S >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> --- /usr/src/sys/powerpc/ofw/ofwcall64.S (revision 272558) >> +++ /usr/src/sys/powerpc/ofw/ofwcall64.S (working copy) >> @@ -52,6 +52,12 @@ >> GLOBAL(rtas_entry) >> .llong 0 /* RTAS entry point */ >> + /* HACK: part of having sprg0 in place for trap */ >> +ofwsprg0save: >> + .space 8 /* sizeof(register_t) */ >> +GLOBAL(ofw_sprg0_save) >> + .llong 0 >> + >> /* >> * Open Firmware Real-mode Entry Point. This is a huge pain. >> */ >> @@ -97,6 +103,10 @@ >> lis %r4,openfirmware_entry@ha >> ld %r4,openfirmware_entry@l(%r4) >> + /* HACK: part of having FreeBSD's sprg0 in place for the exception= problem */ >> + lis %r14,ofw_sprg0_save@ha >> + ld %r14,ofw_sprg0_save@l(%r14) >> + >> /* >> * Set the MSR to the OF value. This has the side effect of disabl= ing >> * exceptions, which is important for the next few steps. >> @@ -123,14 +133,27 @@ >> stw %r5,4(%r1) >> stw %r5,0(%r1) >> + /* HACK: part of having FreeBSD's sprg0 in place for the exception= problem */ >> + lis %r6,ofwsprg0save@ha >> + std %r14,ofwsprg0save@l(%r6) >> + >> + /* HACK: part of IGNORING the later %r1 value from openfirmware */ >> + mr %r17,%r1 >> + >> /* Finally, branch to OF */ >> mtctr %r4 >> bctrl >> + /* HACK: part of having FreeBSD's sprg0 in place for the exception= problem */ >> + lis %r6,ofwsprg0save@ha >> + ld %r6,ofwsprg0save@l(%r6) >> + mtsprg0 %r6 >> + >> /* Reload stack pointer and MSR from the OFW stack */ >> - ld %r6,24(%r1) >> - ld %r2,16(%r1) >> - ld %r1,8(%r1) >> + /* HACKED to ignore the %r1 value that results from openfirmware's= call */ >> + ld %r6,24(%r17) >> + ld %r2,16(%r17) >> + ld %r1,8(%r17) >> /* Now set the real MSR */ >> mtmsrd %r6 >> >> This results in no crashes happening so far in my testing, not even the = 16 GByte RAM machine that crashed so much. >> >> NOTE: owf_machdep.c was changed to use "extern register_t ofw_sprg0_save= ;" to match the above. >> >> I still have ps3 disabled in GENERIC64 so that I can also have the sc op= tions in GENERIC64. And the DDB and GDB options are still present as well. >> >> And I still have my hack to force a DDB script that does show registers = and shows the ofwcall history information that I hacked in, even for the ve= ry early crashes before input is possible. Not that I'm now getting such ex= ecutions of the script. (A before possible-crash backtrace is also shown by= the added code. That still shows up.) >> >> I'll probably next switch to reverting the DDB related code changes and = to removing the DDB/GDB options and see how that goes. >> >> >> =3D=3D=3D >> Mark Millard >> markmi at dsl-only.net >> >> > > > > >
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAHSQbTCKi_MBhERh6d=kX2y-=%2B2OzqpGM%2BN=ZEShi-kX2r8NPQ>