Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 24 Sep 2014 09:40:10 -0700
From:      Justin Hibbits <chmeeedalf@gmail.com>
To:        Nathan Whitehorn <nwhitehorn@freebsd.org>
Cc:        FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>, Mark Millard <markmi@dsl-only.net>
Subject:   Re: lr=u_trap+0x10 and srr0=k_trap+0x28 for "stopped at 0 illegal instruction 0" before-copyright hang on PowerMac G5's
Message-ID:  <20140924094010.68225ddf@zhabar.attlocal.net>
In-Reply-To: <5422E513.6010806@freebsd.org>
References:  <1118046C-0FF7-49FC-82DA-DB9A7A310991@dsl-only.net> <2ED3DB50-B985-4382-8FF2-3B44E7E65453@dsl-only.net> <CAHSQbTBXxrgXQdNeCs=C5wJaT_bmh9FU836O6VnJDbJuqCUujw@mail.gmail.com> <6D729F43-662A-429E-9503-0148EC3250B1@dsl-only.net> <72535F89-3942-45A6-B351-7F746209ED9F@dsl-only.net> <0703EF26-6E33-4446-9273-BBFD0CB72893@dsl-only.net> <37575F94-763C-43BF-8DD9-F648F4A7C09F@dsl-only.net> <5422E513.6010806@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
I had started testing callbacks, but didn't get any farther than
panicking the system with a bad callback handler :)  Needless to say, I
didn't get too far.  Haven't tried restoring the vectors.

- Justin


On Wed, 24 Sep 2014 08:36:51 -0700
Nathan Whitehorn <nwhitehorn@freebsd.org> wrote:

> There shouldn't be any exceptions at that point, nested or otherwise. 
> What I suspect is happening is that Open Firmware has turned them on
> for some bizarre reason, taken one, and ended up in the kernel's
> handlers but with the Open Firmware environment. Saving and restoring
> the OF interrupt vectors would be a possible solution; flattening the
> device tree in loader so that the kernel doesn't call Open Firmware
> at all would be another. I think Justin may have tried the first at
> some point. -Nathan
> 
> On 09/24/14 02:04, Mark Millard wrote:
> > Now that I've had a kernel/boot crash with a successful DDB bt and 
> > show registers (a different submittal) it makes for a good 
> > comparison/contrast with what DDB reports for this "before
> > copyright" crash.
> >
> > Something unique to the "before copyright" context is...
> >
> > No registers are reported to have values that point into the range 
> > between tmpstk and esym.
> >
> > In other words: There is no valid stack pointer reported as far as
> > I can tell. r1 has the value 0 instead of being a handling a valid
> > stack address. tmpstk=0xbd7000 and esym=0xbdb000 (example for one
> > of my WITH_DEBUG_FILES= and options DDB and GDB builds of
> > 10.1-BETA2). That at least gives a ball park on the range to expect
> > for pointing into the stack even with some build variation.
> >
> > It leaves me wondering if the DDB report is for a nested exception 
> > handling. That could explain why lr points to u_trap+0x10 and srr0 
> > points to k_trap+0x28 when normally srr0 would point to the the 
> > failing instruction (or the instruction after) and lr to where that 
> > routine would normally return to.
> >
> > The register values that are reported for my 10.1-BETA2 builds that 
> > crash before the copyright notice are:
> >
> > r0: 0
> > r1: 0
> > r2: 0xc81538 vop_unlock_desc
> > r3: 0xd18868
> > r4: 0x894b58
> > r5: 0
> > r6: 0xc1dee0 M_AUDITBSM
> > r7: 0xe3f818 ofw_real_mode
> > r8: 0x1
> > r9: 0xe0f580 __pcpu
> > r10: 0x1c35ec0
> > r11: 0
> > r12: 0x10000000
> > r13: 0xdbb290 thread0 (Note: another submittal has this mistyped as 
> > 0xdbb290.)
> > r14-r19: all 0
> > r20: 0x10c1000
> > r21: 0x4
> > r22: 0x180abd4
> > r23: 0x1803a28
> > r24: 0xc000000000008760
> > r25: 0xcc89b8 smp_no...
> > r26: 0xcea108 ofw_rend...
> > r27: 0x894b58 ofwcall+0xa8
> > r28: 0x894b58 ofwcall+0xa8
> > r29: 2400022
> > r30: 9000000000001032
> > r31: 0xbb7d38
> >
> > srr0: 0x102720 k_trap+0x28
> > srr1: 9000000000001032
> > lr: 0x1026f0 u_trap+0x10
> > ctr: 0xff846d78
> > cr: 2000deb0
> > xer: 0
> > dar: f...d50 (lots of f's)
> > dsisr: 42000000
> >
> >
> >
> >
> >
> >
> > ===
> > Mark Millard
> > markmi at dsl-only.net <http://dsl-only.net>;
> >
> > On Sep 20, 2014, at 3:42 PM, Mark Millard <markmi at dsl-only.net 
> > <http://dsl-only.net>>; wrote:
> >
> > [I corrected the SSR0 in the subject to be SRR0.]
> >
> > I did miss a register in my list (it matched the shown r30 value).
> > And it turns out to probably be very important to interpreting what
> > the "show registers" is reporting:
> >
> > SRR1: 0x9000000000001032
> >
> > But bits 43-46 of SRR1 are supposed to indicate which type of
> > Program Exception, using a single binary 1 to so. No such 1's are
> > present.
> >
> > Illegal instruction would have been bit 44 being 1. (PowerPC has
> > the upper bit numbered zero and increases from there.)
> >
> > So the ddb "show registers" is apparently not reporting the status
> > as of when the "stopped at 0 illegal instruction 0" happened. Thus
> > other things are also likely not from that exact time frame.
> >
> >
> >
> > And I misinterpreted the LR value status: The LR value was just
> > left over from the restore_kernsrs returning when it finished.
> > Execution then flowed into k_trap. Nothing unusual involved.
> >
> >
> >
> >
> >
> > ===
> > Mark Millard
> > markmi@dsl-only.net <mailto:markmi@dsl-only.net>
> >
> > On Sep 18, 2014, at 8:57 PM, Mark Millard <markmi@dsl-only.net 
> > <mailto:markmi@dsl-only.net>> wrote:
> >
> > I modified DDB to automatically "show registers" even at the early 
> > "before Copyright" crash time. The end of this note will show the 
> > /usr/src/sys/ddb/db_script.c diff for the hack. While I also had
> > DDB bt, the bt does not actually print a back trace for this
> > context. (It might for others.)
> >
> > The registers give interesting context despite the lack of a back 
> > trace. I do not know if it will be sufficient to be of much
> > immediate help if someone used the information to start looking at
> > the problem.
> >
> > I'll start with register lr: 0x1026f0 u_trap+0x10.
> >
> > /usr/src/sys/powerpc/aim/trap_subr64.S has:
> >
> > s_trap:
> >         bf      17,k_trap               /* branch if PSL_PR is
> > false */ GET_CPUINFO(%r1)
> > u_trap:
> >         ld      %r1,PC_CURPCB(%r1)
> >         mr      %r27,%r28               /* Save LR, r29 */
> >         mtsprg2 %r29
> >         bl      restore_kernsrs         /* enable kernel mapping */
> >         mfsprg2 %r29
> >         mr      %r28,%r27
> >
> > /*
> >  * Now the common trap catching code.
> >  */
> > k_trap:
> >         FRAME_SETUP(PC_TEMPSAVE)
> > /* Call C interrupt dispatcher: */
> > trapagain:
> >
> > and so this appears to indicate a pending return to execute the 
> > "mfsprg2 %r29" after "bl restore_kernsrs", which indicates that 
> > restore_kernsrs should be active.
> >
> > But register srr0 indicates: 0x102720 k_trap+0x28. (So apparently
> > in FRAME_SETUP(PC_TEMPSAVE) someplace.)
> >
> > So it appears to me that the processor got to the k_trap code
> > during the supposed restore_kernsrs time frame. (But I'm no expert
> > at these sorts of things or for the processor.)
> >
> > I'll list the other register values:
> >
> > r0:  0
> > r1:  0
> > r2:  0xc1be80 M_AUDITBSM
> > r3:  0xb16138
> > r4:  0x8926e8 .ofwcall+0xa8
> > r5:  0
> > r6:  0xbb5f90
> > r7:  0xe3d118 ofw_real_mode
> > r8:  0x1
> > r9:  0xe0ce80 __pcpu
> > r10: 0x1c35ec9
> > r11: 0
> > r12: 0x10000000
> > r13: db890    thread0
> > r14-r19: all 0
> > r20: 0x10bc000
> > r21: 0x4
> > r22: 0x1801db4
> > r23: 0x1803a28
> > r24: 0xc000000000008760
> > r25: 0xcc6908 smp_no_rendevous_barrier
> > r26: 0xec79e0 ofw_rendezvous_dispatch (yep one has v and the other
> > zv) r27: 0x8926e8 .ofwcall+0xa8
> > r28: 0x8926e8 .ofwcall+0xa8 (yep: same value)
> > r29: 0x24000022
> > r30: 0x9000000000001032
> > r31: 0xc7f488 vop_unlock_desc
> >
> > ctr: 0xff846d78
> > cr:  0x2000d7b0
> > xer: 0
> > dar: 0xfffffffffffffd50
> > dsisr: 0x42000000
> >
> > (Hopefully this manual transcription from the screen display is 
> > complete --and also accurate for what it does present.)
> >
> >
> >
> >
> > The personal HACK to /usr/src/sys/ddb/db_script.c's 
> > db_script_kdbenter(...) to have it show registers and try bt...
> >
> > $ cd /usr/src/sys/ddb/
> > $ svnlite diff .
> > Index: db_script.c
> > ===================================================================
> > --- db_script.c(revision 271610)
> > +++ db_script.c(working copy)
> > @@ -319,10 +319,25 @@
> >  {
> > char scriptname[DB_MAXSCRIPTNAME];
> >
> > +/* HACK!!! : Additional lines to force a basic default script to
> > exist. +* Will dump information even if ddb input is not available
> > for early crash.
> > +* Used to get more information about PowerMac G5 "before
> > Copyright" hangs.
> > +*/
> > +struct ddb_script *dsp =
> > db_script_lookup(DB_SCRIPT_KDBENTER_DEFAULT); +if (!dsp)
> > db_script_set(DB_SCRIPT_KDBENTER_DEFAULT, "show registers; bt");
> > +
> > snprintf(scriptname, sizeof(scriptname), "%s.%s",
> > DB_SCRIPT_KDBENTER_PREFIX, eventname);
> > if (db_script_exec(scriptname, 0) == ENOENT)
> > (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
> > +
> > +/* HACK!!! : Additional lines to always use the default script,
> > +*           even if scriptname existed and was executed.
> > +* Will dump information even if ddb input is not available for
> > early crash.
> > +* Used to get more information about PowerMac G5 "before
> > Copyright" hangs.
> > +*/
> > +else
> > +(void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
> >  }
> >
> >  /*-
> >
> >
> >
> > ===
> > Mark Millard
> > markmi at dsl-only.net <http://dsl-only.net/>;
> >
> > On Sep 16, 2014, at 9:28 PM, Mark Millard <markmi at dsl-only.net 
> > <http://dsl-only.net/>>; wrote:
> >
> > In part I sent directly to you because of a past exchange (July-27) 
> > where you had written:
> >
> >> Nathan and I both speculate that it's
> >> dropping into Open Firmware (we make extensive use of OFW), and
> >> then messing something up, taking a page fault or something.
> >
> > The specific text that I report and its uniformity when it is
> > produced seems to add a little information beyond a speculated
> > "page fault or something" and so might eventually help a little. As
> > I understand the text it is reporting execution reaching address
> > zero without any prior un-handled exceptions or other such that
> > would stop it. A corrupted stack (pointer) so a bad return address
> > or some such? I'd guess there are no explicit jumps to address zero
> > so I expect that indirection is likely involved, with the content
> > for the indirection messed up.
> >
> > I really wish that I had a logic analyzer configuration for this.
> > I've not found a way to make the failing context visible so far and
> > the extra way of looking at things might have helped.
> >
> >
> >
> >
> > ===
> > Mark Millard
> > markmi@dsl-only.net <mailto:markmi@dsl-only.net>
> >
> > On Sep 16, 2014, at 8:28 PM, Justin Hibbits <chmeeedalf@gmail.com 
> > <mailto:chmeeedalf@gmail.com>> wrote:
> >
> > Hi mark,
> >
> > I see this on my G5, and I think it's due to the amount of RAM in
> > the machine. More than 4gb seems to confuse open firmware when
> > called by FreeBSD. There is some effort to remove the need of the
> > callbacks but thus far it's not far along. The good news is that
> > after it boots it's solid except when switching vtys, buy earlier
> > this year or last year I added a sysctl hack to disable the call
> > into open firmware on vty switch (don't recall offhand and not at
> > my computer right now, but if you grep the sysctl output for reset
> > and ofw you can find it).
> >
> > -Justin
> >
> > On Sep 16, 2014 8:01 PM, "Mark Millard" <markmi@dsl-only.net 
> > <mailto:markmi@dsl-only.net>> wrote:
> >
> >     I've now spent time with rebooting and power-off/power-on for
> > all 3 PowerMac G5's (one PowerMac7,2 and two PowerMac11,2's) and
> > all 3 get the
> >
> >>     GDB: no debug ports present
> >>     KDB: debugger backends: DDB
> >>     KDB: current backend: DDB
> >>     [ thread pid -1 tid 1006665719 ]
> >>     Stopped at 0: illegal instruction 0
> >>     db>
> >
> >     when they fail just before the Copyright notice would normally
> > be displayed. None fail any earlier. At that spot none have failed
> >     any other way. It is the same SSD in all 3. (Happens with other
> >     SSD's as well.) Overall there is a mix of Radeon and NVIDIA
> >     display boards. Besides the SSD use and RAM upgrades the rest is
> >     stock equipment. scons used, not vt. (I've yet to try vt.)
> >
> >     Seeing a failure after the Copyright notice as been fairly rare
> > in all my experiments from when I started last April or so. The ones
> >     that I've noted had Data Storage Interrupt reported. So far no
> >     examples of the above have been reported after the Copyright
> >     notice. So I'd guess that they are separate issues. Of course it
> >     seems that only in the last few days would I have seen the above
> >     sort of thing if it did happen after the Copyright notice: The
> >     prior history does not count for judgements about that.
> >
> >     ===
> >     Mark Millard
> >     markmi at dsl-only.net <http://dsl-only.net/>;
> >
> >     On Sep 16, 2014, at 8:15 AM, Mark Millard <markmi@dsl-only.net
> >     <mailto:markmi@dsl-only.net>> wrote:
> >
> >     Using 10.1-BETA1 I added "options DDB" and "options GDB" to
> >     powerpc64's GENERIC64. (I also used WITH_DEBUG_FILES=,
> >     WITHOUT_CLANG=, and WITH_DEBUG= in /etc/make.conf.) So
> > buildworld, kernel was basically just set up to have more of a
> > debugging context around (including for any ports builds).
> >
> >     The result was new information about the PowerMac G5 boot
> > hangups: The screen is no longer blank when the G5 is hung up
> > without there being a Copyright notice yet. It says...
> >
> >>     GDB: no debug ports present
> >>     KDB: debugger backends: DDB
> >>     KDB: current backend: DDB
> >>     [ thread pid -1 tid 1006665719 ]
> >>     Stopped at 0: illegal instruction 0
> >>     db>
> >
> >     (I had no ability to input at that point.) Normally the
> > Copyright notice would have displayed instead of "[...]" and what
> > follows. (I do not claim to have all the spacing, capitalization,
> > and such correct above.)
> >
> >     That text is constant from hang to hang when it hangs just
> > before it would normally output the Copyright notice: The numbers
> > do not vary, much less the other text. It has never failed until
> > after the two KDB messages are present. So far I've only tested one
> >     PowerMac G5, booting over and over for a few hours.
> >
> >
> >
> >     (I do not claim to be set up for remote kernel debugging. I just
> >     decided to let GDB go along for the ride when I added DDB.)
> >
> >     ===
> >     Mark Millard
> >     markmi at dsl-only.net <http://dsl-only.net/>;
> >
> >
> >
> >
> >
> >
> 




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20140924094010.68225ddf>