From owner-freebsd-ppc@FreeBSD.ORG Wed Sep 24 16:40:17 2014 Return-Path: Delivered-To: freebsd-ppc@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 822E96CD; Wed, 24 Sep 2014 16:40:17 +0000 (UTC) Received: from mail-qa0-x231.google.com (mail-qa0-x231.google.com [IPv6:2607:f8b0:400d:c00::231]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 250BF910; Wed, 24 Sep 2014 16:40:17 +0000 (UTC) Received: by mail-qa0-f49.google.com with SMTP id n8so3365174qaq.22 for ; Wed, 24 Sep 2014 09:40:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:in-reply-to:references :mime-version:content-type:content-transfer-encoding; bh=7WIKIUNVACDBizWPVl5s1blsrmII291WZ3k+kO2um7c=; b=ym76F9Jfc/ooeFF0hqL0W2GCdNz+Boh/Azwz/PKWqNa4X915joZOhA2iXYZcMe6egP rYN8nULxwG2Szbp2ext0I3dqo2UUiRK/WNhH6WHWiwL4OCvbkZA780mXEE3RW+1oHi4S K0GcYE6u6SDvDx4hWDKrWEy/kO5K4yxN0EjII8TU8kOdOj0qBJsn2Oepa/XEY5H6cR4X 31FU8QM4uUi8ceg6dZ9K49qb/zglYGwO8zMoKPPMSl6pEBkVTdV/dIt22xncn8M0hRbN HPqiYKzp1KPqjTG0qS6iR2KAWk10vYpIHo79jPcfj5OJjn6140mgomNrBdDBJj5zGlye SJjw== X-Received: by 10.224.161.11 with SMTP id p11mr10978478qax.40.1411576816270; Wed, 24 Sep 2014 09:40:16 -0700 (PDT) Received: from zhabar.attlocal.net (107-222-186-3.lightspeed.sntcca.sbcglobal.net. [107.222.186.3]) by mx.google.com with ESMTPSA id k10sm13244989qaj.7.2014.09.24.09.40.14 for (version=SSLv3 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 24 Sep 2014 09:40:15 -0700 (PDT) Date: Wed, 24 Sep 2014 09:40:10 -0700 From: Justin Hibbits To: Nathan Whitehorn Subject: Re: lr=u_trap+0x10 and srr0=k_trap+0x28 for "stopped at 0 illegal instruction 0" before-copyright hang on PowerMac G5's Message-ID: <20140924094010.68225ddf@zhabar.attlocal.net> In-Reply-To: <5422E513.6010806@freebsd.org> References: <1118046C-0FF7-49FC-82DA-DB9A7A310991@dsl-only.net> <2ED3DB50-B985-4382-8FF2-3B44E7E65453@dsl-only.net> <6D729F43-662A-429E-9503-0148EC3250B1@dsl-only.net> <72535F89-3942-45A6-B351-7F746209ED9F@dsl-only.net> <0703EF26-6E33-4446-9273-BBFD0CB72893@dsl-only.net> <37575F94-763C-43BF-8DD9-F648F4A7C09F@dsl-only.net> <5422E513.6010806@freebsd.org> X-Mailer: Claws Mail 3.10.1 (GTK+ 2.24.22; powerpc64-portbld-freebsd11.0) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: FreeBSD PowerPC ML , Mark Millard X-BeenThere: freebsd-ppc@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Porting FreeBSD to the PowerPC List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 24 Sep 2014 16:40:17 -0000 I had started testing callbacks, but didn't get any farther than panicking the system with a bad callback handler :) Needless to say, I didn't get too far. Haven't tried restoring the vectors. - Justin On Wed, 24 Sep 2014 08:36:51 -0700 Nathan Whitehorn wrote: > There shouldn't be any exceptions at that point, nested or otherwise. > What I suspect is happening is that Open Firmware has turned them on > for some bizarre reason, taken one, and ended up in the kernel's > handlers but with the Open Firmware environment. Saving and restoring > the OF interrupt vectors would be a possible solution; flattening the > device tree in loader so that the kernel doesn't call Open Firmware > at all would be another. I think Justin may have tried the first at > some point. -Nathan > > On 09/24/14 02:04, Mark Millard wrote: > > Now that I've had a kernel/boot crash with a successful DDB bt and > > show registers (a different submittal) it makes for a good > > comparison/contrast with what DDB reports for this "before > > copyright" crash. > > > > Something unique to the "before copyright" context is... > > > > No registers are reported to have values that point into the range > > between tmpstk and esym. > > > > In other words: There is no valid stack pointer reported as far as > > I can tell. r1 has the value 0 instead of being a handling a valid > > stack address. tmpstk=0xbd7000 and esym=0xbdb000 (example for one > > of my WITH_DEBUG_FILES= and options DDB and GDB builds of > > 10.1-BETA2). That at least gives a ball park on the range to expect > > for pointing into the stack even with some build variation. > > > > It leaves me wondering if the DDB report is for a nested exception > > handling. That could explain why lr points to u_trap+0x10 and srr0 > > points to k_trap+0x28 when normally srr0 would point to the the > > failing instruction (or the instruction after) and lr to where that > > routine would normally return to. > > > > The register values that are reported for my 10.1-BETA2 builds that > > crash before the copyright notice are: > > > > r0: 0 > > r1: 0 > > r2: 0xc81538 vop_unlock_desc > > r3: 0xd18868 > > r4: 0x894b58 > > r5: 0 > > r6: 0xc1dee0 M_AUDITBSM > > r7: 0xe3f818 ofw_real_mode > > r8: 0x1 > > r9: 0xe0f580 __pcpu > > r10: 0x1c35ec0 > > r11: 0 > > r12: 0x10000000 > > r13: 0xdbb290 thread0 (Note: another submittal has this mistyped as > > 0xdbb290.) > > r14-r19: all 0 > > r20: 0x10c1000 > > r21: 0x4 > > r22: 0x180abd4 > > r23: 0x1803a28 > > r24: 0xc000000000008760 > > r25: 0xcc89b8 smp_no... > > r26: 0xcea108 ofw_rend... > > r27: 0x894b58 ofwcall+0xa8 > > r28: 0x894b58 ofwcall+0xa8 > > r29: 2400022 > > r30: 9000000000001032 > > r31: 0xbb7d38 > > > > srr0: 0x102720 k_trap+0x28 > > srr1: 9000000000001032 > > lr: 0x1026f0 u_trap+0x10 > > ctr: 0xff846d78 > > cr: 2000deb0 > > xer: 0 > > dar: f...d50 (lots of f's) > > dsisr: 42000000 > > > > > > > > > > > > > > === > > Mark Millard > > markmi at dsl-only.net > > > > On Sep 20, 2014, at 3:42 PM, Mark Millard > > wrote: > > > > [I corrected the SSR0 in the subject to be SRR0.] > > > > I did miss a register in my list (it matched the shown r30 value). > > And it turns out to probably be very important to interpreting what > > the "show registers" is reporting: > > > > SRR1: 0x9000000000001032 > > > > But bits 43-46 of SRR1 are supposed to indicate which type of > > Program Exception, using a single binary 1 to so. No such 1's are > > present. > > > > Illegal instruction would have been bit 44 being 1. (PowerPC has > > the upper bit numbered zero and increases from there.) > > > > So the ddb "show registers" is apparently not reporting the status > > as of when the "stopped at 0 illegal instruction 0" happened. Thus > > other things are also likely not from that exact time frame. > > > > > > > > And I misinterpreted the LR value status: The LR value was just > > left over from the restore_kernsrs returning when it finished. > > Execution then flowed into k_trap. Nothing unusual involved. > > > > > > > > > > > > === > > Mark Millard > > markmi@dsl-only.net > > > > On Sep 18, 2014, at 8:57 PM, Mark Millard > > wrote: > > > > I modified DDB to automatically "show registers" even at the early > > "before Copyright" crash time. The end of this note will show the > > /usr/src/sys/ddb/db_script.c diff for the hack. While I also had > > DDB bt, the bt does not actually print a back trace for this > > context. (It might for others.) > > > > The registers give interesting context despite the lack of a back > > trace. I do not know if it will be sufficient to be of much > > immediate help if someone used the information to start looking at > > the problem. > > > > I'll start with register lr: 0x1026f0 u_trap+0x10. > > > > /usr/src/sys/powerpc/aim/trap_subr64.S has: > > > > s_trap: > > bf 17,k_trap /* branch if PSL_PR is > > false */ GET_CPUINFO(%r1) > > u_trap: > > ld %r1,PC_CURPCB(%r1) > > mr %r27,%r28 /* Save LR, r29 */ > > mtsprg2 %r29 > > bl restore_kernsrs /* enable kernel mapping */ > > mfsprg2 %r29 > > mr %r28,%r27 > > > > /* > > * Now the common trap catching code. > > */ > > k_trap: > > FRAME_SETUP(PC_TEMPSAVE) > > /* Call C interrupt dispatcher: */ > > trapagain: > > > > and so this appears to indicate a pending return to execute the > > "mfsprg2 %r29" after "bl restore_kernsrs", which indicates that > > restore_kernsrs should be active. > > > > But register srr0 indicates: 0x102720 k_trap+0x28. (So apparently > > in FRAME_SETUP(PC_TEMPSAVE) someplace.) > > > > So it appears to me that the processor got to the k_trap code > > during the supposed restore_kernsrs time frame. (But I'm no expert > > at these sorts of things or for the processor.) > > > > I'll list the other register values: > > > > r0: 0 > > r1: 0 > > r2: 0xc1be80 M_AUDITBSM > > r3: 0xb16138 > > r4: 0x8926e8 .ofwcall+0xa8 > > r5: 0 > > r6: 0xbb5f90 > > r7: 0xe3d118 ofw_real_mode > > r8: 0x1 > > r9: 0xe0ce80 __pcpu > > r10: 0x1c35ec9 > > r11: 0 > > r12: 0x10000000 > > r13: db890 thread0 > > r14-r19: all 0 > > r20: 0x10bc000 > > r21: 0x4 > > r22: 0x1801db4 > > r23: 0x1803a28 > > r24: 0xc000000000008760 > > r25: 0xcc6908 smp_no_rendevous_barrier > > r26: 0xec79e0 ofw_rendezvous_dispatch (yep one has v and the other > > zv) r27: 0x8926e8 .ofwcall+0xa8 > > r28: 0x8926e8 .ofwcall+0xa8 (yep: same value) > > r29: 0x24000022 > > r30: 0x9000000000001032 > > r31: 0xc7f488 vop_unlock_desc > > > > ctr: 0xff846d78 > > cr: 0x2000d7b0 > > xer: 0 > > dar: 0xfffffffffffffd50 > > dsisr: 0x42000000 > > > > (Hopefully this manual transcription from the screen display is > > complete --and also accurate for what it does present.) > > > > > > > > > > The personal HACK to /usr/src/sys/ddb/db_script.c's > > db_script_kdbenter(...) to have it show registers and try bt... > > > > $ cd /usr/src/sys/ddb/ > > $ svnlite diff . > > Index: db_script.c > > =================================================================== > > --- db_script.c(revision 271610) > > +++ db_script.c(working copy) > > @@ -319,10 +319,25 @@ > > { > > char scriptname[DB_MAXSCRIPTNAME]; > > > > +/* HACK!!! : Additional lines to force a basic default script to > > exist. +* Will dump information even if ddb input is not available > > for early crash. > > +* Used to get more information about PowerMac G5 "before > > Copyright" hangs. > > +*/ > > +struct ddb_script *dsp = > > db_script_lookup(DB_SCRIPT_KDBENTER_DEFAULT); +if (!dsp) > > db_script_set(DB_SCRIPT_KDBENTER_DEFAULT, "show registers; bt"); > > + > > snprintf(scriptname, sizeof(scriptname), "%s.%s", > > DB_SCRIPT_KDBENTER_PREFIX, eventname); > > if (db_script_exec(scriptname, 0) == ENOENT) > > (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0); > > + > > +/* HACK!!! : Additional lines to always use the default script, > > +* even if scriptname existed and was executed. > > +* Will dump information even if ddb input is not available for > > early crash. > > +* Used to get more information about PowerMac G5 "before > > Copyright" hangs. > > +*/ > > +else > > +(void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0); > > } > > > > /*- > > > > > > > > === > > Mark Millard > > markmi at dsl-only.net > > > > On Sep 16, 2014, at 9:28 PM, Mark Millard > > wrote: > > > > In part I sent directly to you because of a past exchange (July-27) > > where you had written: > > > >> Nathan and I both speculate that it's > >> dropping into Open Firmware (we make extensive use of OFW), and > >> then messing something up, taking a page fault or something. > > > > The specific text that I report and its uniformity when it is > > produced seems to add a little information beyond a speculated > > "page fault or something" and so might eventually help a little. As > > I understand the text it is reporting execution reaching address > > zero without any prior un-handled exceptions or other such that > > would stop it. A corrupted stack (pointer) so a bad return address > > or some such? I'd guess there are no explicit jumps to address zero > > so I expect that indirection is likely involved, with the content > > for the indirection messed up. > > > > I really wish that I had a logic analyzer configuration for this. > > I've not found a way to make the failing context visible so far and > > the extra way of looking at things might have helped. > > > > > > > > > > === > > Mark Millard > > markmi@dsl-only.net > > > > On Sep 16, 2014, at 8:28 PM, Justin Hibbits > > wrote: > > > > Hi mark, > > > > I see this on my G5, and I think it's due to the amount of RAM in > > the machine. More than 4gb seems to confuse open firmware when > > called by FreeBSD. There is some effort to remove the need of the > > callbacks but thus far it's not far along. The good news is that > > after it boots it's solid except when switching vtys, buy earlier > > this year or last year I added a sysctl hack to disable the call > > into open firmware on vty switch (don't recall offhand and not at > > my computer right now, but if you grep the sysctl output for reset > > and ofw you can find it). > > > > -Justin > > > > On Sep 16, 2014 8:01 PM, "Mark Millard" > > wrote: > > > > I've now spent time with rebooting and power-off/power-on for > > all 3 PowerMac G5's (one PowerMac7,2 and two PowerMac11,2's) and > > all 3 get the > > > >> GDB: no debug ports present > >> KDB: debugger backends: DDB > >> KDB: current backend: DDB > >> [ thread pid -1 tid 1006665719 ] > >> Stopped at 0: illegal instruction 0 > >> db> > > > > when they fail just before the Copyright notice would normally > > be displayed. None fail any earlier. At that spot none have failed > > any other way. It is the same SSD in all 3. (Happens with other > > SSD's as well.) Overall there is a mix of Radeon and NVIDIA > > display boards. Besides the SSD use and RAM upgrades the rest is > > stock equipment. scons used, not vt. (I've yet to try vt.) > > > > Seeing a failure after the Copyright notice as been fairly rare > > in all my experiments from when I started last April or so. The ones > > that I've noted had Data Storage Interrupt reported. So far no > > examples of the above have been reported after the Copyright > > notice. So I'd guess that they are separate issues. Of course it > > seems that only in the last few days would I have seen the above > > sort of thing if it did happen after the Copyright notice: The > > prior history does not count for judgements about that. > > > > === > > Mark Millard > > markmi at dsl-only.net > > > > On Sep 16, 2014, at 8:15 AM, Mark Millard > > wrote: > > > > Using 10.1-BETA1 I added "options DDB" and "options GDB" to > > powerpc64's GENERIC64. (I also used WITH_DEBUG_FILES=, > > WITHOUT_CLANG=, and WITH_DEBUG= in /etc/make.conf.) So > > buildworld, kernel was basically just set up to have more of a > > debugging context around (including for any ports builds). > > > > The result was new information about the PowerMac G5 boot > > hangups: The screen is no longer blank when the G5 is hung up > > without there being a Copyright notice yet. It says... > > > >> GDB: no debug ports present > >> KDB: debugger backends: DDB > >> KDB: current backend: DDB > >> [ thread pid -1 tid 1006665719 ] > >> Stopped at 0: illegal instruction 0 > >> db> > > > > (I had no ability to input at that point.) Normally the > > Copyright notice would have displayed instead of "[...]" and what > > follows. (I do not claim to have all the spacing, capitalization, > > and such correct above.) > > > > That text is constant from hang to hang when it hangs just > > before it would normally output the Copyright notice: The numbers > > do not vary, much less the other text. It has never failed until > > after the two KDB messages are present. So far I've only tested one > > PowerMac G5, booting over and over for a few hours. > > > > > > > > (I do not claim to be set up for remote kernel debugging. I just > > decided to let GDB go along for the ride when I added DDB.) > > > > === > > Mark Millard > > markmi at dsl-only.net > > > > > > > > > > > > >