Date: Mon, 10 Jun 2013 08:31:37 -0500
From: Nathan Whitehorn <nwhitehorn@freebsd.org>
To: Justin Hibbits <jhibbits@freebsd.org>
Cc: FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject: Re: Strange panic on ppc64
Message-ID: <51B5D539.8050102@freebsd.org>
In-Reply-To: <51B5D28C.505@freebsd.org>
References: <CAHSQbTAZTc9puGaH0rbhyY11s0%2BL0xGjSabK1kj65UMm1t7j3w@mail.gmail.com>
 <51AF6661.3060007@freebsd.org>
 <CAHSQbTBjza0u7nZf4z%2BxpTCcWj-TW-ZigV2-CZexuBOYQX5=3A@mail.gmail.com>
 <CAHSQbTCvFXDZPsOnmogc0FkZeMXwOP6h40F2kFUu2s6UmffyPw@mail.gmail.com>
 <51B345BE.5030905@freebsd.org>
 <CAHSQbTDnwne3KJWN7xjcUw4PhF-uiD4B-4y1Lf90Bfou-2Ppvw@mail.gmail.com>
 <51B4A389.4020607@freebsd.org>
 <CAHSQbTACtejaRKiG4qScSV_EdTC8y_k5Qghx_FYebWzstBP61g@mail.gmail.com>
 <51B5D28C.505@freebsd.org>

On 06/10/13 08:20, Nathan Whitehorn wrote:
> On 06/09/13 16:21, Justin Hibbits wrote:
>> On Sun, Jun 9, 2013 at 8:47 AM, Nathan Whitehorn
>> <nwhitehorn@freebsd.org <mailto:nwhitehorn@freebsd.org>> wrote:
>>
>> On 06/08/13 17:33, Justin Hibbits wrote:
>>>
>>> On Sat, Jun 8, 2013 at 7:54 AM, Nathan Whitehorn
>>> <nwhitehorn@freebsd.org <mailto:nwhitehorn@freebsd.org>> wrote:
>>>
>>> On 06/08/13 09:21, Justin Hibbits wrote:
>>>>
>>>> On Wed, Jun 5, 2013 at 9:47 AM, Justin Hibbits
>>>> <jhibbits@freebsd.org <mailto:jhibbits@freebsd.org>> wrote:
>>>>
>>>> Will do, when I get it panicking again.
>>>>
>>>> - Justin
>>>>
>>>> On Jun 5, 2013 9:46 AM, "Nathan Whitehorn"
>>>> <nwhitehorn@freebsd.org <mailto:nwhitehorn@freebsd.org>>
>>>> wrote:
>>>>
>>>> On 06/04/13 22:35, Justin Hibbits wrote:
>>>>
>>>> After a string of seemingly random hangs, I added invariants
>>>> (but not witness) to my custom kernel config, and I get the
>>>> following panic, recreated from a fuzzy cell phone picture:
>>>>
>>>> [thread pid -1 tid 1006665719 ]
>>>> Stopped at 0: illegal instruction 0
>>>> db> panic: mutex ohci1 owned at
>>>> /usr/home/chmeee/freebsd/head/sys/dev/usb/usb_transfer.c:2280
>>>> cpuid = 0
>>>> Uptime: 9h8m1s
>>>> <my dump code>
>>>> ...
>>>> panic: msleep1
>>>> cpu = 0
>>>> KDB: enter: panic
>>>> [ thread pid -1 tid 100665719 ]
>>>> ....
>>>>
>>>> The first question I have is how the hell it got such a strange
>>>> PID/TID, memory corruption my guess, something is stomping on
>>>> the pcpu or something, and I think these hangs have only
>>>> happened since I added a lot more memory (up to 12G from 4G,
>>>> Andreas Tobler was seeing hangs as well), so it might be
>>>> something in the moea64 pmap code, but that's pure speculation
>>>> on my part. Then the other panic messages, owned mutex and
>>>> panic in msleep1. I enabled more trace code, so hopefully the
>>>> next time it panics I can collect better data.
>>>>
>>>> - Justin
>>>> _______________________________________________
>>>> freebsd-ppc@freebsd.org
>>>> <mailto:freebsd-ppc@freebsd.org> mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-ppc
>>>> To unsubscribe, send any mail to
>>>> "freebsd-ppc-unsubscribe@freebsd.org
>>>> <mailto:freebsd-ppc-unsubscribe@freebsd.org>"
>>>>
>>>> Could you post the output from show reg? It looks
>>>> like it tried to jump to a null pointer there.
>>>> -Nathan
>>>>
>>>> Well, it's hard to do get that output, because I just hit
>>>> that 'mutex owned' panic, and here's the backtrace:
>>>
>>> The mutex thing is spurious -- it was already panicing and
>>> then paniced again trying to panic. Can you get the backtrace
>>> for the original panic (it should be different) and the
>>> values of the registers?
>>> -Nathan
>>>
>>> Here you go:
>>>
>>> [ thread pid -1 tid 1006665719 ]
>>> Stopped at 0: illegal instruction 0
>>> db:0:kdb.enter.default> show reg
>>> r0      0
>>> r1      0
>>> r2      0xab63d0 M_MACTEMP
>>> r3      0xbb12e0
>>> r4      0x741f18 .ofwcall+0xa8
>>> r5      0
>>> r6      0xa4f1a8
>>> r7      0x1
>>> r8      0x1
>>> r9      0xc10500 __pcpu
>>> r10     0x1c35ec0
>>> r11     0
>>> r12     0x2000d032
>>> r13     0x342eb000
>>> r14     0x10014200
>>> r15     0xffffffffffffcb58
>>> r16     0x2
>>> r17     0x2
>>> r18     0xffffffffffffcb50
>>> r19     0
>>> r20     0xc000000013231478
>>> r21     0xc00000014c0ce200
>>> r22     0
>>> r23     0x64 dbsize+0x10
>>> r24     0xc00000014c0cdf70
>>> r25     0xb62cb8 smp_no_rendevous_barrier
>>> r26     0
>>> r27     0x741f18 .ofwcall+0xa8
>>> r28     0x741f18 .ofwcall+0xa8
>>> r29     0x2000d032
>>> r30     0x9000000000001032
>>> r31     0xc0cad8 mac_labeled
>>> srr0    0x102ca4 k_trap+0x28
>>> srr1    0x9000000000001032
>>> lr      0x102c74 u_trap+0x10
>>> ctr     0xff846d78
>>> cr      0x2000f1b0
>>> xer     0
>>> dar     0xfffffffffffffd60
>>> dsisr   0x42000000
>>> 0: illegal instruction 0
>>> db:0:kdb.enter.default> bt
>>> Tracing pid -1 tid 1006665719 td 0
>>> (nothing)
>> Well, that is all kinds of messed up. It appears to have halted
>> while handling a userland trap due to an implicit branch caused by
>> bad translations when it restores the kernel SRs. Could you see
>> what 'show pcpu' does? Does that information look valid at all? I
>> suspect it has become corrupted somehow.
>> -Nathan
>>
>> Here's the full log from dconschat, from bootup to panic.
>> Unfortunately, not everything I wanted to print would print, and I
>> can't type anything once it panics, because it panics when reading the
>> keyboard, so I have to add everything as a ddb enter script. Here's
>> what I've added so far (doesn't do everything as you can see from the
>> transcript):
>>
>> script kdb.enter.default=show reg; bt; show pcpu; ps; run lockinfo; alltrace; show all procs; show files; show malloc; show allchains
>>
>> - Justin
> This is now getting interesting. Reading the tea leaves, what has
> happened is that the kernel has called into Open Firmware. Open Firmware
> has then crashed early on, before setting up its own trap handlers,
> which has then flung you back into FreeBSD's handlers with a totally
> bogus environment, causing a second panic, which then causes a *third*
> panic when trying to acquire a lock. It would be interesting to know
> what the OF environment looked like and what commands it was trying to
> execute (in r3), but that may be tricky to get...
> -Nathan

One other point: you can trace this pretty easily by putting something
like:

    if (pmap_bootstrapped)
        printf("Open Firmware call %p\n", args);

at the top of openfirmware(). If I understood the debugger output
correctly, something should be making a firmware call immediately before
the crash. As a random guess about what is happening, it is possible OF
is trying to allocate memory for itself. At present we just ignore the
possibility that it might want to do that, but that is not necessarily
a good assumption.
-Nathan
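
For illustration only, here is a user-space sketch of the tracing idea above, extended to print the requested service name as well as the argument pointer. It is not the FreeBSD kernel code. The struct of_args layout (service name, argument count, return count) follows the usual Open Firmware client-interface convention, but cell_t, the pmap_bootstrapped stand-in, of_trace_format() and openfirmware_traced() are hypothetical names invented for this example, and the real firmware entry is stubbed out.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel state; NOT the real FreeBSD definitions. */
typedef uintptr_t cell_t;          /* assumption: a cell can hold a pointer here */
static int pmap_bootstrapped = 1;  /* assumption: set once the pmap is up */

/* Common prefix of an Open Firmware client-interface argument array. */
struct of_args {
	cell_t name;      /* pointer to the NUL-terminated service name */
	cell_t nargs;     /* number of input cells that follow */
	cell_t nreturns;  /* number of return cells that follow */
};

/* Format one trace line for a firmware call into buf; returns snprintf's result. */
static int
of_trace_format(char *buf, size_t len, void *args)
{
	const struct of_args *a = args;

	return (snprintf(buf, len,
	    "Open Firmware call %p: service \"%s\", %lu args, %lu returns\n",
	    args, (const char *)a->name,
	    (unsigned long)a->nargs, (unsigned long)a->nreturns));
}

/* Hypothetical wrapper: log the call, then (in a real kernel) enter firmware. */
static int
openfirmware_traced(void *args)
{
	char line[160];

	if (pmap_bootstrapped) {
		of_trace_format(line, sizeof(line), args);
		fputs(line, stdout);
	}
	/* The real openfirmware() would switch context and call into OF here. */
	return (0);
}
```

Calling openfirmware_traced() with a struct of_args whose name field points at "claim" would print one line naming that service, so the last line logged before a crash identifies the firmware request in flight. In the kernel the name pointer may not be directly dereferenceable in every context, so treat printing it as an assumption to verify.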