Date: Sat, 20 May 2017 02:01:55 -0700 From: Mark Millard <markmi@dsl-only.net> To: Justin Hibbits <jhibbits@FreeBSD.org>, Nathan Whitehorn <nwhitehorn@freebsd.org> Cc: FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>, FreeBSD Current <freebsd-current@freebsd.org> Subject: Re: TARGET_ARCH=powerpc head -r317820 production-style kernel: periodic panics always in pid=11 (the Idle threads) Message-ID: <81F7A2D8-4C3A-426D-B957-9DC937006D88@dsl-only.net> In-Reply-To: <1F50B6D9-4E41-4367-860D-E2A0E13AE661@dsl-only.net> References: <831804AB-1BEB-40C7-BA8B-94DF07E314E5@dsl-only.net> <1F50B6D9-4E41-4367-860D-E2A0E13AE661@dsl-only.net>
next in thread | previous in thread | raw e-mail | index | archive | help
On 2017-May-19, at 9:42 PM, Mark Millard <markmi@dsl-only.net> wrote: > On 2017-May-9, at 2:00 PM, Mark Millard <markmi@dsl-only.net> wrote: >=20 > . . . >> fatal kernel trap: >> exception =3D 0x903a64e (unknown) >> srr0 =3D 0x7ff760 >> srr1 =3D 0xc1007c >> lr =3D 0x907f >> curthread =3D 0x147d6c0 >> pid =3D 11, comm =3D idle: cpu0 >> [ thread pid 11 tid 100003 ] >> Stopped at ffs_truncate+0x1080: stw r11, 0xf8(r31) >>=20 >> 1 contains (cpu1 instead of cpu0, so different tid): >>=20 >> fatal kernel trap: >> exception =3D 0x903a64e (unknown) >> srr0 =3D 0x7ff760 >> srr1 =3D 0xc1007c >> lr =3D 0x907f >> curthread =3D 0x147d360 >> pid =3D 11, comm =3D idle: cpu1 >> [ thread pid 11 tid 100004 ] >> Stopped at ffs_truncate+0x1080: stw r11, 0xf8(r31) >>=20 >> 1 contains: >=20 > I've discovered where to find the trapframe > in the vmcore.* files for these specific > examples with 0x903a64e as the exception > and such. >=20 > In the vmcore the memory image starts at > byte offset 0x1000. >=20 > To see the values reported the only > place in the image file to start that > produces those values at the offsets > for in side the powerpc trapframe is: >=20 > offset 0x1001 in the vmcore.* file. >=20 > So memory address 0x1 is being used > as the trapframe address when that > odd exception information is being > displayed. Yep: misaligned. >=20 > The decoding is not of the actual > trapframe: it is garbage that is > not to be believed. >=20 >=20 > Note: I lucked out after the above and > got a somewhat different odd trap information > that lead to actually getting a backtrace > that included the actual pid 11 cpu 1 kernel > thread stack bt associated with that odd > information display. Typo: That should have been "cpu 2". > I'll send a separate reply for that information > as it will take some transcribing from camera > pictures and such. As indicated, I got a different odd trap report that gave a backtrace. . . fatal user trap exception =3D 0x4210000 (unknown) srr0 =3D 0xc1007c09 srr1 =3D 0x3a64e80 lr =3D 0xc0807fc9 curthread =3D 0x147d000 pid =3D 11, comm =3D idle: cpu 2 Now at this point it attempted to db_print_loc_and_inst and got another exception (at offset +0x60 in the routine). So the backtrace has both the consequences of that and what lead up to that: an EXI trap was attempting to report trap frame information but was using a bad address for the supposed frame. The details of the backtrace: panic: data storage interrupt trap cpuid =3D 2 time =3D 145187154 KDB: stack backtrace 0xdf5ef2c0: at kdb_backtrace+0x5c 0xdf5ef3a0: at panic+0x54 0xdf5ef3f0: at trap_fatal+0x1cc 0xdf5ef420: at powerpc_interrupt+0x180 0xdf5ef5c0: kernel DSI read trap @ 0xc1007c09 by db_disasm+0x30: srr1=3D0x1032 r1 =3D0xdf5ef6b0 cr =3D0x24009022 xer =3D0 ctr =3D0x1852cc sr =3D0x40000000 0xdf5ef6b0: at 0x1007480 0xdf5ef6d0: at db_print_loc_and_inst+0x60 0xdf5ef700: at db_trap+0x104 0xdf5ef790: at kdb_trap+0x1bc 0xdf5ef810: at trap_fatal+0x1b0 0xdf5ef840: at trap+0x1184 0xdf5ef870: kernel EXI trap by cpu_idle_60x+0x88: srr1=3D0x1032 r1 =3D0xdf5ef930 cr =3D0x40000042 xer =3D0x20000000 ctr =3D0x8e3bd8 saved LR(0x2) is invalid. So an EXI trap was attempting to report a trap frame. (Note: the LR's for pid 11 cpu threads normally report an invalid LR in ddb.) The actual EXI trapframe starts at 013f0878 in vmcore.5: 013f0870 df 5e f9 30 00 10 08 f8 00 04 90 32 df 5e f9 30 = |.^.0.......2.^.0| 013f0880 01 47 d0 00 00 00 00 00 25 94 48 3f 00 00 00 00 = |.G......%.H?....| 013f0890 25 94 48 3f 00 4a a9 c8 00 00 00 00 00 00 00 44 = |%.H?.J.........D| 013f08a0 01 fc a0 55 00 00 90 32 df 5d 1d 00 00 00 00 00 = |...U...2.]......| 013f08b0 00 d4 bd ec 00 cb 98 98 00 c9 66 bc 00 c4 5d 08 = |..........f...].| 013f08c0 00 c9 66 bc 00 d4 c5 3c df 5e f9 e0 00 eb a7 80 = |..f....<.^......| 013f08d0 00 c9 66 bc 01 47 d0 00 df 5e f9 8c 00 00 00 06 = |..f..G...^......| 013f08e0 00 00 00 06 00 eb b5 80 00 00 00 00 00 8e 3b d8 = |..............;.| 013f08f0 00 d2 6b f0 df 5e f9 30 00 8e 3b f4 40 00 00 42 = |..k..^.0..;.@..B| 013f0900 20 00 00 00 00 8e 3b d8 00 8e 3c 60 00 00 90 32 | = .....;...<`...2| 013f0910 00 00 05 00 41 a1 d5 d4 42 00 00 00 00 00 00 00 = |....A...B.......| So: r0 =3D 0x00049032 r1 =3D 0xdf5ef930 r2 =3D 0x0147d000 r3 =3D 0x00000000 r4 =3D 0x2594483f r5 =3D 0x00000000 r6 =3D 0x2594483f r7 =3D 0x004aa9c8 r8 =3D 0x00000000 r9 =3D 0x00000044 r10 =3D 0x01fca055 r11 =3D 0x00009032 r12 =3D 0xdf5d1d00 r13 =3D 0x00000000 r14 =3D 0x00d4bdec r15 =3D 0x00cb9898 r16 =3D 0x00c966bc r17 =3D 0x00c45d08 r18 =3D 0x00c966bc r19 =3D 0x00d4c53c r20 =3D 0xdf5ef9e0 r21 =3D 0x00eba780 r22 =3D 0x00c966bc r23 =3D 0x1047d000 r24 =3D 0xdf5ef98c r25 =3D 0x00000006 (this value shows up later in a bad spot) r26 =3D 0x00000006 (this value shows up next to that) r27 =3D 0x00ebb580 r28 =3D 0x00000000 r29 =3D 0x008e3bd8 r30 =3D 0x00d26bf0 r31 =3D 0xdf5ef930 lr =3D 0x008e3bf4 cr =3D 0x40000042 xer =3D 0x20000000 ctr =3D 0x008e3bd8 srr0 =3D 0x008e3c60 srr1 =3D 0x00009032 exc =3D 0x00000500 dar =3D 0x41a1d5d4 dsisr =3D 0x42000000 Other elements of the stack leading to this are: 013f0920 45 8b a7 b5 b1 fd 96 be 00 00 00 00 00 00 00 04 = |E...............| 013f0930 df 5e f9 50 00 00 00 06 00 00 00 06 00 eb b5 80 = |.^.P............| (odd lr value; value repeats above, not ; here. r25/r26 a multiple ; pair saved of 4 even. ; in those? matches r25 ; matches r26 in trapframe; in trapframe) 013f0940 00 00 00 00 00 d4 ca 34 00 d2 6b f0 df 5e f9 50 = |.......4..k..^.P| 013f0950 df 5e f9 70 00 8e 31 7c 00 00 00 02 00 eb b5 80 = |.^.p..1|........| 013f0960 00 f2 d5 fc 00 00 00 06 00 d1 ca ac df 5e f9 70 = |.............^.p| 013f0970 df 5e fa 50 00 53 6e 58 df 5e f9 80 00 00 00 00 = |.^.P.SnX.^......| . . . 013f0a50 df 5e fa 80 00 4a 3c b4 df 5e fa 60 fa 50 05 af = |.^...J<..^.`.P..| 013f0a60 df 5e fa 80 00 00 00 00 00 00 00 00 00 00 00 00 = |.^..............| 013f0a70 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 = |................| 013f0a80 00 00 00 00 00 8f 18 90 00 53 69 84 00 00 00 00 = |.........Si.....| =46rom which I get from the stack backpointer/lr-value pairs(via objdump on the kernel): 00000006: ???????????????????? 008e317c: cpu_idle+0x58 00536e58: scheduletd+0x4d4 004a3cb4: fork_exit+0xf8 008f1890: fork_trampoline+0x10 But cpu_idle+0x58 is the bl ,cpu_activeclock> in the below and the prior bctrl after the bl <cpu_idleclock> is what made the call. 008e3124 <cpu_idle> stwu r1,-32(r1) 008e3128 <cpu_idle+0x4> mflr r0 008e312c <cpu_idle+0x8> stw r29,20(r1) 008e3130 <cpu_idle+0xc> stw r30,24(r1) 008e3134 <cpu_idle+0x10> stw r31,28(r1) 008e3138 <cpu_idle+0x14> stw r0,36(r1) 008e313c <cpu_idle+0x18> mr r31,r1 008e3140 <cpu_idle+0x1c> bcl- 20,4*cr7+so,008e3144 <cpu_idle+0x20> 008e3144 <cpu_idle+0x20> mflr r30 008e3148 <cpu_idle+0x24> lwz r0,-36(r30) 008e314c <cpu_idle+0x28> add r30,r0,r30 008e3150 <cpu_idle+0x2c> lwz r29,-32768(r30) 008e3154 <cpu_idle+0x30> lwz r0,0(r29) 008e3158 <cpu_idle+0x34> cmpwi cr7,r0,0 008e315c <cpu_idle+0x38> beq- cr7,008e3198 <cpu_idle+0x74> 008e3160 <cpu_idle+0x3c> cmpwi cr7,r3,0 008e3164 <cpu_idle+0x40> bne- cr7,008e3188 <cpu_idle+0x64> 008e3168 <cpu_idle+0x44> bl 005002a4 <critical_enter> 008e316c <cpu_idle+0x48> bl 008ad894 <cpu_idleclock> 008e3170 <cpu_idle+0x4c> lwz r29,0(r29) 008e3174 <cpu_idle+0x50> mtctr r29 008e3178 <cpu_idle+0x54> bctrl 008e317c <cpu_idle+0x58> bl 008ad794 <cpu_activeclock> But ctr was reported as: 0x8e3bd8 which is cpu_idle_60x. (So no surprises so far.) The code through the reported cpu_idle_60x+0x88 is: 008e3bd8 <cpu_idle_60x> stwu r1,-32(r1) 008e3bdc <cpu_idle_60x+0x4> mflr r0 008e3be0 <cpu_idle_60x+0x8> stw r30,24(r1) 008e3be4 <cpu_idle_60x+0xc> stw r31,28(r1) 008e3be8 <cpu_idle_60x+0x10> stw r0,36(r1) 008e3bec <cpu_idle_60x+0x14> mr r31,r1 008e3bf0 <cpu_idle_60x+0x18> bcl- 20,4*cr7+so,008e3bf4 = <cpu_idle_60x+0x1c> 008e3bf4 <cpu_idle_60x+0x1c> mflr r30 008e3bf8 <cpu_idle_60x+0x20> lwz r0,-32(r30) 008e3bfc <cpu_idle_60x+0x24> add r30,r0,r30 008e3c00 <cpu_idle_60x+0x28> lwz r9,-32756(r30) 008e3c04 <cpu_idle_60x+0x2c> lwz r0,0(r9) 008e3c08 <cpu_idle_60x+0x30> cmpwi cr7,r0,0 008e3c0c <cpu_idle_60x+0x34> beq- cr7,008e3c78 <cpu_idle_60x+0xa0> 008e3c10 <cpu_idle_60x+0x38> mfmsr r11 008e3c14 <cpu_idle_60x+0x3c> mfsprg r0,7 008e3c18 <cpu_idle_60x+0x40> rlwinm r9,r0,16,16,31 008e3c1c <cpu_idle_60x+0x44> cmpwi cr7,r9,68 008e3c20 <cpu_idle_60x+0x48> beq- cr7,008e3c4c <cpu_idle_60x+0x74> 008e3c24 <cpu_idle_60x+0x4c> cmplwi cr7,r9,68 008e3c28 <cpu_idle_60x+0x50> bgt- cr7,008e3c40 <cpu_idle_60x+0x68> 008e3c2c <cpu_idle_60x+0x54> cmpwi cr7,r9,57 008e3c30 <cpu_idle_60x+0x58> beq- cr7,008e3c4c <cpu_idle_60x+0x74> 008e3c34 <cpu_idle_60x+0x5c> cmpwi cr7,r9,60 008e3c38 <cpu_idle_60x+0x60> bne+ cr7,008e3c64 <cpu_idle_60x+0x8c> 008e3c3c <cpu_idle_60x+0x64> b 008e3c4c <cpu_idle_60x+0x74> 008e3c40 <cpu_idle_60x+0x68> addi r0,r9,-32768 008e3c44 <cpu_idle_60x+0x6c> cmplwi cr7,r0,4 008e3c48 <cpu_idle_60x+0x70> bgt- cr7,008e3c64 <cpu_idle_60x+0x8c> 008e3c4c <cpu_idle_60x+0x74> oris r0,r11,4 008e3c50 <cpu_idle_60x+0x78> dssall 008e3c54 <cpu_idle_60x+0x7c> sync =20 008e3c58 <cpu_idle_60x+0x80> mtmsr r0 008e3c5c <cpu_idle_60x+0x84> isync 008e3c60 <cpu_idle_60x+0x88> b 008e3c78 <cpu_idle_60x+0xa0> So that is the context for the EXI trap. The "mtsmr r0" merged in PSL_POW to the msr value when it was originally not set. (r11 vs. r0 value comparison in the trapframe.) Why it would be going from without POW for so long and then merging in POW I do not know. (Or may be I just did not find code that turns POW back off on occasion.) Still setting POW in msr seems to be what is unique to the failure context. Mkes me wonder if smu_doorbell_intr and its irq is somehow involved. There is no obvious, local tie to r25 and r26 in/for the spots identified earlier for the 013f0930 line of the thread stack area. Interestingly when I looked up POW in mtsmsr use what I found reported: Synchronization Required Prior: implementation dependent Synchronization Required After: implementation dependent The source code looks like: static void cpu_idle_60x(sbintime_t sbt) { register_t msr; uint16_t vers; if (!powerpc_pow_enabled) return; msr =3D mfmsr(); vers =3D mfpvr() >> 16; #ifdef AIM switch (vers) { case IBM970: case IBM970FX: case IBM970MP: case MPC7447A: case MPC7448: case MPC7450: case MPC7455: case MPC7457: __asm __volatile("\ dssall; sync; mtmsr %0; isync" :: "r"(msr | PSL_POW)); break; default: powerpc_sync(); mtmsr(msr | PSL_POW); isync(); break; } #endif } I've not yet figured out how to check that the details are right here for IBM970MP used via 32-bit FreeBSD. =3D=3D=3D Mark Millard markmi at dsl-only.net
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?81F7A2D8-4C3A-426D-B957-9DC937006D88>