Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 29 May 2017 21:11:35 -0700
From:      Mark Millard <markmi@dsl-only.net>
To:        Justin Hibbits <jhibbits@FreeBSD.org>, Nathan Whitehorn <nwhitehorn@freebsd.org>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: A good backtrace from a head -r317820 powerpc random/periodic panic: execution of garbage at 0x0090a030 (in .hash section) [more analysis; corrections]
Message-ID:  <BE41E1EF-3A45-4B53-B762-1878272AFB15@dsl-only.net>
In-Reply-To: <83A22794-AD54-4D27-951A-986E6FA693DB@dsl-only.net>
References:  <1CE8346B-04F3-48AB-A3E9-6DF3B86B8D1A@dsl-only.net> <8C88BB6F-E747-42A1-9DDC-35EC6D865141@dsl-only.net> <83A22794-AD54-4D27-951A-986E6FA693DB@dsl-only.net>

next in thread | previous in thread | raw e-mail | index | archive | help
On 2017-May-27, at 4:40 AM, Mark Millard <markmi@dsl-only.net> wrote:

> [I compare a new panic's register values
> to prior panic's values.]
>=20
> On 2017-May-26, at 10:29 PM, Mark Millard <markmi@dsl-only.net> wrote:
>=20
>> [Additional information that does not need to
>> interlace with the prior material, so see
>> after. A somewhat better backtrace reported
>> by ddb. And so on.]
>>=20
>>=20
>> On 2017-May-26, at 7:14 PM, Mark Millard <markmi at dsl-only.net> =
wrote:
>>=20
>>> I lucked out and got a vmcore.9 for a random
>>> panic that I could manage to backtrace for
>>> one of my test builds of -r317820. It appears
>>> that not all that much happened before it got
>>> the panic so much context was better preserved
>>> this time.
>>>=20
>>> (I do not explore from ddb as I've had that
>>> panic and mess up the dump just made by
>>> replacing it. So this is a manual backtrace
>>> from the debug.minidump=3D0 style vmcore.9
>>> file. objdump was used on the
>>> /boot/kernel/kernel to find code.)
>>>=20
>>> Being able to see the problem is very
>>> sensitive to kernel memory layout. This
>>> is why I'm sticking with -r317820 built
>>> production style: the kind of context
>>> the problem was first observed in.
>>> Attempting a debug kernel build simply
>>> did not repeat the problem for days
>>> (vs. the usual hours for builds like
>>> this).
>>>=20
>>>=20
>>> So below is the backtrace:
>>>=20
>>> (I do not show what trap_fatal calls:
>>> starting with trap_fatal and going toward
>>> larger memory addresses. . .)
>>>=20
>>> [vmcore.9's
>>> offset
>>> in file
>>> when no
>>> 0x prefix]
>>>=20
>>> 013e83b0  df 5e 55 50 00 8f 34 5c  fa 50 05 af fa 50 05 af  =
|.^UP..4\.P...P..|
>>> 0x008f3454 <trap+0x1228> mr      r3,r26
>>> 0x008f3458 <trap+0x122c> bl      008f2030 <trap_fatal>
>>> 0x008f345c <trap+0x1230> b       008f34c8 <trap+0x129c>
>>>=20
>>> 013e83c0  fa 50 05 af fa 50 05 af  fa 50 05 af fa 50 05 af  =
|.P...P...P...P..|
>>> *
>>> 013e83e0  df 5e 54 00 fa 50 05 af  fa 50 05 af fa 50 05 af  =
|.^T..P...P...P..|
>>> 013e83f0  fa 50 05 af fa 50 05 af  00 d1 ca ac df 5e 54 00  =
|.P...P.......^T.|
>>> 013e8400  df 5e 54 20 00 53 3a d0  fa 50 05 af fa 50 05 af  |.^T =
.S:..P...P..|
>>> 013e8410  fa 50 05 af fa 50 05 af  00 d1 ca ac df 5e 54 20  =
|.P...P.......^T |
>>> 013e8420  df 5e 54 d0 00 53 3a d0  00 00 00 00 00 00 00 00  =
|.^T..S:.........|
>>> 013e8430  00 00 00 00 00 00 00 00  00 d1 ca ac 00 00 00 0f  =
|................|
>>> 013e8440  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  =
|................|
>>> *
>>> 013e8460  00 00 00 54 7f ff ff ff  00 00 00 00 ff ff ff aa  =
|...T............|
>>> 013e8470  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  =
|................|
>>> *
>>> 013e8490  00 d4 c4 5c 00 d4 c4 5c  00 17 99 dd 00 17 9a 5c  =
|...\...\.......\|
>>> 013e84a0  00 00 17 9a 00 d4 c4 6c  df 5e 54 ec 00 00 00 54  =
|.......l.^T....T|
>>> 013e84b0  df 5e 54 d0 7f ff ff ff  05 aa 36 c0 05 aa 39 e8  =
|.^T.......6...9.|
>>> 013e84c0  00 00 00 00 00 00 00 00  00 d1 ca ac df 5e 54 d0  =
|.............^T.|
>>> 013e84d0  df 5e 55 80 00 53 3a d0  05 91 d0 00 05 91 d3 28  =
|.^U..S:........(|
>>> 013e84e0  df 5e 55 00 00 00 00 00  00 d1 ca ac 00 00 00 0f  =
|.^U.............|
>>> 013e84f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  =
|................|
>>> 013e8500  00 00 00 00 00 00 00 00  00 d4 bd ec 00 cb 98 98  =
|................|
>>> 013e8510  00 d4 c4 5c 00 d4 c4 5c  00 17 99 f8 00 17 99 f8  =
|...\...\........|
>>> 013e8520  00 00 17 99 fa 69 b8 79  00 17 9a 0a 00 00 17 99  =
|.....i.y........|
>>> 013e8530  f8 1c 2f cc df 5e 55 88  01 47 d6 c0 01 43 a2 00  =
|../..^U..G...C..|
>>> 013e8540  41 eb 30 00 0a 00 00 00  00 d2 6e 4c df 5e 55 50  =
|A.0.......nL.^UP|
>>>=20
>>> 013e8550  df 5e 55 80 00 8e 7d 40  df 5e 55 9c 00 00 00 28  =
|.^U...}@.^U....(|
>>> 0x008e7d28 <powerpc_interrupt+0x184> mfmsr   r0
>>> 0x008e7d2c <powerpc_interrupt+0x188> or      r0,r0,r9
>>> 0x008e7d30 <powerpc_interrupt+0x18c> mtmsr   r0
>>> 0x008e7d34 <powerpc_interrupt+0x190> isync
>>> 0x008e7d38 <powerpc_interrupt+0x194> mr      r3,r25
>>> 0x008e7d3c <powerpc_interrupt+0x198> bl      008f222c <trap>
>>> 0x008e7d40 <powerpc_interrupt+0x19c> lwz     r11,0(r1)
>>>=20
>>> 013e8560  00 00 00 00 00 00 17 9a  06 40 c2 a8 01 43 a2 00  =
|.........@...C..|
>>> 013e8570  41 eb 30 00 0a 00 00 00  00 00 00 00 00 08 10 32  =
|A.0............2|
>>>=20
>>>                                    r0          r1
>>> 013e8580  df 5e 56 40 00 10 08 f8  00 00 00 04 df 5e 56 40  =
|.^V@.........^V@|
>>> The struct trapframe starts is 0x 013e8588 in the
>>> vmcore.9 file. The 0x00108f8 is as shown below:
>>>=20
>>> 0x001008ec <k_trap+0x118> isync
>>> 0x001008f0 <trapagain> addi    r3,r1,8
>>> 0x001008f4 <trapagain+0x4> bl      008e7ba4 <powerpc_interrupt>
>>> 0x001008f8 <trapexit> mfmsr   r3
>>> 0x001008fc <trapexit+0x4> andi.   r3,r3,32767
>>>=20
>>> 013e8590  01 47 d6 c0 00 00 00 28  01 47 c6 c0 00 00 00 04  =
|.G.....(.G......|
>>>             r2          r3           r4         r5
>>>=20
>>> 013e85a0  00 00 00 04 00 00 00 0f  00 00 00 00 00 d4 c0 3c  =
|...............<|
>>>             r6          r7           r8         r9
>>>=20
>>> 013e85b0  01 47 d6 c0 df 5e 56 80  14 3d 60 da 00 00 00 00  =
|.G...^V..=3D`.....|
>>>            r10         r11          r12         r13
>>>=20
>>> 013e85c0  00 d4 bd ec 00 cb 98 98  00 d4 c4 5c 00 d4 c4 5c  =
|...........\...\|
>>>            r14         r15          r16         r17
>>>=20
>>> 013e85d0  00 17 99 f8 00 17 99 f8  00 00 17 99 fa 69 b8 79  =
|.............i.y|
>>>            r18         r19          r20         r21
>>>=20
>>> 013e85e0  00 17 9a 0a 00 00 17 99  f8 1c 2f cc 00 00 17 9a  =
|........../.....|
>>>            r22         r32          r24         r25
>>>=20
>>> 013e85f0  06 40 c2 a8 01 43 a2 00  00 eb a7 80 01 47 d6 c0  =
|.@...C.......G..|
>>>            r26         r27          r28         r29
>>>=20
>>> 013e8600  00 d1 ca ac df 5e 56 40  00 53 5a d0 20 00 90 44  =
|.....^V@.SZ. ..D|
>>>            r30         r31           lr          cr
>>>=20
>>>            xer         ctr          srr0     srr1
>>> 013e8610  00 00 00 00 00 00 00 00  00 90 a0 30 00 08 10 32  =
|...........0...2|
>>> Note: objdump shows no code at 0x0090a030. 0x0090a030 is
>>> inside what objdump -x reports as the section .hash (Idx
>>> 2).
>>>=20
>>>            exc         dar         dsisr    (booke dbcer0)
>>> 013e8620  00 00 07 00 41 eb 30 00  0a 00 00 00 01 47 c6 c0  =
|....A.0......G..|
>>> The 0x00007000 above is the framep->exc (exception code)
>>> (program in this case).
>>>=20
>>> 013e8630  00 eb a7 80 01 47 60 10  00 d1 ca ac df 5e 56 40  =
|.....G`......^V@|
>>>=20
>>> At this point the above does not match the below
>>> part of the stack trace.
>>>=20
>>> The lr part of struct trapframe was: 0x00535ad0 so
>>> showing around that:
>>>=20
>>> 00535ab8 <sched_affinity> stwu    r1,-32(r1)
>>> 00535abc <sched_affinity+0x4> mflr    r0
>>> 00535ac0 <sched_affinity+0x8> stw     r29,20(r1)
>>> 00535ac4 <sched_affinity+0xc> stw     r30,24(r1)
>>> 00535ac8 <sched_affinity+0x10> stw     r31,28(r1)
>>> 00535acc <sched_affinity+0x14> stw     r0,36(r1)
>>> 00535ad0 <sched_affinity+0x18> mr      r31,r1
>>> 00535ad4 <sched_affinity+0x1c> mr      r29,r3
>>>=20
>>> Back to the stack backtrace. . .
>>>=20
>>> 013e8640  df 5e 56 80 00 53 59 dc  00 17 99 f8 00 17 99 f8  =
|.^V..SY.........|
>>> 0x005359c8 <sched_add+0x18c> bl      008ea420 <spinlock_exit>
>>> 0x005359cc <sched_add+0x190> mr      r3,r28
>>> 0x005359d0 <sched_add+0x194> mr      r4,r27
>>> 0x005359d4 <sched_add+0x198> mr      r5,r25
>>> 0x005359d8 <sched_add+0x19c> bl      005356ec <tdq_add>
>>> 0x005359dc <sched_add+0x1a0> mfsprg  r9,0
>>>=20
>>> I show around 0x005356ec
>>> from the sched_add+0x19c bl that is above
>>> because the routine is not referenced
>>> in the stack tracce but the above indicates
>>> that it should have been called:
>>> 0x005356ec <tdq_add> stwu    r1,-32(r1)
>>> 0x005356f0 <tdq_add+0x4> mflr    r0
>>> 0x005356f4 <tdq_add+0x8> stw     r28,16(r1)
>>> 0x005356f8 <tdq_add+0xc> stw     r29,20(r1)
>>> 0x005356fc <tdq_add+0x10> stw     r30,24(r1)
>>> 0x00535700 <tdq_add+0x14> stw     r31,28(r1)
>>> 0x00535704 <tdq_add+0x18> stw     r0,36(r1)
>>> . . .
>>>=20
>>> Back to the stack backtrace again. . .
>>>=20
>>> 013e8650  00 00 17 99 fa 69 b8 79  00 17 9a 0a 00 00 00 04  =
|.....i.y........|
>>> 013e8660  f8 1c 2f cc 00 00 17 9a  06 40 c2 a8 01 43 a2 00  =
|../......@...C..|
>>> 013e8670  01 47 c6 c0 01 47 60 10  00 d1 b4 30 df 5e 56 80  =
|.G...G`....0.^V.|
>>>=20
>>> 013e8680  df 5e 56 b0 00 4a 87 8c  00 d2 5b 10 00 00 00 04  =
|.^V..J....[.....|
>>> 0x004a8780 <intr_event_schedule_thread+0xc4> mr      r3,r28
>>> 0x004a8784 <intr_event_schedule_thread+0xc8> li      r4,4
>>> 0x004a8788 <intr_event_schedule_thread+0xcc> bl      0053583c =
<sched_add>
>>> 0x004a878c <intr_event_schedule_thread+0xd0> lwz     r9,0(r28)
>>>=20
>>> 013e8690  df 5e 56 b0 00 00 17 9a  06 40 c2 a8 01 43 a2 00  =
|.^V......@...C..|
>>> 013e86a0  00 00 00 00 01 46 81 c0  00 d1 b4 30 df 5e 56 b0  =
|.....F.....0.^V.|
>>>=20
>>> 013e86b0  df 5e 56 f0 00 4a 97 0c  00 00 00 00 00 00 00 04  =
|.^V..J..........|
>>> 0x004a9700 <swi_sched+0xa4> bl      005000ec <critical_exit>
>>> 0x004a9704 <swi_sched+0xa8> mr      r3,r27
>>> 0x004a9708 <swi_sched+0xac> bl      004a86bc =
<intr_event_schedule_thread>
>>> 0x004a970c <swi_sched+0xb0> lwz     r11,0(r1)
>>>=20
>>> 013e86c0  df 5e 56 e0 01 47 d6 c0  01 47 d6 c0 01 45 4d 40  =
|.^V..G...G...EM@|
>>> 013e86d0  df 5e 56 f0 00 8e a4 44  06 40 c2 a8 00 00 17 9a  =
|.^V....D.@......|
>>> 013e86e0  78 00 00 00 00 e9 56 00  00 d1 c8 20 df 5e 56 f0  =
|x.....V.... .^V.|
>>>=20
>>> 013e86f0  df 5e 57 50 00 51 79 6c  df 5e 58 78 01 47 d6 c0  =
|.^WP.Qyl.^Xx.G..|
>>> 0x00517960 <callout_process+0x420> lwz     r3,264(r29)
>>> 0x00517964 <callout_process+0x424> li      r4,0
>>> 0x00517968 <callout_process+0x428> bl      004a965c <swi_sched>
>>> 0x0051796c <callout_process+0x42c> lwz     r11,0(r1)
>>>=20
>>> 013e8700  01 47 d7 b8 00 00 00 00  00 d1 ab 24 00 00 00 04  =
|.G.........$....|
>>> 013e8710  00 c9 66 bc 00 c4 5d 48  00 c9 66 bc 00 d4 c5 3c  =
|..f...]H..f....<|
>>> 013e8720  00 d0 53 00 00 eb a7 80  00 00 00 01 00 00 00 00  =
|..S.............|
>>> 013e8730  df 5e 59 8c 00 00 00 00  df 5e 58 78 00 00 17 99  =
|.^Y......^Xx....|
>>> 013e8740  f8 1c 2f cc d0 01 dd 00  00 d2 5b 10 df 5e 57 50  =
|../.......[..^WP|
>>>=20
>>> 013e8750  df 5e 57 a0 00 8a b2 70  df 5e 57 60 df 5e 57 60  =
|.^W....p.^W`.^W`|
>>> 0x008ab264 <handleevents+0x2a4> mr      r3,r27
>>> 0x008ab268 <handleevents+0x2a8> mr      r4,r28
>>> 0x008ab26c <handleevents+0x2ac> bl      00517540 <callout_process>
>>> 0x008ab270 <handleevents+0x2b0> li      r3,0
>>>=20
>>> 013e8760  df 5e 57 a0 df 5e 58 78  05 86 37 00 00 00 00 04  =
|.^W..^Xx..7.....|
>>> 013e8770  00 00 00 00 05 9b f2 00  00 c9 66 bc 01 47 d6 c0  =
|..........f..G..|
>>> 013e8780  df 5e 59 8c 00 f6 1d 10  00 00 17 99 f8 1c 2f cc  =
|.^Y.........../.|
>>> 013e8790  d0 01 dd 00 d0 01 dd 30  00 d2 5b 10 df 5e 57 a0  =
|.......0..[..^W.|
>>>=20
>>> 013e87a0  df 5e 58 20 00 8a d1 10  00 d2 6e 5c df 5e 57 b0  |.^X =
......n\.^W.|
>>> 0x008ad100 <timercb+0x4b8> mr      r3,r26
>>> 0x008ad104 <timercb+0x4bc> mr      r4,r27
>>> 0x008ad108 <timercb+0x4c0> li      r5,0
>>> 0x008ad10c <timercb+0x4c4> bl      008aafc0 <handleevents>
>>> 0x008ad110 <timercb+0x4c8> lwz     r11,0(r1)
>>>=20
>>> 013e87b0  df 5e 57 e0 00 4a 96 00  00 00 17 99 00 00 00 00  =
|.^W..J..........|
>>> 013e87c0  f8 1c 2f cc 0a 3f a0 12  df 5e 58 78 05 86 37 00  =
|../..?...^Xx..7.|
>>> 013e87d0  01 48 b1 00 05 86 37 80  00 d4 bd ec 00 cb 98 98  =
|.H....7.........|
>>> 013e87e0  00 c9 66 bc 00 c4 5d 48  00 c9 66 bc 00 d4 c5 3c  =
|..f...]H..f....<|
>>> 013e87f0  df 5e 59 e0 00 eb a7 80  00 c9 66 bc 01 47 d6 c0  =
|.^Y.......f..G..|
>>> 013e8800  df 5e 59 8c df 5e 58 78  01 47 d6 c0 00 00 00 00  =
|.^Y..^Xx.G......|
>>> 013e8810  00 f6 1d 10 00 00 00 01  00 d2 6b c8 df 5e 58 20  =
|..........k..^X |
>>>=20
>>> 013e8820  df 5e 58 40 00 8e 1e 48  00 00 00 00 00 eb a7 80  =
|.^X@...H........|
>>> So 0x13e8820 from vmcore.9 has start of struct trapeframe.
>>> The 0x8e1e48 is from:
>>> 0x008e1e34 <decr_intr+0xe0> lwz     r0,56(r28)
>>> 0x008e1e38 <decr_intr+0xe4> mtctr   r0
>>> 0x008e1e3c <decr_intr+0xe8> mr      r3,r28
>>> 0x008e1e40 <decr_intr+0xec> lwz     r4,64(r28)
>>> 0x008e1e44 <decr_intr+0xf0> bctrl
>>> 0x008e1e48 <decr_intr+0xf4> addi    r29,r29,-1
>>>=20
>>> 013e8830  01 47 d7 94 00 00 00 01  00 d2 6e 4c df 5e 58 40  =
|.G........nL.^X@|
>>> 013e8840  df 5e 58 70 00 8e 7c 9c  00 d1 ca ac df 5e 58 50  =
|.^Xp..|......^XP|
>>> 013e8850  00 cd f0 74 00 00 00 01  00 00 00 01 00 eb a7 80  =
|...t............|
>>> 013e8860  41 eb 30 00 0a 00 00 00  00 00 00 00 00 00 90 32  =
|A.0............2|
>>> 013e8870  df 5e 59 30 00 10 08 f8  00 04 90 32 df 5e 59 30  =
|.^Y0.......2.^Y0|
>>> 013e8880  01 47 d6 c0 00 00 00 00  0d 0b bf e3 00 00 00 00  =
|.G..............|
>>> 013e8890  0d 0b bf e3 00 19 eb 7c  00 00 00 00 00 00 00 44  =
|.......|.......D|
>>> 013e88a0  01 fc a0 55 00 00 90 32  d0 01 dd 00 00 00 00 00  =
|...U...2........|
>>> 013e88b0  00 d4 bd ec 00 cb 98 98  00 c9 66 bc 00 c4 5d 48  =
|..........f...]H|
>>> 013e88c0  00 c9 66 bc 00 d4 c5 3c  df 5e 59 e0 00 eb a7 80  =
|..f....<.^Y.....|
>>> 013e88d0  00 c9 66 bc 01 47 d6 c0  df 5e 59 8c 00 00 00 01  =
|..f..G...^Y.....|
>>> 013e88e0  00 00 00 01 00 eb a7 80  00 00 00 00 00 8e 3b f8  =
|..............;.|
>>>=20
>>>                                     srr0
>>> 013e88f0  00 d2 6b f0 df 5e 59 30  00 8e 3c 14 40 00 00 42  =
|..k..^Y0..<.@..B|
>>> The 0x00833c14 from the trap frame is the srr0 (conceptual lr):
>>> 0x008e3c04 <cpu_idle_60x+0xc> stw     r31,28(r1)
>>> 0x008e3c08 <cpu_idle_60x+0x10> stw     r0,36(r1)
>>> 0x008e3c0c <cpu_idle_60x+0x14> mr      r31,r1
>>> 0x008e3c10 <cpu_idle_60x+0x18> bcl-    20,4*cr7+so,008e3c14 =
<cpu_idle_60x+0x1c>
>>> 0x008e3c14 <cpu_idle_60x+0x1c> mflr    r30
>>> 0x008e3c18 <cpu_idle_60x+0x20> lwz     r0,-32(r30)
>>>=20
>>> 013e8900  20 00 00 00 00 8e 3b f8  00 8e 3c 80 00 00 90 32  | =
.....;...<....2|
>>>=20
>>>            exc
>>> 013e8910  00 00 09 00 41 eb 30 00  0a 00 00 00 00 00 00 00  =
|....A.0.........|
>>> The 0x00009000 above is the framep->exc (exception code)
>>> (decrementer in this case).
>>>=20
>>> 013e8920  eb 10 25 b5 55 7e c4 22  00 00 00 00 00 00 00 04  =
|..%.U~."........|
>>>=20
>>> 013e8930  df 5e 59 50 00 00 00 01  00 00 00 01 00 eb a7 80  =
|.^YP............|
>>> (trap [above] means no lr filled in here)
>>>=20
>>> 013e8940  00 00 00 00 00 d4 ca 34  00 d2 6b f0 df 5e 59 50  =
|.......4..k..^YP|
>>>=20
>>> 013e8950  df 5e 59 70 00 8e 31 9c  00 00 00 02 00 eb a7 80  =
|.^Yp..1.........|
>>> 0x008e318c <cpu_idle+0x48> bl      008ad8b8 <cpu_idleclock>
>>> 0x008e3190 <cpu_idle+0x4c> lwz     r29,0(r29)
>>> 0x008e3194 <cpu_idle+0x50> mtctr   r29
>>> 0x008e3198 <cpu_idle+0x54> bctrl
>>> 0x008e319c <cpu_idle+0x58> bl      008ad7b8 <cpu_activeclock>
>>>=20
>>> 013e8960  00 f2 d5 fc 00 00 00 01  00 d1 ca ac df 5e 59 70  =
|.............^Yp|
>>>=20
>>> 013e8970  df 5e 5a 50 00 53 6e 7c  fa 50 05 af fa 50 05 af  =
|.^ZP.Sn|.P...P..|
>>> 0x00536e78 <sched_idletd+0x4d0> bl      008e3144 <cpu_idle>
>>> 0x00536e7c <sched_idletd+0x4d4> stw     r28,136(r27)
>>>=20
>>>=20
>>> 013e8980  fa 50 05 af fa 50 05 af  fa 50 05 af ff ff ff fe  =
|.P...P...P......|
>>> 013e8990  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  =
|................|
>>> 013e89a0  ff ff ff ff ff ff ff ff  ff ff ff ff fa 50 05 af  =
|.............P..|
>>> 013e89b0  fa 50 05 af 00 00 00 02  ff ff ff ff 00 00 01 f0  =
|.P..............|
>>> 013e89c0  ff ff ff fe ff ff ff ff  ff ff ff ff ff ff ff ff  =
|................|
>>> 013e89d0  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  =
|................|
>>> 013e89e0  ff ff ff fe ff ff ff ff  ff ff ff ff ff ff ff ff  =
|................|
>>> 013e89f0  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  =
|................|
>>> 013e8a00  df 5e 5a 20 fa 50 05 af  00 00 00 00 00 00 00 00  |.^Z =
.P..........|
>>> 013e8a10  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  =
|................|
>>> *
>>> 013e8a30  00 00 00 00 00 53 69 a8  df 5e 5a 98 00 00 00 00  =
|.....Si..^Z.....|
>>> 013e8a40  01 47 96 e0 01 47 d6 c0  00 d1 b3 70 df 5e 5a 50  =
|.G...G.....p.^ZP|
>>>=20
>>> 013e8a50  df 5e 5a 80 00 4a 3c b4  df 5e 5a 60 df 5e 5a 60  =
|.^Z..J<..^Z`.^Z`|
>>> 0x004a3ca0 <fork_exit+0xe4> bl      008ea420 <spinlock_exit>
>>> 0x004a3ca4 <fork_exit+0xe8> mr      r3,r27
>>> 0x004a3ca8 <fork_exit+0xec> mr      r4,r26
>>> 0x004a3cac <fork_exit+0xf0> mtctr   r25
>>> 0x004a3cb0 <fork_exit+0xf4> bctrl
>>> 0x004a3cb4 <fork_exit+0xf8> lwz     r0,108(r28)
>>>=20
>>> 013e8a60  df 5e 5a 80 00 00 00 00  00 00 00 00 00 00 00 00  =
|.^Z.............|
>>> 013e8a70  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  =
|................|
>>>=20
>>> 013e8a80  00 00 00 00 00 8f 18 d0  00 53 69 a8 00 00 00 00  =
|.........Si.....|
>>> 0x008f18c0 <fork_trampoline> lwz     r3,8(r1)
>>> 0x008f18c4 <fork_trampoline+0x4> lwz     r4,12(r1)
>>> 0x008f18c8 <fork_trampoline+0x8> lwz     r5,16(r1)
>>> 0x008f18cc <fork_trampoline+0xc> bl      004a3bbc <fork_exit>
>>> 0x008f18d0 <fork_trampoline+0x10> addi    r1,r1,16
>>> 0x008f18d4 <fork_trampoline+0x14> b       001008f8 <trapexit>
>>>=20
>>>=20
>>> So that is an example context for the failure.
>>>=20
>>> It has taken weeks to get this. It may be some
>>> time before I get another for comparison/contrast.
>>>=20
>>> And sticking with the same build vs. trying some more
>>> to find a better way to get more evidence is not
>>> obvious to me at this point.
>>=20
>> I should have mentioned that this is
>> TARGET_ARCH=3Dpowerpc but used on a so-called PowerMac
>> G5 "Quad Core".
>>=20
>> I've had 2 more examples, with the same 0x0090a030
>> srr0 and such (vmcore.0 and vmcore.1) (I have added
>> one more little block of code for detecting an earlier
>> problem symptom not being currently seen so the build
>> is slight different from the earlier report.)
>>=20
>> Looking around in vmcore.0 I find 3 examples of
>> "00 90 a0 30" in areas overlapping with objdump -x's
>> coverage. . .
>>=20
>> 00c77fd0  00 00 00 00 00 00 00 00  00 90 a0 30 00 08 10 32  =
|...........0...2|
>>                                  [         ]
>>=20
>> (from sorted objdump -x output:)
>> 00c775a8 l     O .data  00000dc0 seqprog
>> 00c78368 l     O .data  0000000c seeprom_long_ewen

I forgot to subtract the 0x1000 vmcore header
size from 00c77fd0 in what I looked up with objdump -x,
so correcting:

00c75010 g       .data  00000000 tmpstk
00c77010 g     O .data  00000004 dtrace_invop_calltrap_addr

tmpstk is involved via dbtrap:=20

dbtrap:
        /* Write the trap vector to SPRG3 by computing LR & 0xff00 */
        mflr    %r1
        andi.   %r1,%r1,0xff00
        mtsprg3 %r1

        lwz     %r1,TRAP_TOCBASE(0)             /* get new SP */
        lwz     %r1,tmpstk@got(%r1)
        addi    %r1,%r1,TMPSTKSZ-16
       =20
        FRAME_SETUP(PC_DBSAVE)
/* Call C trap code: */
        addi    %r3,%r1,8
        bl      CNAME(db_trap_glue)
. . .

I missed the below in vmcore.1 (vmcore.2 matches other than 0x30 vs. =
0x70):

                                               [         ]
00d4d090  00 ca 8a ac 00 10 00 03  00 10 00 03 00 90 a0 30  =
|...............0|
00d4d090  00 ca 8a ac 00 10 00 03  00 10 00 03 00 90 a0 70  =
|...............p|

00d4c090 g     O .sbss  00000004 db_cmd_table
00d4c094 g     O .sbss  00000004 db_dot
00d4c098 g     O .sbss  00000004 db_last_addr
00d4c09c g     O .sbss  00000004 db_prev
00d4c0a0 g     O .sbss  00000004 db_next

That puts the 0x0090a0?0 in db_prev .

I also missed:

                                               [         ]
00f2c070  00 00 00 00 df 5e 55 90  00 00 00 00 00 90 a0 30  =
|.....^U........0|
00f2c070  00 00 00 00 df 5e 55 90  00 00 00 00 00 90 a0 70  =
|.....^U........p|

00f2b020 l     O .bss   000004a0 kdb_pcb
00f2b4c0 l     O .bss   00000004 kobj_mutex_inited

That puts the 0x0090a0?0 in kdb_pcb.pcb_lr .
kdb_pcb is tied kdb_trap(.) use.


>>                                  [         ]
>> 00f65820  41 eb 30 00 0a 00 00 00  00 90 a0 30 00 08 10 32  =
|A.0........0...2|
>> . . .
>> 00f65870  00 90 a0 30 00 08 10 32  00 00 00 00 df 5d 30 00  =
|...0...2.....]0.|
>>         [         ]
>>=20
>> (from sorted objdump -x output:)
>> 00f64780 g     O .bss   00020000 __pcpu
>> 00f84780 l     O .bss   00000004 ap_letgo

Again I forgot to subtract off 0x1000 from
the vmcore offset but it happened to work
out anyway here for how little information
I reported at the time.

It turns out that the two references are:

__pcpu[0].pc_tempsave[7]
__pcpu[0].pc_dbsave[7]

and:

#define CPUSAVE_SRR0    7               /* where SRR0 gets saved */


Overall that means that all the copies of
the 0x0090a0?0 value are a consequence of
having reached the illegal instruction at
the bad address.

Looking at prior modulo 4 addresses
before the 0x009070?0 shows they are
all illegal instructions for such
memory contents.

Which suggests that the jump involved
was directly to the 0x0090a0?0 address,
but it is not clear where it was when
it jumped or how it came up with a
0x0090a0?0 address.

> Making some comparisons across
> two examples of the panic, prior
> vs. newest. . .
>=20
> The srr0 was different by 0x40:
>=20
>> For vmcore.1 I went ahead and tried exploring
>> some with ddb --and had no new panics. . .
>> (Hand transcriptions of pictures)
>>=20
>> fatal kernel trap:
>> exception =3D 0x700 (program)
>> srr0      =3D 0x90a030
>        vs.=3D 0x90a070
>> srr1      =3D 0x81032
>> lr        =3D 0x535ad0
>> curthread =3D 0x147d6c0
>> pic =3D 11, comm =3D idle: cpu0
>>=20
>> [ thread pid 11 tid 100003 ]
>> Stopped at _etext+0xb8fc: illegal instruction 0
>> db> bt
>> 0xdf5e55d0: at sched_wakeup+0xa4
>> 0xdf5e55f0: at setrunnable+0x9c
>> 0xdf5e5610: at sleepq_resume_thread+0x17c
>> 0xdf5e5640: at sleepq_timeout+0xc8
>> 0xdf5e5680: at softclock_call_cc+0x1f0
>> 0xdf5e56f0: at callout_process+0x27c
>> 0xdf5e57a0: at timercb+0x4c4
>> 0xdf5e5820: at decr_intr+0xf0
>> 0xdf5e5840: at powerpc_interrupt+0xf4
>> 0xdf5e5870: at kernel DECR trap
>>              by cpu_idle_60x+0x88 (so: srr0)
>>              srr1=3D0x9032
>>              r1  =3D0xdf5e5930
>>              cr  =3D0x40000042
>>              xer =3D0x20000000
>>              ctr =3D0x8e3bf8
>> saved LR(0xfffffffd) is invalid.

So sometime while sched_wakeup is
active on the stack a jump to a
bad 0x0090a0?0 address is executed.

> Below I show an alternate example
> values were they are different, where
> the patch that I attempted was also
> involved for the new figures.
>=20
>> db> show reg (but reformatted
>> r0   =3D0x4
>> r1   =3D0xdf5e5590
>> r2   =3D0x147d6c0
>> r3   =3D0x54       testppc64size+0x34
>   vs.=3D0x78       dlmisssize+0x8
>> r4   =3D0x591d000
>   vs.=3D0x5aa4360
>> r5   =3D0
>> r6   =3D0
>> r7   =3D0xf
>> r8   =3D0
>> r9   =3D0xd4c03c   cold
>> r10  =3D0x147d6c0
>> r11  =3D0xdf5e55d0
>> r12  =3D0
>> r13  =3D0
>> r14  =3D0xd4bdec   sdt_probe_func
>> r15  =3D0xcb9898   std_lockstat___spin__release
>> r16  =3D0xd4c45c   callwheelmask
>> r17  =3D0xd4c45c   callwheelmask
>> r18  =3D0x55925
>   vs.=3D0x15c267   fdt_property+0x113
>> r19  =3D0x559a4
>   vs.=3D0x15c2e6   fdt_property+0x192
>> r20  =3D0x559      dsmisssize+0x469
>   vs.=3D0x15c2     dsmisssize+0x14d2
>> r21  =3D0x591d000
>   vs.=3D0x5aa4360
>> r22  =3D0x566430   sleepq_timeout
>> r23  =3D0x114      dsmisssize+0x24
>> r24  =3D0
>> r25  =3D0
>> r26  =3D0x1
>> r27  =3D0
>> r28  =3D0xeba780   tdq_cpu
>> r29  =3D0x147d6c0
>> r30  =3D0xd1caac
>> r31  =3D0xdf5e5590
> srr0 =3D0x90a030    _etext+0xb8fc
>  vs.=3D0x90c070    _etext+0xb8fc
>> srr1 =3D0x81032
>> lr   =3D0x535ad0   sched_affinity+0x18
>> ctr  =3D0
>> cr   =3D0x20009034
>   vs.=3D0x20009044
>> xer  =3D0
>> dar  =3D0x419df5d4
>   vs.=3D0x41d276f4
>> dsisr=3D0x24000000
>   vs.=3D0xa0000000
>> _etext+0xb8fc: illegal instruction 0
>>=20
>>=20
>> Just for completeness: acttrace also showed:
>>=20
>> Tracing command less pid 1144 tid 100150 td 0x5bc8a20 (CPU 3)
>> 0xd25a59f0: at powrpc_dispatch_intr+0xc8
>> 0xd25a5a20: at openpic_dispatch+0x90
>> 0xd25a5a50: at powerpc_interrupt+0xc0
>> 0xd25a5a80: at user EXI trap
>>              by 0x181c68c (so: ssr0)
>>              r1 =3D0xffffdb30
>>              cr =3D0x44020624
>>              xer=3D0
>>              ctr=3D0x41989570
>=20
> The newer example panic had CPU3 more like CPU 1 and 2 below.
>=20
>> Tracing command idle pid 11 tid 100004 td 0x147d360 (CPU 1)
>> saved LR(0x4c) in invalid
>>=20
>> Tracing command idle pid 11 tid 100005 td 0x147d360 (CPU 2)
>> saved LR(0x4c) in invalid

Since tmpstk has the trap content shown
(including the trap frame) (the thread
stack agrees but for the vmcore offset):

00c77f40  df 5e 55 90 00 10 0c 54  00 00 00 04 df 5e 55 90  =
|.^U....T.....^U.|
00c77f50  01 47 d6 c0 00 00 00 78  05 aa 43 60 00 00 00 00  =
|.G.....x..C`....|
00c77f60  00 00 00 00 00 00 00 0f  00 00 00 00 00 d4 c0 3c  =
|...............<|
00c77f70  01 47 d6 c0 df 5e 55 d0  00 00 00 00 00 00 00 00  =
|.G...^U.........|
00c77f80  00 d4 bd ec 00 cb 98 98  00 d4 c4 5c 00 d4 c4 5c  =
|...........\...\|
00c77f90  00 15 c2 67 00 15 c2 e6  00 00 15 c2 05 aa 43 60  =
|...g..........C`|
00c77fa0  00 56 64 30 00 00 01 14  00 00 00 00 00 00 00 00  =
|.Vd0............|
00c77fb0  00 00 00 01 00 00 00 00  00 eb a7 80 01 47 d6 c0  =
|.............G..|
00c77fc0  00 d1 ca ac df 5e 55 90  00 53 5a d0 20 00 90 44  =
|.....^U..SZ. ..D|
                                   [    lr   ]
                                   [   srr0  ]
00c77fd0  00 00 00 00 00 00 00 00  00 90 a0 70 00 08 10 32  =
|...........p...2|
00c77fe0  00 00 07 00 00 00 00 00  00 00 00 00 01 c4 4f 00  =
|..............O.|
00c77ff0  00 00 00 00 00 10 01 40  00 00 00 00 00 00 00 00  =
|.......@........|

and that lr (also in show reg's output) is in:

0x535ab8 <sched_affinity>:	stwu    r1,-32(r1)
0x535abc <sched_affinity+4>:	mflr    r0
0x535ac0 <sched_affinity+8>:	stw     r29,20(r1)
0x535ac4 <sched_affinity+12>:	stw     r30,24(r1)
0x535ac8 <sched_affinity+16>:	stw     r31,28(r1)
0x535acc <sched_affinity+20>:	stw     r0,36(r1)
0x535ad0 <sched_affinity+24>:	mr      r31,r1

which is also an odd place for lr to point.

As for srr0:

0x90a060 <.hash+47324>:	.long 0xad6
0x90a064 <.hash+47328>:	.long 0x0
0x90a068 <.hash+47332>:	.long 0x85
0x90a06c <.hash+47336>:	.long 0x0
0x90a070 <.hash+47340>:	.long 0x0
0x90a074 <.hash+47344>:	.long 0x532
0x90a078 <.hash+47348>:	.long 0x0
0x90a07c <.hash+47352>:	.long 0x0
0x90a080 <.hash+47356>:	.long 0x0

I have not found any indication why the lr and srr0
values end up as they are above or what code jumped
to 0x0090a0?0.

Sort of preventing fetches from being allowed for
the bad-jump's target area I've not had any ideas
for finding what code jumped. But I've not figured
out how to do that protection of some pages either.
(And I'm not sure that FreeBSD is designed to deal
with failing to fetch on powerpc.)

=3D=3D=3D
Mark Millard
markmi at dsl-only.net





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?BE41E1EF-3A45-4B53-B762-1878272AFB15>