Date: Thu, 18 May 2017 03:04:26 -0700
From: Mark Millard <markmi@dsl-only.net>
To: FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Cc: freebsd-hackers@freebsd.org
Subject: head -r317820: An example TARGET_ARCH=powerpc panic with interesting information reported (PowerMac G5 so-called "Quad Core" context)
Message-ID: <3D469253-A16F-4723-B459-38BE01FFB051@dsl-only.net>
Context: The context is a TARGET_ARCH=powerpc build used on an old
PowerMac G5 so-called "Quad Core". Currently I'm using head -r317820
for investigating.

I've been having problems with production-style kernel builds for
TARGET_ARCH=powerpc getting occasional fatal kernel traps, but the
debug kernel works and does not report anything odd. I see this even
with pure gcc 4.2.1 based builds, not just my clang experiments. Also,
all the TARGET_ARCH=powerpc64 builds seem to work fine on the same
machines.

All the trap reports indicate pid 11 and one of the cpu idle threads,
more often cpu 0 than the others but not limited to one cpu/thread.

Interesting evidence based on the trap frame:

I caught a data storage interrupt fatal kernel trap for which the ddb
information is interesting . . .

For the fatal kernel trap notice for pid 11's cpu0 thread (there is a
"timeout stopping cpus"), srr0 indicates 0x5358e0, which is:

sched_choose+0x100: lwz r30, -0x8(r11)

in the build then in use. But lr is listed as lr 0x5358d0, which is
before srr0, lr pointing at:

sched_choose+0xf0: lwz r0,4(r11)

*** Does this ordering of lr vs. srr0 content make sense? Is it
evidence of a problem? ***

The code in the area looks like:

   005358b8 <sched_choose+0xd8>  li      r0,-1
   005358bc <sched_choose+0xdc>  stb     r0,152(r29)
   005358c0 <sched_choose+0xe0>  mfsprg  r9,0
   005358c4 <sched_choose+0xe4>  lwz     r28,4(r9)
   005358c8 <sched_choose+0xe8>  mr      r3,r28
   005358cc <sched_choose+0xec>  lwz     r11,0(r1)
=> 005358d0 <sched_choose+0xf0>  lwz     r0,4(r11)
   005358d4 <sched_choose+0xf4>  mtlr    r0
   005358d8 <sched_choose+0xf8>  lwz     r28,-16(r11)
   005358dc <sched_choose+0xfc>  lwz     r29,-12(r11)
=> 005358e0 <sched_choose+0x100> lwz     r30,-8(r11)
   005358e4 <sched_choose+0x104> lwz     r31,-4(r11)
   005358e8 <sched_choose+0x108> mr      r1,r11
   005358ec <sched_choose+0x10c> blr

The r3, r28, r11, r9, and r0 values in the trap frame do not seem to
match the code. For example, at 005358d0 the above would have r3 equal
to r28, but the "show reg" shows:

r3  0x0
r28 0x7c2803c6

The r1 value and vmcore.3 content indicate 0x0147d6c0 for the later
"lwz r28,-16(r11)".

"show reg" reported both of the following for the trap frame:

r1  0xdf5e58c0
r11 0xdf5e58c0

which would indicate r11 not being from the result of "lwz r11,0(r1)".

vmcore.3 content indicates two things:

A) The trap frame shows the 0xdf5e58c0 for r11.
B) 0(r1) would have been 0xdf5e58e0.

In vmcore.3, 0xdf5e58e0 has an associated lr value next to it,
0x5358ac, matching up with the return from the last bl in sched_choose:

   005358a0 <sched_choose+0xc0>  b       005358ac <sched_choose+0xcc>
   005358a4 <sched_choose+0xc4>  mr      r4,r28
   005358a8 <sched_choose+0xc8>  bl      00500010 <runq_remove>
   005358ac <sched_choose+0xcc>  lbz     r0,622(r28)

So it looks as expected in vmcore.3. Then either:

a) The trap frame is from prior to the "lwz r11,0(r1)" result.
b) The 0(r1) access was not based on coherent/up-to-date cache/memory
   contents and so filled r11 with junk.
c) The r11 value in the trap frame has been corrupted some other way.

The trap frame shows:

r9 0x1f8

But 0x1f8 is much smaller than what I see on a live system for sprg0:
more like 0xf65f00. And the 4(r9) would not get to a 0x7c2803c6 value
from what I see in vmcore.3.

The trap frame's r0 0x4bfca6c1 does not fit with either "li r0,-1" nor
the code:

=> 005358d0 <sched_choose+0xf0>  lwz     r0,4(r11)
   005358d4 <sched_choose+0xf4>  mtlr    r0

(The trap frame reports lr as 0x5358d0.)

(r0's value only exists in two places in vmcore.3, one being the trap
frame.)
The vmcore.3 and r1 would suggest the result of:

   005358cc <sched_choose+0xec>  lwz     r11,0(r1)
=> 005358d0 <sched_choose+0xf0>  lwz     r0,4(r11)
   005358d4 <sched_choose+0xf4>  mtlr    r0
   005358d8 <sched_choose+0xf8>  lwz     r28,-16(r11)
   005358dc <sched_choose+0xfc>  lwz     r29,-12(r11)
=> 005358e0 <sched_choose+0x100> lwz     r30,-8(r11)
   005358e4 <sched_choose+0x104> lwz     r31,-4(r11)
   005358e8 <sched_choose+0x108> mr      r1,r11

should be the sequence of register assignments:

r11 = 0xdf5e58e0
r0  = 0x00500460
lr  = 0x00500460
r28 = 0x0147d6c0
r29 = 0x00d4d7c8
r30 = 0x00d1d6c4
r31 = 0xdf5e58e0
r1  = 0xdf5e58e0

None of which show up in the trap frame. But even the trap frame's
005358d0 would suggest the first of the above.

As for the virtual address reported for the failure:

virtual address = 0x7c2803ba

There is only one register listed with a value near that:

r28 0x7c2803a6

(This value is in vmcore.3 something like 677 times.) And:

0x7c2803ba - 0x7c2803a6 == 0x14 (20 decimal)

But it does not fit for the code getting to 0x7c2803ba from r28.
0x7c2803ba only exists in 3 or so places in vmcore.3 (including the
trap frame instance). I'm not sure how 0x7c2803ba ended up in the dar
(or its spot in the trap frame) in vmcore.3.

Supporting detail:

Looking around in the vmcore.3 file showed that the area I found with
the proper back chain and lr links was present (this was
debug.minidump=0), complete with having one of the 3 instances of the
failing virtual memory address (0x7c2803ba) nearby, which turned out
to be the dar value from the trap frame when I looked in detail.

(Of course, in vmcore.3 I'm seeing physical memory addresses offset by
the dump's header size for locations of memory. So I'm learning what
the mapping was by finding the region with the content.)

Notes:

I have evidence that changes in the memory layout of the kernel (such
as by adding something to potentially use in investigating) change the
details of the failure behavior. I have had contexts where the
failures would happen when the PowerMac involved was booted but not in
active use. In other contexts the machine did not fail while idle, but
something like "find / -name dog -print | more" led to failure before
completing. In still other contexts such things completed just fine.

libkvm was never updated to deal with the powerpc and powerpc64
relocatable kernel format changes. On the PowerMac it seems to leave
the kernel at its default place, but the format is ET_DYN. More than
just testing for ET_DYN is required in libkvm. And it appears that for
powerpc and powerpc64 it never supported debug.minidump=0.

The open Bugzilla 219153 reports this issue and has more notes.

===
Mark Millard
markmi at dsl-only.net