Date: Tue, 6 Jun 2017 11:09:17 -0700 From: Mark Millard <markmi@dsl-only.net> To: Justin Hibbits <jhibbits@FreeBSD.org>, Nathan Whitehorn <nwhitehorn@freebsd.org>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>, freebsd-hackers@freebsd.org Subject: Re: A different 32-bit powerpc head -r317820 panic on old PowerMac G5: dual backtraces from "timeout stopping cpus" (dump failed though): any comments? Message-ID: <1F1E52BD-375E-47CC-BF06-ECB1092121B4@dsl-only.net> In-Reply-To: <D69CB244-69E2-4319-BD63-07BC7F763279@dsl-only.net> References: <D69CB244-69E2-4319-BD63-07BC7F763279@dsl-only.net>
next in thread | previous in thread | raw e-mail | index | archive | help
On 2017-Jun-6, at 3:46 AM, Mark Millard <markmi@dsl-only.net> wrote: > I'm not sure what to make of this. May be someone has > some idea. >=20 > One of the bt's is for "irq53: smudoorbell0" (tid 100055) the other is = for usb. >=20 > spinlock 0xefbbf00 (sched lock 1) held by 0x593d6c0 (tid 100055) too = long > spinlock 0xxf2e1f8 (sleepq chain) help by 0x17606c0 (tid 100049) too = long > timeout stopping cpus >=20 > usb. . . > (I omit any "at kdb_backtrace" lines) >=20 > 0xe56636d0: at vpanic+0x210 > 0xe5663740: at panic+0x54 > 0xe5663790: at _mtx_lock_spin_failed_+0x54 > 0xe56637b0: at _mtx_lock_spin_cookie+0x19c > 0xe5663810: at sched_add+0x140 > 0xe5663850: at sched_wakeup+0xa4 > 0xe5663870: at setrunnable+0x9c > 0xe5663890: at sleepq_resume_thread+0x17c > 0xe56638c0: at sleepq_signal+0xa4 > 0xe56638f0: at cv_signal+0x94 > 0xe5663910: at usbd_do_request_callback+0x80 > 0xe5663930: at usb_request_callback_0x28 > 0xe5663950: at usbd_callback_wrapper+0xa98 > 0xe5663990: at usb_command_wrapper+0x14c > 0xe56639d0: at usb_callback_proc+0x190 > 0xe5663a00: at usb_process+0x1d0 > 0xe5663a50: at fork_exit+0xf4 > 0xe5663a80: at fork_trampoline+0xc >=20 > irq53: smudoorbell0. . . >=20 > 0xe56e4460: at vpanic+0x22c > 0xe56e44d0: at panic+0x54 > 0xe56e4520: at _mtx_lock_spin_failed+0x54 > 0xe56e4540: at thread_lock_flags_0x1bc > 0xe56e45b0: at sleepq_timeout+0x44 > 0xe56e45f0: at softclock_call_cc+0x1bc > 0xe56e4660: at callout_process+0x27c > 0xe56e46c0: at handleevents+0x2ac > 0xe56e4710: at timercb+0x4c4 > 0xe56e4790: at decr_intr+0xf0 > 0xe56e47b0: at powerpc_intrerrupt+0xf4 > 0xe56e47e0: at kernel DECR trap > by _mtx_lock_spin_cookie+0x194 > srr1=3D 0x9032 > r1=3D 0xe56e48a0 > cr=3D 0x20000002 > xer=3D 0 > ctr=3D 0 > 0xe56e48a0: at _mtx_lock+_spin_cookie+0x190 > 0xe56e4900: at sleepq_lock+0xac > 0xe56e4930: at wakeup+0x24 > 0xe56e4950: at smu_doorbell_intr+0x128 > 0xe56e4980: at intr_event_execute_handler+0x220 > 0xe56e49f0: at ithread_loop+0xf0 > 0xe56e4a50: at fork_exit+0xf4 > 0xe56e4a80: at fork_trampoline+0xc >=20 > For reference acttrace also reported > (that looks odd after the first 2 > stack lines): >=20 > powerpd pid 960 tid 100097 td 0x5ba1999 (CPU 2) > 0xd2418510: at uma_zalloc_arg+0x178 > 0xd2418580: at mmu_unmapdev+desc+0x8 > 0x00fcf9dc: at primes+0x68 > 0x00d026b0: at moe64_page_exists_quick+0x1d8 > 0x00ce0ff8: at mmu_unmapdev_desc+0x8 >=20 > (CPU3 had an empty backtrace for pid 11's > tid 100006.) >=20 >=20 > I had another example of sorts but it did not > get far before hanging up: >=20 > spinlock 0xefbbf00 (sched lock 1) held by 0x146d360 (tid 100004) too = long > spinlock 0xefbbf00 (sched lock 1) held by 0x146d360 (tid 100004) too = long >=20 > (Note: Two different colored lines above but the > same otherwise.) >=20 > timeout stopping cpus >=20 > Note: The partial backtrace was: > ("cpuid =3D 2", all in the first color above) >=20 > 0xdf6bb5c0: at vpanic+0x210 > 0xdf6bb630: at panic++0x54 > 0xdf6bb680: at _mtx_lock_spin_failed+0x54 > 0xdf6bb6a0: at _mtx_lock_spin_cookie+0x190 > 0xdf6bb700: at sched_add+0x140 > 0xdf6bb740: at sched_wakeup+0xa4 > 0xdf6bb760: at=20 >=20 > (and that is as far as it got for the backtrace(s)) FYI: I'm currently doing an approximate binary search for localizing part of the panic problem. This is based on the classic panics that are instead from jumping to a non-code area. . . At a given point in my other experiments I was getting: srr0=3D0x90a0f0 etext+0xb8fc Adding (unused) code somewhat before that etext (so increasing etext) got: srr0=3D0x90a0f0 etext+0xb8a8 (The additional code was larger than I now use.) But instead adding some code earlier (by around 0x100000 in this example) got: srr0=3D0x90a110 etext+0xb8fc So comparing to the starting conditions in each case: The bad-address accessed in one case stayed constant but the etext offset decreased: in essence the only thing that happened is etext increased (matching the offset decrease). In the other case the etext offset stayed constant but the bad-address and etext increased by the same amount. Unfortunately this will take a while: A) It typically takes hours for a failure to happen, I've had up to between 7 hours and 8 hours. (I've had minutes-to-failure but it is rare.) B) I'd let it go 24 hours before concluding that a test combination was not likely to produce an obvious form of failure (such as the panic). (I've had (B) in some earlier forms of testing variations of things.) Another time-taker is that I need to be around because I avoid leaving the PowerMac G5 panicked with the fans going fill-speed for long periods. Currently I'm adding code by adding: void HACKISH_EXTRA_CODE(void) {} to one .c file from /usr/src/sys/. . . based which file gets to within a ballpark of a more accurate binary search position. (Large binary search jumps currently: I'm not being picky about where in the .c the addition is made yet.) =3D=3D=3D Mark Millard markmi at dsl-only.net
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1F1E52BD-375E-47CC-BF06-ECB1092121B4>