Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 6 Jun 2017 11:09:17 -0700
From:      Mark Millard <markmi@dsl-only.net>
To:        Justin Hibbits <jhibbits@FreeBSD.org>, Nathan Whitehorn <nwhitehorn@freebsd.org>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: A different 32-bit powerpc head -r317820 panic on old PowerMac G5: dual backtraces from "timeout stopping cpus" (dump failed though): any comments?
Message-ID:  <1F1E52BD-375E-47CC-BF06-ECB1092121B4@dsl-only.net>
In-Reply-To: <D69CB244-69E2-4319-BD63-07BC7F763279@dsl-only.net>
References:  <D69CB244-69E2-4319-BD63-07BC7F763279@dsl-only.net>

next in thread | previous in thread | raw e-mail | index | archive | help
On 2017-Jun-6, at 3:46 AM, Mark Millard <markmi@dsl-only.net> wrote:

> I'm not sure what to make of this. May be someone has
> some idea.
>=20
> One of the bt's is for "irq53: smudoorbell0" (tid 100055) the other is =
for usb.
>=20
> spinlock 0xefbbf00 (sched lock 1) held by 0x593d6c0 (tid 100055) too =
long
> spinlock 0xxf2e1f8 (sleepq chain) help by 0x17606c0 (tid 100049) too =
long
> timeout stopping cpus
>=20
> usb. . .
> (I omit any "at kdb_backtrace" lines)
>=20
> 0xe56636d0: at vpanic+0x210
> 0xe5663740: at panic+0x54
> 0xe5663790: at _mtx_lock_spin_failed_+0x54
> 0xe56637b0: at _mtx_lock_spin_cookie+0x19c
> 0xe5663810: at sched_add+0x140
> 0xe5663850: at sched_wakeup+0xa4
> 0xe5663870: at setrunnable+0x9c
> 0xe5663890: at sleepq_resume_thread+0x17c
> 0xe56638c0: at sleepq_signal+0xa4
> 0xe56638f0: at cv_signal+0x94
> 0xe5663910: at usbd_do_request_callback+0x80
> 0xe5663930: at usb_request_callback_0x28
> 0xe5663950: at usbd_callback_wrapper+0xa98
> 0xe5663990: at usb_command_wrapper+0x14c
> 0xe56639d0: at usb_callback_proc+0x190
> 0xe5663a00: at usb_process+0x1d0
> 0xe5663a50: at fork_exit+0xf4
> 0xe5663a80: at fork_trampoline+0xc
>=20
> irq53: smudoorbell0. . .
>=20
> 0xe56e4460: at vpanic+0x22c
> 0xe56e44d0: at panic+0x54
> 0xe56e4520: at _mtx_lock_spin_failed+0x54
> 0xe56e4540: at thread_lock_flags_0x1bc
> 0xe56e45b0: at sleepq_timeout+0x44
> 0xe56e45f0: at softclock_call_cc+0x1bc
> 0xe56e4660: at callout_process+0x27c
> 0xe56e46c0: at handleevents+0x2ac
> 0xe56e4710: at timercb+0x4c4
> 0xe56e4790: at decr_intr+0xf0
> 0xe56e47b0: at powerpc_intrerrupt+0xf4
> 0xe56e47e0: at kernel DECR trap
>               by _mtx_lock_spin_cookie+0x194
>               srr1=3D 0x9032
>               r1=3D   0xe56e48a0
>               cr=3D   0x20000002
>               xer=3D  0
>               ctr=3D  0
> 0xe56e48a0: at _mtx_lock+_spin_cookie+0x190
> 0xe56e4900: at sleepq_lock+0xac
> 0xe56e4930: at wakeup+0x24
> 0xe56e4950: at smu_doorbell_intr+0x128
> 0xe56e4980: at intr_event_execute_handler+0x220
> 0xe56e49f0: at ithread_loop+0xf0
> 0xe56e4a50: at fork_exit+0xf4
> 0xe56e4a80: at fork_trampoline+0xc
>=20
> For reference acttrace also reported
> (that looks odd after the first 2
> stack lines):
>=20
> powerpd pid 960 tid 100097 td 0x5ba1999 (CPU 2)
> 0xd2418510: at uma_zalloc_arg+0x178
> 0xd2418580: at mmu_unmapdev+desc+0x8
> 0x00fcf9dc: at primes+0x68
> 0x00d026b0: at moe64_page_exists_quick+0x1d8
> 0x00ce0ff8: at mmu_unmapdev_desc+0x8
>=20
> (CPU3 had an empty backtrace for pid 11's
> tid 100006.)
>=20
>=20
> I had another example of sorts but it did not
> get far before hanging up:
>=20
> spinlock 0xefbbf00 (sched lock 1) held by 0x146d360 (tid 100004) too =
long
> spinlock 0xefbbf00 (sched lock 1) held by 0x146d360 (tid 100004) too =
long
>=20
> (Note: Two different colored lines above but the
> same otherwise.)
>=20
> timeout stopping cpus
>=20
> Note: The partial backtrace was:
> ("cpuid =3D 2", all in the first color above)
>=20
> 0xdf6bb5c0: at vpanic+0x210
> 0xdf6bb630: at panic++0x54
> 0xdf6bb680: at _mtx_lock_spin_failed+0x54
> 0xdf6bb6a0: at _mtx_lock_spin_cookie+0x190
> 0xdf6bb700: at sched_add+0x140
> 0xdf6bb740: at sched_wakeup+0xa4
> 0xdf6bb760: at=20
>=20
> (and that is as far as it got for the backtrace(s))


FYI: I'm currently doing an approximate
binary search for localizing part of the panic problem.

This is based on the classic panics that are instead
from jumping to a non-code area. . .

At a given point in my other experiments I was
getting:

srr0=3D0x90a0f0 etext+0xb8fc

Adding (unused) code somewhat before that etext
(so increasing etext) got:

srr0=3D0x90a0f0 etext+0xb8a8
(The additional code was larger than I now use.)

But instead adding some code earlier (by around
0x100000 in this example) got:

srr0=3D0x90a110 etext+0xb8fc

So comparing to the starting conditions in
each case:

The bad-address accessed in one case stayed
constant but the etext offset decreased: in essence
the only thing that happened is etext increased
(matching the offset decrease).

In the other case the etext offset stayed constant
but the bad-address and etext increased by the
same amount.

Unfortunately this will take a while:

A) It typically takes hours for a failure to happen,
   I've had up to between 7 hours and 8 hours.
   (I've had minutes-to-failure but it is rare.)

B) I'd let it go 24 hours before concluding that
   a test combination was not likely to produce
   an obvious form of failure (such as the panic).

(I've had (B) in some earlier forms of testing
variations of things.)

Another time-taker is that I need to be around
because I avoid leaving the PowerMac G5 panicked
with the fans going fill-speed for long periods.

Currently I'm adding code by adding:

void HACKISH_EXTRA_CODE(void) {}

to one .c file from /usr/src/sys/. . . based which
file gets to within a ballpark of a more accurate
binary search position. (Large binary search
jumps currently: I'm not being picky about where
in the .c the addition is made yet.)



=3D=3D=3D
Mark Millard
markmi at dsl-only.net





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1F1E52BD-375E-47CC-BF06-ECB1092121B4>