Date:      Thu, 16 Jun 2011 01:12:26 +0200
From:      Marius Strobl <marius@alchemy.franken.de>
To:        Peter Jeremy <peterjeremy@acm.org>
Cc:        freebsd-sparc64@freebsd.org
Subject:   Re: 'make -j16 universe' gives SIReset
Message-ID:  <20110615231226.GY7064@alchemy.franken.de>
In-Reply-To: <20110614214959.GB91014@server.vk2pj.dyndns.org>
References:  <20110526234728.GA69750@server.vk2pj.dyndns.org> <20110527120659.GA78000@alchemy.franken.de> <20110601231237.GA5267@server.vk2pj.dyndns.org> <20110608224801.GB35494@alchemy.franken.de> <20110613235144.GA12470@server.vk2pj.dyndns.org> <20110614214959.GB91014@server.vk2pj.dyndns.org>

On Wed, Jun 15, 2011 at 07:49:59AM +1000, Peter Jeremy wrote:
> On 2011-Jun-14 09:51:44 +1000, Peter Jeremy <peter@server.vk2pj.dyndns.org> wrote:
> >I'm building r223035 with DDB & KDB and will see how that goes.
> 
> I had another try with WITNESS & INVARIANTS and got a different panic:
> panic: blockable sleep lock (sleep mutex) system map @ /usr/src/sys/vm/vm_map.c:3651
> cpuid = 13
> KDB: stack backtrace:
> panic() at panic+0x1c8
> witness_checkorder() at witness_checkorder+0x108
> _mtx_lock_flags() at _mtx_lock_flags+0x110
> _vm_map_lock_read() at _vm_map_lock_read+0x1c
> vm_map_lookup() at vm_map_lookup+0x4c
> vm_fault_hold() at vm_fault_hold+0x94
> vm_fault() at vm_fault+0x14
> trap_pfault() at trap_pfault+0x338
> trap() at trap+0x3a8
> -- fast data access mmu miss tar=0x2000 %o7=0xc055e038 --
> intr_vector_stray() at intr_vector_stray+0x8
> sched_switch() at sched_switch+0x290
> mi_switch() at mi_switch+0x2a8
> sleepq_switch() at sleepq_switch+0x1cc
> sleepq_catch_signals() at sleepq_catch_signals+0x130
> sleepq_timedwait_sig() at sleepq_timedwait_sig+0x8
> _cv_timedwait_sig() at _cv_timedwait_sig+0x344
> seltdwait() at seltdwait+0x74
> kern_select() at kern_select+0x618
> select() at select+0x44
> syscallenter() at syscallenter+0x270
> syscall() at syscall+0x74
> -- syscall (93, FreeBSD ELF64, select) %o7=0x1099dc --
> userland() at 0x14bde8
> user trace: trap %o7=0x1099dc
> pc 0x14bde8, sp 0x7fdffffc8d1
> pc 0x26c800, sp 0x26c800
> done
> KDB: enter: panic
> 
> Unfortunately, still no DDB - just a hang

This backtrace shows two things that just shouldn't happen hardware-wise:
a) The CPU issues a stray interrupt vector. This would explain the SIRs
   you were seeing without the patch that tries to make these non-fatal.
b) The CPU faults on an address which is covered by a locked TLB slot.

The odd thing is that the CPU still manages to panic afterwards; if
something like b) occurs, I'd expect it to be in a totally unusable state.
I'm not sure what to do about these, as it still looks like broken
hardware or a silicon bug to me. The public errata, at least, don't
mention anything like this, and the OpenSolaris source doesn't seem to
work around either problem in an obvious way. The only thing I can think
of is to test whether simply ignoring the stray interrupt vectors with the
patch below avoids any further issues. You'll need to revert
sparc64_intr_vector_stray.diff first, or at least its exception.S part.

Marius

Index: exception.S
===================================================================
--- exception.S	(revision 223042)
+++ exception.S	(working copy)
@@ -578,7 +578,7 @@
 	andcc	%g1, IRSR_BUSY, %g0
 	bnz,a,pt %xcc, intr_vector
 	 nop
-	sir
+	retry
 	.align	32
 	.endm
 


