Date:      Fri, 27 May 2011 14:34:04 +0200
From:      Marius Strobl <marius@alchemy.franken.de>
To:        Peter Jeremy <peterjeremy@acm.org>
Cc:        freebsd-sparc64@freebsd.org
Subject:   Re: 'make -j16 universe' gives SIReset
Message-ID:  <20110527123404.GB78000@alchemy.franken.de>
In-Reply-To: <20110527120659.GA78000@alchemy.franken.de>
References:  <20110526234728.GA69750@server.vk2pj.dyndns.org> <20110527120659.GA78000@alchemy.franken.de>

On Fri, May 27, 2011 at 02:06:59PM +0200, Marius Strobl wrote:
> On Fri, May 27, 2011 at 09:47:28AM +1000, Peter Jeremy wrote:
> > I tried a "make -j16 universe" using a recent 8-stable on a 16-CPU
> > V890 and after about 11 minutes, I got the following.  This box
> > had been running Solaris without problem for several years so I'm
> > inclined to suspect a software issue.
> 
> It probably doesn't hurt to check the hardware with SunVTS though.
> 
> > Any suggestions?
> > 
> > ERROR: CPU4 SIReset
> > 
> > 
> > System State (CPU4 reporting)
> > 
> >   BBC Devices: 0000.0000.0000.000f    0000.0000.0000.000f
> >   BBC Arb:     0000.0000.0000.000f    0000.0000.0000.000f
> >   BBC Quiesce: 0000.0000.0000.0003    0000.0000.0000.0003
> >   BBC WDogAct: 0000.0000.0000.0000    0000.0000.0000.0000
> >   BBC POR Gen: 0000.0000.0000.0000    0000.0000.0000.0000
> >   BBC XIR Gen: 0000.0000.0000.0000    0000.0000.0000.0000
> >   BBC POR Src: 0000.0000.0000.0000    0000.0000.0000.0000
> >   BBC XIR Src: 0000.0000.0000.000f    0000.0000.0000.000f
> >   BBC EBus TC: 014f.99fd.a7e6.3f29    014f.99fd.a7e6.3f29
> > 
> > CMP0 Core Config/Control registers: 
> > 
> >   CoreAvail:   0000.0000.0000.0003 0 1
> >   CoreEnabled: 0000.0000.0000.0003 0 1
> >   CoreRunning: 0000.0000.0000.0003 0 1
> >   XIRSteering: 0000.0000.0000.0003 0 1
> >   ErrSteering: 0000.0000.0000.0000
> > 
> > CPU0 Config/Control/Status registers: 
> > 
> >   CPUVersion:  003e.0018.3100.0507
> >   SafConfig:   0caa.01bc.2000.8002 9:1 ID:0 HBM TOL:15
> >   SafBaseAdr:  0000.0400.0000.0000
> >   DispatchCtl: 0000.0000.0000.0009 MS SI
> >   DCacheCtl:   0000.0200.0000.0010 WE
> >   ECacheCtl:   0000.0000.01c5.5000 5:1 8MB mode=5-5-5(2) R/W-turn:2 Late-Sel ECC:off
> >   ErrorEnable: 0000.0000.0000.000b CEEN NCEEN UCEEN
> > 
> >   AFAR:        0000.0000.0000.0000
> >   AFSR:        0000.0000.0000.0000 (no errors set)
> >   AFAR 2:      0000.0000.8000.0000
> >   AFSR 2:      0000.0000.0000.0000 (no errors set)
> > 
> >   DMMU SFAR:   0000.0000.f3f8.c300
> >   DMMU SFSR:   0000.0000.0000.0000 (no status set)
> >   IMMU SFSR:   0000.0000.0080.8000 TM
> > 
> 
> This doesn't indicate much, especially not the address of the instruction
> causing the SIR, except that there was an i-TLB miss, which seems innocuous.
> Generally, FreeBSD only triggers a SIR when something really unexpected
> happens in an environment where we can't, or at least can't easily, trigger
> a panic. The only exception to this that is not really fatal from the
> OS point of view is stray vector interrupts (IIRC even OpenSolaris just
> ignores a certain number of these). You could check whether the following
> patch makes any difference to the SIR you're seeing:
> http://people.freebsd.org/~marius/sparc64_intr_vector_stray.diff
> Generally, both USIV and V880 with USIII (which should be quite close to
> a V890) are rather quirky hardware; I've already hit two CPU bugs which
> are not documented in the publicly available errata. Two other things
> to try are to replace the following in cheetah.c:
> 	val &= ~DCR_DTPE;
> once with:
> 	val &= ~(DCR_DTPE | DCR_ITPE);
> and once with:
> 	val &= ~DCR_SI;
> Besides that, IIRC I haven't added a workaround for the USIV+ erratum #4
> so far, which seems unlikely to be the cause of this problem though.
> 

Err, wait, I've just noticed that your machine has USIV rather than
USIV+ CPUs, so the latter shouldn't apply. It's probably still worth
checking whether enabling single issue via
	val &= ~DCR_SI;
outside of the respective block also makes a difference in this case.
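
For illustration only, the three variants discussed above can be sketched
as stand-alone C operating on a plain 64-bit value. This is not the code
from cheetah.c: the helper names are hypothetical, the MS and SI bit
positions are merely read off the "DispatchCtl: ...0009 MS SI" line of the
dump, and the ITPE/DTPE positions are assumed placeholders that should be
checked against the kernel's DCR definitions.

```c
#include <stdint.h>

/*
 * Bit positions. MS/SI are inferred from the dump ("...0009 MS SI");
 * ITPE/DTPE are ASSUMED placeholders, not taken from the kernel headers.
 */
#define DCR_MS		(1ULL << 0)
#define DCR_SI		(1ULL << 3)
#define DCR_ITPE	(1ULL << 13)	/* assumption */
#define DCR_DTPE	(1ULL << 14)	/* assumption */

/* Existing line quoted in the mail: clear DTPE only. */
static uint64_t
dcr_stock(uint64_t val)
{
	return (val & ~DCR_DTPE);
}

/* First experiment: clear both DTPE and ITPE. */
static uint64_t
dcr_clear_both(uint64_t val)
{
	return (val & ~(DCR_DTPE | DCR_ITPE));
}

/* Second experiment: clear SI, i.e. run the CPU in single-issue mode. */
static uint64_t
dcr_single_issue(uint64_t val)
{
	return (val & ~DCR_SI);
}
```

Under these assumptions, feeding the 0x9 value from the dump through
dcr_single_issue() leaves only the MS bit set (0x1), while the other two
helpers leave 0x9 unchanged because neither ITPE nor DTPE is set in it.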

Marius



