Date:      Sat, 22 Oct 2011 09:17:05 +1100
From:      Peter Jeremy <peterjeremy@acm.org>
To:        Marius Strobl <marius@alchemy.franken.de>
Cc:        freebsd-sparc64@freebsd.org
Subject:   Re: 'make -j16 universe' gives SIReset
Message-ID:  <20111021221705.GD45938@server.vk2pj.dyndns.org>
In-Reply-To: <20111018172718.GT39118@alchemy.franken.de>
References:  <20110830152725.GA28552@alchemy.franken.de> <20110831212458.GA25926@server.vk2pj.dyndns.org> <20110902153206.GR40781@alchemy.franken.de> <20111006120411.GA903@alchemy.franken.de> <20111011030529.GA4093@server.vk2pj.dyndns.org> <20111011205543.GA81376@alchemy.franken.de> <20111013035648.GA54190@server.vk2pj.dyndns.org> <20111013184224.GG39118@alchemy.franken.de> <20111018042646.GA18863@server.vk2pj.dyndns.org> <20111018172718.GT39118@alchemy.franken.de>



On 2011-Oct-18 19:27:18 +0200, Marius Strobl <marius@alchemy.franken.de> wrote:
>On Tue, Oct 18, 2011 at 03:26:46PM +1100, Peter Jeremy wrote:
>> On 2011-Oct-13 20:42:25 +0200, Marius Strobl <marius@alchemy.franken.de> wrote:
>> >On Thu, Oct 13, 2011 at 02:56:48PM +1100, Peter Jeremy wrote:
>> >> Unfortunately, I can't get a crashdump because dumpon(8) doesn't like
>> >> my Solaris swap partitions:
>> >> GEOM_PART: Partition 'da0b' not suitable for kernel dumps (wrong type?)
>> >> GEOM_PART: Partition 'da6b' not suitable for kernel dumps (wrong type?)
>> >> No suitable dump device was found.
>> >>
>> >> I did write a patch for that but took it out during some earlier
>> >> testing to get back to stock code.  It looks like I didn't PR it
>> >> either so I will do that when I get some time.
>>
>> I've resurrected that patch (and will send-pr it later).

Thanks for committing it.

>Hrm, AFAICT this would mean that the _mtx_obtain_lock(), which boils
>down to an atomic_cmpset_acq_ptr(), in _mtx_trylock() didn't work as
>expected. I currently can't think of a good reason why that could
>happen though. The assembly generated for that code also looks just
>fine. Have you run the workload which is triggering this before? It
>would be interesting to know whether it also happens with SCHED_4BSD
>with current sources, pre-r226054 and pre-r225889 if the machine
>previously survived that load.
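
For readers following the thread, here is a minimal sketch of a trylock built on a single compare-and-set with acquire semantics, which is the general pattern the quoted paragraph describes. It is written in portable C11 rather than with the kernel's own macros, and it is not the FreeBSD _mtx_trylock() code. The names toy_mtx, toy_mtx_trylock, toy_mtx_unlock and TOY_UNOWNED are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "nobody holds the lock" value; not the kernel's constant. */
#define TOY_UNOWNED ((uintptr_t)0)

struct toy_mtx {
	_Atomic uintptr_t owner;	/* TOY_UNOWNED, or the holder's id */
};

/*
 * Try to take the lock without blocking.  The compare-and-set stores
 * `tid` only if the owner word is still TOY_UNOWNED, so at most one
 * caller can win.  Acquire ordering on success keeps the critical
 * section from being reordered before the lock is taken.
 */
static bool
toy_mtx_trylock(struct toy_mtx *m, uintptr_t tid)
{
	uintptr_t expected = TOY_UNOWNED;

	return atomic_compare_exchange_strong_explicit(&m->owner, &expected,
	    tid, memory_order_acquire, memory_order_relaxed);
}

/* Release the lock; only the current holder may call this. */
static void
toy_mtx_unlock(struct toy_mtx *m, uintptr_t tid)
{
	assert(atomic_load_explicit(&m->owner, memory_order_relaxed) == tid);
	atomic_store_explicit(&m->owner, TOY_UNOWNED, memory_order_release);
}
```

In this sketch a second toy_mtx_trylock() on a held lock must fail. Two threads both believing they hold the same mutex would mean the compare-and-set itself misbehaved, which is the concern raised in the quoted text.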

It was running 6 parallel -j16 buildworlds.  I switched to SCHED_4BSD
and haven't been able to reproduce it - even with a pile of added
"sysctl sysctl vm.vmtotal".  I haven't tried rolling back to an
earlier kernel.

>Have you enabled PREEMPTION by chance?

That was using GENERIC and only changing the scheduler.

>The other thing that worries me is that it could be a silicon bug,
>especially since that machine also has that issue of issuing stale
>vector interrupts along with a state in which it traps even on
>locked TLB entries, which isn't mentioned in the public erratum ...

I've had a rummage around in the OpenSolaris sources and nothing
jumps out at me.  (Actually, I can't find any special case code
that looks like it addresses silicon bugs in Jaguar).

One other thing is that I'm getting lots of isp watchdog timeouts:
(da4:isp0:0:4:0): first watchdog (handle 0x5cf020f3) timed out- deferring for grace period
(da4:isp0:0:4:0): first watchdog (handle 0x5cf1206d) timed out- deferring for grace period
(da4:isp0:0:4:0): first watchdog (handle 0x5cf2203a) timed out- deferring for grace period
isp0: isp_watchdog: timeout for handle 0x5cad2046
(da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xdd 0xe8 0xe0 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
isp0: bad request handle 0x5cad2046 (iocb type 0x3)
isp0: isp_watchdog: timeout for handle 0x5cdb20cb
(da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xe3 0xa8 0x00 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
isp0: isp_watchdog: timeout for handle 0x5cdc2059
(da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xe3 0xa8 0x20 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
isp0: isp_watchdog: timeout for handle 0x5cdd2020
(da4:isp0:0:4:0): FIN dl16384 resid 0 CDB=0x2a 0x00 0x0f 0xe3 0xa8 0x40 0x00 0x00 0x20 0x00  STS 0x0 XS_ERR=0xb
isp0: bad request handle 0x5cdb20cb (iocb type 0x3)
isp0: bad request handle 0x5cdc2059 (iocb type 0x3)
isp0: bad request handle 0x5cdd2020 (iocb type 0x3)
(da4:isp0:0:4:0): first watchdog (handle 0x6b9520bb) timed out- deferring for grace period
(da4:isp0:0:4:0): first watchdog (handle 0x6b96200e) timed out- deferring for grace period

Any ideas on that?
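
As an aid to reading the FIN lines in that log, the sketch below decodes the 10-byte CDB they print. Opcode 0x2a is the SCSI WRITE(10) command, with a big-endian logical block address in bytes 2-5 and a transfer length in blocks in bytes 7-8. The struct and function names are hypothetical, and the 512-byte sector size mentioned afterwards is an assumption, not something the log states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the fields of interest in a WRITE(10) CDB. */
struct write10 {
	uint32_t lba;		/* starting logical block address */
	uint16_t nblocks;	/* transfer length in blocks */
};

/* Decode a 10-byte WRITE(10) CDB as printed after "CDB=" in the log. */
static struct write10
decode_write10(const uint8_t cdb[10])
{
	struct write10 w;

	assert(cdb[0] == 0x2a);	/* WRITE(10) opcode */
	/* Bytes 2..5: logical block address, most significant byte first. */
	w.lba = (uint32_t)cdb[2] << 24 | (uint32_t)cdb[3] << 16 |
	    (uint32_t)cdb[4] << 8 | (uint32_t)cdb[5];
	/* Bytes 7..8: number of blocks to transfer, big-endian. */
	w.nblocks = (uint16_t)(cdb[7] << 8 | cdb[8]);
	return w;
}
```

Fed the first CDB from the log, this yields LBA 0x0fdde8e0 and a length of 32 blocks. With 512-byte sectors that is 16384 bytes, which matches the "dl16384" field. The three later CDBs decode to LBAs 0x0fe3a800, 0x0fe3a820 and 0x0fe3a840, that is, three consecutive 32-block writes.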

-- 
Peter Jeremy
