Message-ID: <FE045D4D9F7AED4CBFF1B3B813C85337051D8F5C@mail.sandvine.com>
From: Don Bowman <don@sandvine.com>
To: 'Bruce Evans' <bde@zeta.org.au>, Don Bowman <don@sandvine.com>
Date: Thu, 10 Jun 2004 10:11:39 -0400
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2657.72)
Content-Type: text/plain;
	charset="iso-8859-1"
cc: "'current@freebsd.org'" <current@FreeBSD.org>
Subject: RE: kernel trap 19 with interrupts disabled

From: Bruce Evans [mailto:bde@zeta.org.au]
> ... NMI, output but no debugger, hang, patch to workaround ...

I have applied the patch, and will await the next hang.

Out of curiosity, why not use something like this, so the
timeout is a fixed length of time rather than a loop count?
I used the TSC here.

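static int
my_stop_cpus(u_int map)
{
    /* Deadline: one second from now, measured in TSC ticks. */
    unsigned long long end_ts = rdtsc() +
                                1ULL * tsc_freq;

    /* Send the Xcpustop IPI to all CPUs in map. */
    selected_apic_ipi(map, XCPUSTOP_OFFSET, APIC_DELMODE_FIXED);

    /* Spin until every CPU in map has stopped, or give up at the deadline. */
    while ((stopped_cpus & map) != map)
    {
       if ( rdtsc() > end_ts )
           return 0;    /* timed out */
    }
    return 1;           /* all CPUs in map stopped */
}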
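
For anyone who wants to try the idea outside the kernel, here is a
minimal user-space sketch of the same deadline loop. The names
spin_until_set, tick_fn and fake_now are made up for illustration, and
the clock is passed in as a function pointer so that a fake counter can
stand in for rdtsc(). It compares elapsed ticks rather than absolute
values, so it should still behave if the counter wraps.

```c
#include <stdint.h>

/* Hypothetical tick source; in the kernel this would be rdtsc(). */
typedef uint64_t (*tick_fn)(void);

/* Fake clock for experiments: advances by one tick on every read. */
static uint64_t fake_ticks;

static uint64_t
fake_now(void)
{
    return fake_ticks++;
}

/*
 * Spin until every bit of `map` is set in *flags, or until more than
 * `timeout_ticks` ticks have elapsed.
 * Returns 1 on success, 0 on timeout.
 */
static int
spin_until_set(volatile unsigned int *flags, unsigned int map,
    tick_fn now, uint64_t timeout_ticks)
{
    uint64_t start = now();

    while ((*flags & map) != map) {
        /* Unsigned subtraction gives elapsed ticks even across a wrap. */
        if (now() - start > timeout_ticks)
            return 0;
    }
    return 1;
}
```

With the fake clock, a call such as spin_until_set(&flags, 0x3, fake_now, 10)
returns 1 at once if both bits are already set, and returns 0 after the
fake counter has moved more than 10 ticks if they never are.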

Has anyone else been observing system hangs on SMP Xeon
(P4-based Xeon) systems? I have been observing this
for more than a year with 4.7. We came up with a workaround:
a periodic NMI driven from the perfmon registers, whose
handler checks that hardclock is still advancing.
What we found is that hardclock would stop.
I was hoping it was a race condition in the stable
kernel, but now that I see what is most likely the
same issue on current, I'm starting to wonder. I have
a dual P3 system which has never experienced this problem.
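
A check of that kind could look roughly like the sketch below. This is
an illustration, not the code we run: the struct, the function name and
the limit field are assumptions, and a real handler would read the
kernel's tick counter and be driven by the perfmon NMI rather than
called by hand.

```c
#include <stdint.h>

/* Hypothetical watchdog state; all names here are illustrative. */
struct tick_watchdog {
    uint64_t     last_ticks; /* tick count seen at the previous check */
    unsigned int stalled;    /* consecutive checks with no progress */
    unsigned int limit;      /* stalled checks allowed before reporting a hang */
};

/*
 * Intended to be called from the periodic NMI with the current tick count.
 * Returns 1 once the count has failed to advance for `limit` consecutive
 * checks, 0 otherwise.
 */
static int
tick_watchdog_check(struct tick_watchdog *wd, uint64_t cur_ticks)
{
    if (cur_ticks != wd->last_ticks) {
        /* The clock moved: remember the new value and reset the count. */
        wd->last_ticks = cur_ticks;
        wd->stalled = 0;
        return 0;
    }
    if (++wd->stalled >= wd->limit)
        return 1;
    return 0;
}
```

Requiring several stalled checks in a row, instead of one, is meant to
avoid false alarms when the NMI happens to fire more often than the
clock ticks.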

--don