From: Don Bowman
To: 'Bruce Evans', Don Bowman
cc: 'current@freebsd.org'
Date: Thu, 10 Jun 2004 10:11:39 -0400
Subject: RE: kernel trap 19 with interrupts disabled

From: Bruce Evans [mailto:bde@zeta.org.au]
> ... NMI, output but no debugger, hang, patch to workaround ...

I have applied the patch, and will await the next hang.

Out of curiosity, why not use something like this, so the timeout is
fixed in time, rather than a count? I used the TSC here.

static int
my_stop_cpus(u_int map)
{
    /* deadline: one second's worth of TSC ticks from now */
    unsigned long long end_ts = rdtsc() + 1ULL * tsc_freq;

    /* send the Xcpustop IPI to all CPUs in map */
    selected_apic_ipi(map, XCPUSTOP_OFFSET, APIC_DELMODE_FIXED);

    while ((stopped_cpus & map) != map) {
        /* give up after 1 second */
        if (rdtsc() > end_ts)
            return 0;
    }
    return 1;
}

Has anyone else been observing system hangs with SMP Xeon (P4-based
Xeon)? I have been observing this for more than a year with 4.7.
We came up with a workaround: a periodic NMI driven from the perfmon
registers checks that hardclock is still incrementing. The problem we
found is that hardclock would stop.

I was hoping it was a race condition in the stable kernel, but now that
I see what is most likely the same issue on current, I'm starting to
wonder. I have a dual P3 system which has never experienced this problem.

--don
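
P.S. For anyone wanting to try the same idea, the hardclock check above can be sketched roughly as follows. This is a user-space model only, not the code we run: the names hc_watchdog, hc_watchdog_init and hc_watchdog_check are hypothetical, the tick counter is passed in by the caller rather than read from the kernel, and the number of stalled checks to tolerate is an assumed parameter. In the real setup the check would be called from the perfmon NMI handler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical watchdog state; a model of the idea, not kernel code. */
struct hc_watchdog {
    uint64_t last_ticks;  /* tick count seen at the previous check */
    unsigned stalled;     /* consecutive checks with no progress */
    unsigned limit;       /* stalled checks tolerated before reporting a hang */
};

static void
hc_watchdog_init(struct hc_watchdog *wd, uint64_t ticks_now, unsigned limit)
{
    assert(limit > 0);
    wd->last_ticks = ticks_now;
    wd->stalled = 0;
    wd->limit = limit;
}

/*
 * Run once per periodic check with the current tick count.
 * Returns 1 once the count has failed to advance for `limit`
 * consecutive checks, and 0 otherwise.
 */
static int
hc_watchdog_check(struct hc_watchdog *wd, uint64_t ticks_now)
{
    if (ticks_now != wd->last_ticks) {
        /* hardclock made progress: remember it and reset the stall count */
        wd->last_ticks = ticks_now;
        wd->stalled = 0;
        return 0;
    }
    if (wd->stalled < wd->limit)
        wd->stalled++;
    return wd->stalled >= wd->limit;
}
```

Requiring several stalled checks in a row, rather than one, is an assumption on my part to avoid a false alarm when the check happens to run twice within a single clock tick.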