Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 03 Jul 2005 00:57:50 -0000
From:      John Baldwin <jhb@FreeBSD.org>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        cvs-src@FreeBSD.org, Nate Lawson <nate@root.org>, cvs-all@FreeBSD.org, src-committers@FreeBSD.org, Kris Kennaway <kris@obsecurity.org>
Subject:   Re: cvs commit: src/sys/i386/i386 vm_machdep.c
Message-ID:  <200412161449.23781.jhb@FreeBSD.org>
In-Reply-To: <20041216144239.T1723@epsplex.bde.org>
References:  <200411300618.iAU6IkQX065609@repoman.freebsd.org> <20041215151526.GA3462@xor.obsecurity.org> <20041216144239.T1723@epsplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wednesday 15 December 2004 10:51 pm, Bruce Evans wrote:
> On Wed, 15 Dec 2004, Kris Kennaway wrote:
> > On Tue, Dec 14, 2004 at 09:48:48PM -0500, John Baldwin wrote:
> > > On Tuesday 14 December 2004 07:10 pm, Kris Kennaway wrote:
> > > > NB: DDB often isn't usable on SMP machines thesedays, and will hang
> > > > when a panic tries to enter it.
> > >
> > > Try debug.kdb.stop_cpus=0 (sysctl and tunable) to prevent KDB from
> > > trying to stop the other CPUs.  Another possible fix that ups@ has
> > > talked about is changing IPI_STOP to use an NMI rather than a vector
> > > (you can send NMI IPIs via the local APIC) so that IPI_STOP is more
> > > reliable.
> >
> > This is already set, and it doesn't always fix the problem.
>
> debug.kdb.stop_cpus=0 should be expected to increase problems.  Given time,
> the other CPU are quite likely to enter ddb for whatever reason the first
> one did.  Then they stomp on ddb's global state (starting with ddb_regs).
>
> The NMI would need locking to prevent the CPUs stopping each other.
>
> > I often
> > get overlapping panics from the other CPUs on this machine, and it
> > often locks up when trying to enter DDB, or while printing the panic
> > string (the other day it only got as far as 'p' before hanging).
>
> panic() needs much the same locking as ddb to prevent concurrent entry.
> It must be fairly likely for all CPUs to panic on the same asertion.
> This is like all CPUs entering ddb on the same breakpoint.

The thing is, panic does have locking, but it appears to be ineffective:

#ifdef SMP
        /*
         * We don't want multiple CPU's to panic at the same time, so we
         * use panic_cpu as a simple spinlock.  We have to keep checking
         * panic_cpu if we are spinning in case the panic on the first
         * CPU is canceled.
         */
        if (panic_cpu != PCPU_GET(cpuid))
                while (atomic_cmpset_int(&panic_cpu, NOCPU,
                    PCPU_GET(cpuid)) == 0)
                        while (panic_cpu != NOCPU)
                                ; /* nothing */
#endif

In the smpng branch in p4, I have the lock changed to be based on the thread 
rather than the CPU to account for problems coming from migration due to 
preemption while in a panic, but I haven't observed any noticeable 
improvement from the change:

--- //depot/vendor/freebsd/src/sys/kern/kern_shutdown.c	2004/11/05 19:00:32
+++ //depot/projects/smpng/sys/kern/kern_shutdown.c	2004/11/05 19:22:55
@@ -473,7 +473,7 @@
 }
 
 #ifdef SMP
-static u_int panic_cpu = NOCPU;
+static struct thread *panic_thread = NULL;
 #endif
 
 /*
@@ -494,15 +494,14 @@
 #ifdef SMP
 	/*
 	 * We don't want multiple CPU's to panic at the same time, so we
-	 * use panic_cpu as a simple spinlock.  We have to keep checking
-	 * panic_cpu if we are spinning in case the panic on the first
+	 * use panic_thread as a simple spinlock.  We have to keep checking
+	 * panic_thread if we are spinning in case the panic on the first
 	 * CPU is canceled.
 	 */
-	if (panic_cpu != PCPU_GET(cpuid))
-		while (atomic_cmpset_int(&panic_cpu, NOCPU,
-		    PCPU_GET(cpuid)) == 0)
-			while (panic_cpu != NOCPU)
-				; /* nothing */
+	if (panic_thread != curthread)
+		while (atomic_cmpset_ptr(&panic_thread, NULL, curthread) == 0)
+			while (panic_thread != NULL)
+				cpu_spinwait();
 #endif
 
 	bootopt = RB_AUTOBOOT | RB_DUMP;
@@ -538,7 +537,7 @@
 	/* See if the user aborted the panic, in which case we continue. */
 	if (panicstr == NULL) {
 #ifdef SMP
-		atomic_store_rel_int(&panic_cpu, NOCPU);
+		atomic_store_rel_ptr(&panic_thread, NULL);
 #endif
 		return;
 	}

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200412161449.23781.jhb>