Date:      Sat, 19 Apr 97 09:20:39 -0400
From:      cr@jcmax.com (Cyrus Rahman)
To:        smp@csn.net, smp@freebsd.org
Subject:   SMP kernel deadlocks
Message-ID:  <9704191320.AA18511@corona.jcmax.com>

I've previously described a situation in which the FreeBSD SMP kernel
appeared to deadlock under heavy load.  I finally got another chunk of time
to look into the problem.

****

Problem summary (in Steve's words):

Summary of the problem:

 code:
        3-0.970209-SNAP, -current SMP src
        APIC_IO and all recommended options for same.

 symptom:
        heavily loaded system (ie lots of INTs happening) "freezes"
 
 reason:
        cpu0 is trying to service an INT, spin-locks attempting to get the
        mp_lock, which evidently is permanently held by some process on cpu1.
        the lock count that is being held is usually 2, but sometimes only 1.

reproducing the problem:
        although I have never seen this before, I can easily reproduce it
        by disabling the loprio code by changing TEST_LOPRIO to TEST_LOPRIO_NOT
        in smptests.h.  The effect of this is to cause ALL INTs to be serviced
        by cpu0.


****

At the time there was some question about whether there was a true deadlock.
As it turns out, there is.

The trouble arises when a page fault is taken on one processor and, during a
critical interval while that fault is being serviced, an interrupt arrives
on the other processor.  Defining TEST_LOPRIO decreases the frequency
with which this happens, but does not eliminate the problem.

The details:

	During the page fault, at some point smp_invltlb() generally gets
	called to flush the TLBs on the other CPUs.  smp_invltlb() calls
	allButSelfIPI() and sends an IPI to the other processor, which,
	unfortunately, is sometimes already processing an interrupt of a
	higher priority.  That interrupt routine is spinning, trying to
	obtain the mp_lock spin lock so it can enter the kernel, but the
	processor which holds this lock is itself in a spin loop in
	apicIPI(), waiting for the IPI to be delivered.
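
To make the cycle explicit, here is a small user-space model of the two
spin loops described above.  It is only a sketch and not kernel code: the
struct, the enum and the function names (cpu, wait_state, can_progress,
is_deadlocked, email_scenario) are all made up for illustration, and the
rule that an IPI cannot be accepted while the target is servicing a
higher-priority interrupt is taken from the description above rather than
from the APIC documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of what each CPU is doing; not FreeBSD code. */
enum wait_state {
    RUNNING,            /* making forward progress */
    SPIN_MP_LOCK,       /* interrupt handler spinning to obtain mp_lock */
    SPIN_IPI_DELIVERY   /* spinning until the target accepts the IPI */
};

struct cpu {
    enum wait_state state;
    bool holds_mp_lock;       /* this CPU currently owns mp_lock */
    bool in_high_prio_intr;   /* servicing an INT that outranks the IPI */
};

/* Can `self` leave its spin loop, given what `other` is doing? */
static bool can_progress(const struct cpu *self, const struct cpu *other)
{
    assert(self != NULL && other != NULL);
    switch (self->state) {
    case RUNNING:
        return true;
    case SPIN_MP_LOCK:
        /* The lock is obtainable only if the other CPU does not hold it. */
        return self->holds_mp_lock || !other->holds_mp_lock;
    case SPIN_IPI_DELIVERY:
        /* Assumed rule: the IPI is not accepted while the target is
         * still inside a higher-priority interrupt. */
        return !other->in_high_prio_intr;
    }
    return false;
}

/* Deadlock in this model: neither CPU can leave its spin loop. */
static bool is_deadlocked(const struct cpu cpus[2])
{
    return !can_progress(&cpus[0], &cpus[1]) &&
           !can_progress(&cpus[1], &cpus[0]);
}

/* Fill in the situation from this message: cpu0 took an interrupt and
 * spins for mp_lock; cpu1 holds mp_lock inside a page fault and spins
 * waiting for its TLB-flush IPI to be delivered. */
static void email_scenario(struct cpu cpus[2])
{
    cpus[0].state = SPIN_MP_LOCK;
    cpus[0].holds_mp_lock = false;
    cpus[0].in_high_prio_intr = true;

    cpus[1].state = SPIN_IPI_DELIVERY;
    cpus[1].holds_mp_lock = true;
    cpus[1].in_high_prio_intr = false;
}
```

Calling email_scenario() and then is_deadlocked() reports a deadlock in
this model.  Clearing in_high_prio_intr on cpu0 lets cpu1's IPI spin
finish, so the model no longer reports one.  The model says nothing about
which real fix is appropriate.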


Clearly the solution we originally considered, routing the stalled interrupt
to the processor with the mp_lock, isn't going to work here.  I haven't
had time to think through any of the other ways to get around the problem
(and since I need to be in Baltimore in a few hours I probably shouldn't
start), but I'd be very interested in any ideas.

Cyrus


