Date: Sat, 19 Apr 97 09:20:39 -0400
From: cr@jcmax.com (Cyrus Rahman)
To: smp@csn.net, smp@freebsd.org
Subject: SMP kernel deadlocks
Message-ID: <9704191320.AA18511@corona.jcmax.com>

I've previously described a situation in which the FreeBSD SMP kernel
appeared to deadlock under heavy load. I finally got another chunk of
time to look into the problem.

****
Problem summary (in Steve's words):

Summary of the problem:

 code:
  3-0.970209-SNAP, -current SMP src
  APIC_IO and all recommended options for same.

 symptom:
  heavily loaded system (ie lots of INTs happening) "freezes"

 reason:
  cpu0 is trying to service an INT, spin-locks attempting to get the
  mp_lock, which evidently is permanently held by some process on cpu1.
  the lock count that is being held is usually 2, but sometimes only 1.

 reproducing the problem:
  although I have never seen this before, I can easily reproduce it by
  disabling the loprio code by changing TEST_LOPRIO to TEST_LOPRIO_NOT
  in smptests.h. The effect of this is to cause ALL INTs to be serviced
  by cpu0.
****

At the time there was some question about whether there was a true
deadlock. As it turns out, there is.

The trouble occurs when a page fault occurs on one processor and, during
a critical interval while that page fault is being serviced, an interrupt
occurs on the other processor. Defining TEST_LOPRIO decreases the
frequency with which this happens, but does not eliminate the problem.

The details:

During the page fault, it generally happens that at some point
smp_invltlb() gets called to flush the TLB on the other CPUs.
smp_invltlb() calls allButSelfIPI() and sends an IPI to the other
processor, which, unfortunately, is sometimes already processing an
interrupt of a higher priority. That interrupt routine now spends its
time trying to obtain the mp_lock spin lock so it can enter the kernel,
but the processor which holds this lock is itself in a spin loop in
apicIPI(), waiting for the IPI to be delivered. Neither processor can
make progress until the other does.

Clearly the solution we originally considered, routing the stalled
interrupt to the processor with the mp_lock, isn't going to work here.

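
The circular wait described above can be modelled outside the kernel.
What follows is a minimal user-space sketch in C, not FreeBSD kernel
code: struct cpu_state, the wait_reason values, in_wait_cycle() and
build_reported_scenario() are all hypothetical names invented for
illustration. The model records only which CPU each spinning CPU
depends on, then checks whether that dependency chain leads back to
where it started.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPU   2
#define NOBODY (-1)

/* Hypothetical labels for what a modelled CPU is doing. */
enum wait_reason {
    RUNNING,            /* making progress, can accept an IPI      */
    SPIN_MP_LOCK,       /* in an interrupt handler, wants mp_lock  */
    SPIN_IPI_DELIVERY   /* holds mp_lock, waits for IPI delivery   */
};

struct cpu_state {
    enum wait_reason reason;
    int waits_for;      /* CPU that must progress first, or NOBODY */
};

/*
 * Follow the waits_for chain from 'start'. Return true if it leads
 * back to 'start', i.e. the CPU is part of a circular wait.
 */
static bool in_wait_cycle(const struct cpu_state *cpus, size_t ncpu,
                          int start)
{
    assert(start >= 0 && (size_t)start < ncpu);

    int cur = cpus[start].waits_for;
    for (size_t steps = 0; steps < ncpu && cur != NOBODY; steps++) {
        if (cur == start)
            return true;
        cur = cpus[cur].waits_for;
    }
    return false;
}

/*
 * The situation from this report, as the model sees it:
 *   cpu0: servicing a higher-priority interrupt, spinning for mp_lock
 *         (held by cpu1).
 *   cpu1: in the page fault path, holding mp_lock, spinning until
 *         its IPI is delivered to cpu0.
 */
static void build_reported_scenario(struct cpu_state cpus[NCPU])
{
    cpus[0] = (struct cpu_state){ SPIN_MP_LOCK, 1 };
    cpus[1] = (struct cpu_state){ SPIN_IPI_DELIVERY, 0 };
}
```

With the scenario built by build_reported_scenario(), in_wait_cycle()
returns true for both CPUs. If cpu0 is instead marked RUNNING with
waits_for set to NOBODY, the chain from cpu1 ends there and no cycle
is reported.
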
I haven't had time to think through any of the other ways to get around
the problem (and since I need to be in Baltimore in a few hours I
probably shouldn't start), but I'd be very interested in any ideas.

Cyrus