From owner-freebsd-smp Sat Apr 5 12:14:55 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.5/8.8.5) id MAA25294 for smp-outgoing; Sat, 5 Apr 1997 12:14:55 -0800 (PST)
Received: from Ilsa.StevesCafe.com (sc-gw.StevesCafe.com [205.168.119.191]) by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id MAA25285 for ; Sat, 5 Apr 1997 12:14:50 -0800 (PST)
Received: from localhost (localhost [127.0.0.1]) by Ilsa.StevesCafe.com (8.7.5/8.6.12) with SMTP id NAA10988; Sat, 5 Apr 1997 13:14:26 -0700 (MST)
Message-Id: <199704052014.NAA10988@Ilsa.StevesCafe.com>
X-Authentication-Warning: Ilsa.StevesCafe.com: Host localhost [127.0.0.1] didn't use HELO protocol
X-Mailer: exmh version 1.6.5 12/11/95
From: Steve Passe
To: Peter Wemm
cc: cr@jcmax.com (Cyrus Rahman), Poul-Henning Kamp , smp@freebsd.org
Subject: Re: Questions about mp_lock
In-reply-to: Your message of "Sun, 06 Apr 1997 01:04:44 +0800." <199704051704.BAA18422@spinner.DIALix.COM>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Sat, 05 Apr 1997 13:14:26 -0700
Sender: owner-smp@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Hi everyone,

It was great to see so many responses to the question; the list has been
somewhat inactive for a while...

Summary of the problem:

 code: 3-0.970209-SNAP, -current SMP src
       APIC_IO and all recommended options for same.

 symptom: heavily loaded system (i.e. lots of INTs happening) "freezes"

 reason: cpu0 is trying to service an INT and spins attempting to get
         the mp_lock, which evidently is permanently held by some
         process on cpu1. The lock count that is being held is usually
         2, but sometimes only 1.

 open question: is the other cpu running a process that is somehow
         dead-locked waiting for a resource, i.e. is the lock value of
         2 really true? OR is the lock count hosed, and the process on
         cpu1 not really holding the lock?
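
For anyone decoding the lock values discussed above, here is a minimal C sketch. The free value 0xffffffff and the 0x00ffffff count mask are the constants the patch in this message tests against. Reading the bits above the mask as the id of the owning cpu is an assumption, chosen because it matches "held on cpu1 with a count of 2" for a value of 0x1000002. The function names are hypothetical and are not taken from the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

#define MPLOCK_FREE       0xffffffffu /* value the patch compares against for "free" */
#define MPLOCK_COUNT_MASK 0x00ffffffu /* mask the release path applies to the count */

/* True when the lock word holds the "free" sentinel. */
static bool mplock_is_free(uint32_t word)
{
    return word == MPLOCK_FREE;
}

/* Nesting count stored in the low 24 bits of the lock word. */
static uint32_t mplock_count(uint32_t word)
{
    return word & MPLOCK_COUNT_MASK;
}

/* ASSUMPTION: the top byte names the cpu that holds the lock. */
static uint32_t mplock_owner(uint32_t word)
{
    return word >> 24;
}
```

Under that reading, a word of 0x1000002 would mean cpu1 holds the lock with a count of 2, and 0x1000001 would mean cpu1 with a count of 1.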

reproducing the problem:

Although I have never seen this before, I can easily reproduce it by
disabling the loprio code: change TEST_LOPRIO to TEST_LOPRIO_NOT in
smptests.h. The effect of this is to cause ALL INTs to be serviced by cpu0.

Apply this patch that Cyrus created:

------------------------------------- cut -------------------------------------
*** mplock.s.dist	Wed Dec  4 17:32:57 1996
--- mplock.s	Thu Apr  3 08:20:23 1997
***************
*** 71,79 ****
  	movl	%eax, APIC_TPR(%ecx)		/* set it */
  #endif /** TEST_LOPRIO */
  	ret
! 3:	cmpl	$0xffffffff, (%edx)		/* Wait for it to become free */
! 	jne	3b
! 	jmp	2b				/* XXX 1b ? */
  
  /***********************************************************************
   *  int MPtrylock(unsigned int *lock)
--- 71,88 ----
  	movl	%eax, APIC_TPR(%ecx)		/* set it */
  #endif /** TEST_LOPRIO */
  	ret
! 3:	movl	$2000000000, %eax		/* Timer */
! 4:	decl	%eax
! 	jnz	5f
! 	pushl	(%edx)
! 	pushl	$pstrin
! 	movl	$0xffffffff, (%edx)		/* Let the panic grab a cpu for ddb */
! 	call	_panic
! 5:	cmpl	$0xffffffff, (%edx)		/* Wait for it to become free */
! 	jne	4b
! 	jmp	1b				/* XXX 1b ? */
! 
! pstrin:	.asciz	"mplock: deadlock on %x"
  
  /***********************************************************************
   *  int MPtrylock(unsigned int *lock)
***************
*** 128,134 ****
  	ret
  1:	movl	4(%esp), %edx		/* Get the address of the lock */
  	movl	(%edx), %eax		/* - get the value */
! 	movl	%eax,%ecx
  	decl	%ecx			/* - new count is one less */
  	testl	$0x00ffffff, %ecx	/* - Unless it's zero... */
  	jnz	2f
--- 137,149 ----
  	ret
  1:	movl	4(%esp), %edx		/* Get the address of the lock */
  	movl	(%edx), %eax		/* - get the value */
! 
! 	cmpl	$0xffffffff, %eax	/* If it's free, we have a problem */
! 	jne	3f
! 	pushl	$rls_free
! 	call	_panic
! 
! 3:	movl	%eax,%ecx
  	decl	%ecx			/* - new count is one less */
  	testl	$0x00ffffff, %ecx	/* - Unless it's zero... */
  	jnz	2f
***************
*** 146,151 ****
--- 161,169 ----
  	cmpxchg	%ecx, (%edx)		/* - try it atomically */
  	jne	1b			/* ...do not collect $200 */
  	ret
+ 
+ rls_free:
+ 	.asciz	"mplock: releasing free lock"
  
  /***********************************************************************
   *  void get_mplock()
------------------------------------- cut -------------------------------------

Start a kernel build, then open a file for edit in another window or
otherwise busy the system. The machine locks, and the patch drops you out
after 30 seconds to several minutes; be patient. When it does you see
something like:

 'panic (cpu#0): mplock: deadlock on 1000001'
or
 'panic (cpu#0): mplock: deadlock on 1000002'

but mostly the latter.

By disabling the TEST_LOPRIO code we guarantee a high frequency of hits
where the cpu servicing the INT is NOT the one currently holding the lock.
The loprio code *ATTEMPTS* to steer the INT to the cpu holding the lock
(if any). BUT it will fail to do so a small percentage of the time, since
it isn't an atomic operation with regard to what's happening on the other
cpu(s). I didn't consider this to be fatal, just an inefficiency that we
could live with. However, that might not be the case....

With the loprio code in place this bug happens so seldom as to not affect
most systems, but it IS still lurking there on all APIC_IO systems, and we
need to find it!!!

Theories, testers, etc. all welcome!

--
Steve Passe	| powered by
smp@csn.net	| Symmetric MultiProcessor FreeBSD
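
For readers who find the assembly hard to follow, the bounded wait that the patch adds to the spin path can be sketched in C11 roughly as below. This is only an illustration under stated assumptions: the function name, the budget parameter and the use of <stdatomic.h> are hypothetical and not part of the kernel source. In the patch itself the budget is 2000000000 iterations, and on expiry it forces the lock word to free and calls panic with "mplock: deadlock on %x".

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MPLOCK_FREE 0xffffffffu /* value the patch compares against for "free" */

/*
 * Spin until the lock word reads free or the iteration budget is used up.
 * Returns true if the word became free (the caller would then retry the
 * acquire), false on timeout, with the last value seen stored in
 * *last_seen so the caller can report it.
 */
static bool mplock_wait_free(_Atomic uint32_t *lock, uint32_t budget,
                             uint32_t *last_seen)
{
    uint32_t word = atomic_load(lock);

    while (word != MPLOCK_FREE) {
        if (budget == 0) {
            *last_seen = word;   /* hypothetical: value to print in the panic */
            return false;
        }
        budget--;
        word = atomic_load(lock);
    }
    return true;
}
```

A false return corresponds to the 'deadlock on 1000002' panic described above: the word never read free within the budget, and the reported value is whatever the lock held at that moment.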