Date:      Sat, 05 Apr 1997 13:14:26 -0700
From:      Steve Passe <smp@csn.net>
To:        Peter Wemm <peter@spinner.dialix.com>
Cc:        cr@jcmax.com (Cyrus Rahman), Poul-Henning Kamp <phk@critter.dk.tfs.com>, smp@freebsd.org
Subject:   Re: Questions about mp_lock 
Message-ID:  <199704052014.NAA10988@Ilsa.StevesCafe.com>
In-Reply-To: Your message of "Sun, 06 Apr 1997 01:04:44 +0800." <199704051704.BAA18422@spinner.DIALix.COM>

Hi everyone,

It was great to see so many responses to the question; the list has been
somewhat inactive for a while...

Summary of the problem:

 code:
	3-0.970209-SNAP, -current SMP src
	APIC_IO and all recommended options for same.

 symptom:
	heavily loaded system (i.e. lots of INTs happening) "freezes"

 reason:
	cpu0 is trying to service an INT and spins attempting to get the
	mp_lock, which evidently is permanently held by some process on cpu1.
	the lock count being held is usually 2, but sometimes only 1.

 open question:
	is the other cpu running a process that is somehow dead-locked waiting
	for a resource, i.e. is the lock value of 2 really true?
  OR
	is the lock count hosed, and the process on cpu1 not really holding the
	lock?

 reproducing the problem:
	although I have never seen this before, I can easily reproduce it
	by disabling the loprio code: change TEST_LOPRIO to TEST_LOPRIO_NOT
	in smptests.h.  The effect of this is to cause ALL INTs to be serviced
	by cpu0.  Then apply this patch that Cyrus created:

------------------------------------- cut -------------------------------------
*** mplock.s.dist	Wed Dec  4 17:32:57 1996
--- mplock.s	Thu Apr  3 08:20:23 1997
***************
*** 71,79 ****
  	movl	%eax, APIC_TPR(%ecx)	/* set it */
  #endif /** TEST_LOPRIO */
  	ret
! 3:	cmpl	$0xffffffff, (%edx)	/* Wait for it to become free */
! 	jne	3b
! 	jmp	2b			/* XXX 1b ? */
  
  /***********************************************************************
   *  int MPtrylock(unsigned int *lock)
--- 71,88 ----
  	movl	%eax, APIC_TPR(%ecx)	/* set it */
  #endif /** TEST_LOPRIO */
  	ret
! 3:	movl	$2000000000, %eax	/* Timer */
! 4:	decl	%eax
! 	jnz	5f	
! 	pushl	(%edx)
! 	pushl	$pstrin
! 	movl	$0xffffffff, (%edx)	/* Let the panic grab a cpu for ddb */
! 	call	_panic
! 5:	cmpl	$0xffffffff, (%edx)	/* Wait for it to become free */
! 	jne	4b
! 	jmp	1b			/* XXX 1b ? */
! 
! pstrin:	.asciz	"mplock: deadlock on %x"
  
  /***********************************************************************
   *  int MPtrylock(unsigned int *lock)
***************
*** 128,134 ****
  	ret
  1:	movl	4(%esp), %edx		/* Get the address of the lock */
    	movl	(%edx), %eax		/* - get the value */
! 	movl	%eax,%ecx
  	decl	%ecx			/* - new count is one less */
  	testl	$0x00ffffff, %ecx	/* - Unless it's zero... */
  	jnz	2f
--- 137,149 ----
  	ret
  1:	movl	4(%esp), %edx		/* Get the address of the lock */
    	movl	(%edx), %eax		/* - get the value */
! 
! 	cmpl	$0xffffffff, %eax	/* If it's free, we have a problem */
! 	jne	3f
! 	pushl	$rls_free
! 	call	_panic
! 
! 3:	movl	%eax,%ecx
  	decl	%ecx			/* - new count is one less */
  	testl	$0x00ffffff, %ecx	/* - Unless it's zero... */
  	jnz	2f
***************
*** 146,151 ****
--- 161,169 ----
  	cmpxchg	%ecx, (%edx)		/* - try it atomically */
  	jne	1b			/* ...do not collect $200 */
  	ret
+ 
+ rls_free:
+ 	.asciz	"mplock: releasing free lock"
  
  /***********************************************************************
   *  void get_mplock()
------------------------------------- cut -------------------------------------

Start a kernel build, then open a file for editing in another window or
otherwise busy the system.  The machine locks up, and the patch drops you out
after 30 seconds to several minutes; be patient.  When it does, you see
something like:

'panic (cpu#0): mplock: deadlock on 1000001' or
'panic (cpu#0): mplock: deadlock on 1000002', but mostly the latter.
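
For anyone reading those panic values, here is a small C sketch of how the
lock word appears to break down.  The free value (0xffffffff) and the 24-bit
count mask (0x00ffffff) come straight from the patch above; putting the owning
cpu's id in the top byte is an ASSUMPTION inferred from the values printed, and
the names mp_decode, mp_encode and mp_describe are made up for illustration,
they are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MP_FREE       0xffffffffu  /* "free" value tested in the patch */
#define MP_COUNT_MASK 0x00ffffffu  /* count mask used by testl in the patch */
#define MP_CPU_SHIFT  24           /* ASSUMPTION: owner id lives in the top byte */

/* Hypothetical decoded view of a lock word. */
struct mp_lock_view {
    int      is_free;  /* nonzero when the word equals MP_FREE */
    uint32_t cpu;      /* owner id (assumed top byte) */
    uint32_t count;    /* recursion count (low 24 bits) */
};

/* Split a raw lock word into owner and count. */
static struct mp_lock_view mp_decode(uint32_t value)
{
    struct mp_lock_view v = { 0, 0, 0 };

    if (value == MP_FREE) {
        v.is_free = 1;
        return v;
    }
    v.cpu = value >> MP_CPU_SHIFT;
    v.count = value & MP_COUNT_MASK;
    return v;
}

/* Build a lock word from owner and count (inverse of mp_decode). */
static uint32_t mp_encode(uint32_t cpu, uint32_t count)
{
    assert(cpu < 0xffu);                          /* 0xff would collide with MP_FREE */
    assert(count >= 1 && count <= MP_COUNT_MASK); /* a held lock has count >= 1 */
    return (cpu << MP_CPU_SHIFT) | count;
}

/* Print a one-line description of a lock word. */
static void mp_describe(uint32_t value)
{
    struct mp_lock_view v = mp_decode(value);

    if (v.is_free)
        printf("%08x: free\n", (unsigned)value);
    else
        printf("%08x: held by cpu %u, count %u\n",
               (unsigned)value, (unsigned)v.cpu, (unsigned)v.count);
}
```

Under that assumed layout, mp_decode(0x1000002) gives cpu 1 with a count of 2,
which matches the "held by some process on cpu1" reading in the summary.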

By disabling the TEST_LOPRIO code we guarantee a high frequency of hits
where the cpu servicing the INT is NOT the one currently holding the lock.
The loprio code *ATTEMPTS* to steer the INT to the cpu holding the lock
(if any), BUT it will fail to do so a small percentage of the time, since it
isn't an atomic operation with regard to what's happening on the other cpu(s).
I didn't consider this to be fatal, just an inefficiency that we could live
with.  However, that might not be the case....  With the loprio code in place
this bug happens so seldom as to not affect most systems, but it IS still
lurking there on all APIC_IO systems; we need to find it!!!
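
To make the failure mode easier to reason about, here is a user-space C11 model
of a recursive lock word of this kind.  It is only a sketch, NOT the code in
mplock.s.  The free value and count mask are taken from the patch; the owner id
in the top byte is an ASSUMPTION, and mp_try_acquire, mp_release and
mp_acquire_bounded are hypothetical names.  The bounded spin plays the role of
the countdown timer in Cyrus's patch, except that it returns the observed lock
word instead of calling panic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MP_FREE       0xffffffffu  /* "free" value tested in the patch */
#define MP_COUNT_MASK 0x00ffffffu  /* count mask used by testl in the patch */
#define MP_CPU_SHIFT  24           /* ASSUMPTION: owner id lives in the top byte */

/* One acquisition attempt; true if `cpu` now holds the lock (count overflow
 * past 24 bits is not handled in this sketch). */
static bool mp_try_acquire(_Atomic uint32_t *lock, uint32_t cpu)
{
    uint32_t cur = atomic_load(lock);
    uint32_t next;

    assert(cpu < 0xffu);                    /* 0xff would collide with MP_FREE */
    if (cur == MP_FREE)
        next = (cpu << MP_CPU_SHIFT) | 1u;  /* first acquisition, count = 1 */
    else if ((cur >> MP_CPU_SHIFT) == cpu)
        next = cur + 1u;                    /* recursive acquisition */
    else
        return false;                       /* held by another cpu */
    return atomic_compare_exchange_strong(lock, &cur, next);
}

/* Drop one level; false if the lock is free or owned by another cpu
 * (the patched kernel code panics on the "free" case). */
static bool mp_release(_Atomic uint32_t *lock, uint32_t cpu)
{
    uint32_t cur = atomic_load(lock);
    uint32_t next;

    if (cur == MP_FREE || (cur >> MP_CPU_SHIFT) != cpu)
        return false;
    next = cur - 1u;
    if ((next & MP_COUNT_MASK) == 0)
        next = MP_FREE;                     /* last level released */
    return atomic_compare_exchange_strong(lock, &cur, next);
}

/* Spin up to max_spins attempts; on timeout store the last observed lock
 * word in *seen and return false. */
static bool mp_acquire_bounded(_Atomic uint32_t *lock, uint32_t cpu,
                               uint32_t max_spins, uint32_t *seen)
{
    for (uint32_t i = 0; i < max_spins; i++) {
        if (mp_try_acquire(lock, cpu))
            return true;
    }
    *seen = atomic_load(lock);
    return false;
}
```

In this model, if cpu 1 acquires twice and never releases, a bounded acquire by
cpu 0 times out and reports 0x01000002, the same shape as the panic value seen
most often.  The model cannot say whether the real count is genuine or hosed;
it only shows what a genuine count of 2 would look like.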

theories, testers, etc. all welcome!

--
Steve Passe	| powered by 
smp@csn.net	|            Symmetric MultiProcessor FreeBSD



