From: Terry Lambert
Message-Id: <199906280503.WAA26806@usr04.primenet.com>
Subject: Re: high-efficiency SMP locks - submission for review
To: dillon@apollo.backplane.com (Matthew Dillon)
Date: Mon, 28 Jun 1999 05:03:54 +0000 (GMT)
Cc: freebsd-smp@FreeBSD.ORG
In-Reply-To: <199906272024.NAA15634@apollo.backplane.com> from "Matthew Dillon" at Jun 27, 99 01:24:01 pm

> I would like to know what the SMP gurus think of this code.  I have
> not posted the complete patch, but have included the meat of the code.
> I am interested in comments relating to the implementation & performance
> aspects of the code.  I would also appreciate it if an inline assembly
> expert could look at my inline assembly.  I have tested it well and
> I believe it to be correct, but I don't write much inline assembly so...
>
> This code is designed to implement extremely low overhead SMP-compatible
> locks.  It will probably be used instead of lockmgr() locks for the
> buffer cache, and I hope to use it as a basis to replace lockmgr() locks
> in other modules later on.  The same structure can be used to implement
> both spin locks and normal blocking locks, and can handle shared,
> exclusive, and recursive-exclusive lock types.  The critical path has
> been optimized down to 20 instructions or so (compared to the hundreds
> in lockmgr()).
>
> As many of you may already know, lockmgr() locks are extremely expensive
> and require interrupt synchronization for their internal simplelocks to
> operate properly in a mixed process/interrupt environment.  The result
> is a horrendous amount of overhead for very little gain.  I am
> attempting to come up with a general locking module that can eventually
> replace lockmgr() locks.

The lockmgr() locks are ill suited to anything other than
advisory/mandatory file locking, IMO.  Any move away from them is a
good thing.  However, I have some comments on the implementation.

1)	SMP locking should be done utilizing intention modes.

	I had a nice discussion with Mike Smith and Julian about the
	proposed locking mechanism for non-intention mode multiple
	reader, single writer locks for the buffer cache.  This boiled
	down to:

	A)	Use of non-intention mode locks has a serious
		serialization penalty.

	B)	Use of locks where the write queue holds a shared
		intention exclusive (SIX) lock, implicit in the
		enqueueing of synchronization point data (e.g. a soft
		dependency on a directory entry block with one backed
		out operation), _significantly_ reduces the
		serialization overhead.

	C)	There's really no reason that the blocks that are not
		locked exclusive on the list can't be updated by user
		processes.
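For anyone who hasn't run into intention modes, the sketch below shows
the classic five-mode compatibility test from the database literature
(Gray et al.).  The names and table are illustrative only; they are
not taken from Matt's patch or from any existing FreeBSD code.

/*
 * Sketch: intention-mode lock compatibility.
 */
enum lkmode {
	LK_IS,			/* intention shared */
	LK_IX,			/* intention exclusive */
	LK_S,			/* shared */
	LK_SIX,			/* shared intention exclusive */
	LK_X,			/* exclusive */
	LK_NMODES
};

/*
 * compat[held][wanted]: nonzero if a lock in mode 'wanted' can be
 * granted on a node while another entity holds it in mode 'held'.
 */
static const char compat[LK_NMODES][LK_NMODES] = {
	/*		  IS  IX  S  SIX  X */
	/* IS  */	{  1,  1,  1,  1,  0 },
	/* IX  */	{  1,  1,  0,  0,  0 },
	/* S   */	{  1,  0,  1,  0,  0 },
	/* SIX */	{  1,  0,  0,  0,  0 },
	/* X   */	{  0,  0,  0,  0,  0 }
};

The point of (B) is visible in the SIX row: a SIX holder still admits
IS readers into the subtree while excluding other writers, so readers
are not serialized against the queued write.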
This resolves the Ziff-Davis Labs benchmark issue more cleanly than
the way that Mike stated Kirk proposed to solve the problem.

2)	In general, as regards SMP locking, I think that your lock
	structure is too abbreviated:

> struct qlock {
>	volatile int count;	/* lock refs: neg=exclusive, pos=shared */
>	int waiting;		/* process(s) are waiting on lock */
>	struct proc *holder;	/* holder of an exclusive lock */
> };

	In particular, the use of a single wait count means that you
	will not have an ordered list for the equivalent of a "wake
	one".  This is a real problem, since it allows a deadly
	embrace deadlock to occur when two kernel contexts each hold
	one lock and want to acquire the other context's lock.  I
	believe that resolving this as an EWOULDBLOCK, and then
	backtracking the stack and any partially completed operations,
	is prohibitively expensive.

	The use of a count, rather than a relationship between the
	lock holder(s) and the things being locked, is similarly
	problematic.

	Finally, there is insufficient mechanism to avoid competition
	starvation, where a writer blocks indefinitely as multiple
	readers pass the resource between each other.

I believe the following is the minimal set of structures required to
resolve blocking operation inheritance and deadlock detection:

struct lockable;		typedef struct lockable LKA;
struct lockingentity;		typedef struct lockingentity LKE;
struct locklistentry;		typedef struct locklistentry LKLE;
struct lockentitylistentry;	typedef struct lockentitylistentry LKEE;

struct lockentitylistentry {
	LKEE	*next;		/* next list entry */
	LKEE	*inherit;	/* who inherits this dependency */
	LKE	*entity;	/* entity waiting on us */
};

struct locklistentry {
	LKLE	*next;		/* next lock in list */
	LKA	*held;		/* a held lock */
};

struct lockingentity {
	LKLE	*holding;	/* locks that entity is holding */
	LKA	*waiton;	/* lock that entity is waiting on */
};

struct lockable {
	LKEE	*holds;		/* the locking entities */
	LKEE	*waits;		/* entities waiting for the lock */
};

This presumes a model where each context that wishes to acquire a lock
is considered a lockingentity (this works equally well for kernel
threads vs. processes vs. async call gates vs. pseudo-async call gates
a la lazy thread context creation for kernel sleeps, as in BSDI).
Each object that can be locked is considered a lockable, and the
relationship between a locking entity and a lockable is a
locklistentry.

The lockentitylistentry is used to inherit blocked, pending operations
to the root of the lock tree (graph).  This allows near instantaneous
deadlock detection, and the inheritance list combined with the graph
represents the transitive closure path between any lock you propose
and the locks which already exist.

Finally, one could envision a hierarchical relationship between
lockingentities, e.g. multiple threads within a process, so as to
avoid self-deadlock.  This really depends on your threads
implementation; obviously, a user space call conversion mechanism
backed by multiple kernel threads, and implemented wholly using
asynchronous kernel entries (e.g. an async call gate), would be
immune to the requirement.
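To make the "near instantaneous deadlock detection" claim concrete,
here is a sketch of the check against the structures above, under my
reading of them (the holds list entries naming the current holders);
would_deadlock() is a hypothetical helper of mine, not code from any
patch:

/*
 * Sketch only: returns nonzero if granting lockable 'want' to
 * entity 'e' would close a cycle in the wait-for graph.  Assumes
 * the existing graph is cycle free; a production version would
 * mark visited nodes to guarantee termination, and would special
 * case recursive-exclusive reacquisition by the same entity.
 */
static int
would_deadlock(LKE *e, LKA *want)
{
	LKEE *ee;

	for (ee = want->holds; ee != NULL; ee = ee->next) {
		if (ee->entity == e)
			return (1);	/* we already hold it */
		if (ee->entity->waiton != NULL &&
		    would_deadlock(e, ee->entity->waiton))
			return (1);	/* holder transitively waits on us */
	}
	return (0);
}

The walk is bounded by the depth of the wait chain rather than the
total number of locks in the system, which is what makes the check
cheap enough to run at acquisition time.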
Thread implementations without such an asynchronous entry mechanism
would need something like:

/*
 * iml.c
 *
 * An intention mode based lock manager
 *
 * This software uses graph theory.  A graph is a mathematical object
 * that can accurately model a problem formulated in terms of objects
 * and the relationships between them.  Lock management is one such
 * problem.  I recommend:
 *
 *	Algorithms in C++
 *	Robert Sedgewick
 *	Addison-Wesley Publishing Company
 *	ISBN 0-201-51059-6
 *
 *	The Art of Computer Programming
 *	_Volume 1: Fundamental Algorithms_
 *	Donald Knuth
 *	Addison-Wesley Publishing Company
 *	ISBN 0-201-03809-9
 *
 *	UNIX Systems for Modern Architectures
 *	_Symmetric Multiprocessing and Caching for Kernel Programmers_
 *	Curt Schimmel
 *	Addison-Wesley Publishing Company
 *	ISBN 0-201-63338-8
 */

struct lockable;	typedef struct lockable IML_L;
struct lockentity;	typedef struct lockentity IML_E;
struct locklock;	typedef struct locklock IML_LOCK;

/*
 * A node that can have a lock applied against it
 */
struct lockable {
	IML_L	*parent;	/* our parent lockable */
	IML_L	*sibling;	/* sibling lockable(s), if any */
	IML_L	*child;		/* child lockable(s), if any */
	IML_LOCK *locks;	/* locks that are held */
	IML_LOCK *waits;	/* locks that are waiting */
	unsigned long nesting_lvl;
};

/*
 * A lock entity node for holding locks and entity relationships
 */
struct lockentity {
	IML_E	*sibling;	/* entities with a relationship to us */
	IML_LOCK *locks;	/* the list of locks we hold */
};

/*
 * A lock instance
 */
struct locklock {
	IML_E	*entity;	/* the entity that applied the lock */
	IML_L	*lockable;	/* where the lock was applied */
	IML_LOCK *enextlock;	/* the next lock by the entity */
	IML_LOCK *lnextlock;	/* the next lock on the lockable */
	unsigned long mode;	/* the mode of this lock */
};

Note: this doesn't deal with the inheritance issue for waiters, which
would use a structure similar to the one from the previous example.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message