FreeBSD Mail Archives

Date:      Sun, 22 Sep 1996 15:46:47 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        jmb@freefall.freebsd.org (Jonathan M. Bresler)
Cc:        terry@lambert.org, bad@owl.WPI.EDU, freebsd-smp@FreeBSD.org
Subject:   Re: multithreading in the kernel and applications.
Message-ID:  <199609222246.PAA01210@phaeton.artisoft.com>
In-Reply-To: <199609221513.IAA03219@freefall.freebsd.org> from "Jonathan M. Bresler" at Sep 22, 96 08:13:46 am

> > A hierarchy with intention modes. 
>   
>         "intention modes".  uh oh, another term i have never heard before.
>         what is an "intention mode" and how is one coded?


	IR	Intention to establish a read lock
	R	A read lock
	IW	Intention to establish a write lock
	W	A write lock
	IX	Intention to establish an exclusive lock
	X	An exclusive lock

Think of it as "madvise" for locking:

	I intend a read lock, then I read lock

Etc.
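Here is a minimal C sketch of the six modes as data. The message does not spell out a compatibility table, so this one is an assumption borrowed from textbook multiple-granularity locking (IS/IX/S/X), with IW treated like IX and W treated like X. The LM_* names, modes_compatible() and intention_for() are all hypothetical. intention_for() is the "madvise" step: it gives the mode you post on each ancestor before taking the real lock on a node.

```c
#include <assert.h>
#include <stdbool.h>

/* The six modes from the table above. */
enum lock_mode { LM_IR, LM_R, LM_IW, LM_W, LM_IX, LM_X, LM_NMODES };

/*
 * ASSUMPTION: compat[held][requested] follows textbook multiple-granularity
 * locking, with IW treated like IX and W treated like X.  Intentions are
 * mutually compatible, which is what lets every CPU hold IX on the root.
 */
static const bool compat[LM_NMODES][LM_NMODES] = {
    /*           IR     R      IW     W      IX     X     */
    /* IR */   { true,  true,  true,  false, true,  false },
    /* R  */   { true,  true,  false, false, false, false },
    /* IW */   { true,  false, true,  false, true,  false },
    /* W  */   { false, false, false, false, false, false },
    /* IX */   { true,  false, true,  false, true,  false },
    /* X  */   { false, false, false, false, false, false },
};

/* True if a request for 'req' can be granted while 'held' is held. */
static bool modes_compatible(enum lock_mode held, enum lock_mode req)
{
    assert(held < LM_NMODES && req < LM_NMODES);
    return compat[held][req];
}

/* "I intend a read lock, then I read lock": the intention mode that must
 * be posted on every ancestor before taking 'm' on a node. */
static enum lock_mode intention_for(enum lock_mode m)
{
    switch (m) {
    case LM_IR: case LM_R: return LM_IR;
    case LM_IW: case LM_W: return LM_IW;
    default:               return LM_IX;
    }
}
```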

> > If you invoke a mutex without sharing which is an inferior node in
> 
>         how can you have a "mutex without sharing"?  if the data structure
>         that supports the mutex is not accessible to all processors, how
>         can it be used to achieve mutual exclusion?  the mutual exclusion
>         primitives that i know rely on a datum that has at least two states:
>         locked and unlocked.  each processor must read-modify-write the datum
>         in order to obtain the lock.


Say I have the following lock hierarchy:

    Hierarchy root o IX           <- shared (MESI/MEI required)
                  / \                [ protects system wide memory pool ]
                 /   \
                /     \
               /       \
         CPU1 o IX   IX o CPU 2   <- CPU local (unshared)
                                     [ protects per CPU memory pools ]


I can assert the CPU 2 X lock to access the CPU 2 local pool without
asserting the system wide lock.

(In the example above, the "root" would actually be an inferior node
of the system-wide lock that is the root of the hierarchy for all
subsystems.)

Since the CPU1 and CPU2 locks may both hold an intention to exclude
at the same time, converting a per-CPU lock (IX -> X, IR -> R, IW -> W)
requires no update to the shared lock.

This is a six-state lock system; more complex mappings are possible
(8-state [ IRW -> RW ], 12-state [ IIX -> IX -> X ], 16-state, ...).
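The following sketch shows the per-CPU conversion described above. It assumes two CPUs and a root whose only shared state is a count of IX holders. root_ix_holders, cpu_lock_init() and the other names are hypothetical. The IX intention is posted once at initialization. After that, IX -> X and back touch only CPU-local memory, and only a whole-pool operation needs the root exclusively.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCPU        2         /* ASSUMPTION: two CPUs, as in the diagram */
#define ROOT_X_HELD UINT_MAX  /* sentinel: root held exclusively */

/* Shared root: the only state every CPU writes, hence the only state that
 * generates coherency traffic.  It counts CPUs holding IX on it. */
static atomic_uint root_ix_holders;

/* Per-CPU child: written only by its owning CPU. */
enum child_mode { CHILD_NONE, CHILD_IX, CHILD_X };
static enum child_mode percpu_mode[NCPU];

/* Post IX on the root (one shared update, normally at initialization).
 * Fails if the root is currently held exclusively. */
static bool cpu_lock_init(int cpu)
{
    unsigned cur = atomic_load(&root_ix_holders);
    do {
        if (cur == ROOT_X_HELD)
            return false;
    } while (!atomic_compare_exchange_weak(&root_ix_holders, &cur, cur + 1));
    percpu_mode[cpu] = CHILD_IX;
    return true;
}

/* Withdraw the intention (one shared update). */
static void cpu_lock_fini(int cpu)
{
    assert(percpu_mode[cpu] == CHILD_IX);
    percpu_mode[cpu] = CHILD_NONE;
    atomic_fetch_sub(&root_ix_holders, 1);
}

/* IX -> X and X -> IX on the local child: the shared root is not touched. */
static void cpu_lock_x(int cpu)
{
    assert(percpu_mode[cpu] == CHILD_IX);
    percpu_mode[cpu] = CHILD_X;
}

static void cpu_unlock_x(int cpu)
{
    assert(percpu_mode[cpu] == CHILD_X);
    percpu_mode[cpu] = CHILD_IX;
}

/* Whole-pool operation: X on the root only when no CPU holds IX. */
static bool root_try_x(void)
{
    unsigned expected = 0;
    return atomic_compare_exchange_strong(&root_ix_holders, &expected, ROOT_X_HELD);
}

static void root_unlock_x(void)
{
    atomic_store(&root_ix_holders, 0);
}
```

In a real kernel, the per-CPU X state would also need protection against preemption and interrupts on the same CPU. The sketch leaves that out.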


> > a hierarchy of nodes, where some superior node is a mutex *with*
> > sharing, THEN as long as the superior node has an intention mode
> > established, the actual locking can be local to the inferior node.
> 
>         superior node:
>                 shared 
>                 uncached (or consistency protocol allows sharing)
>                 contains structure (called intention mode), perhaps
>                         many of these, each indicates a different purpose
>                         for obtaining the lock in the superior node?
>                 counted semaphore or multireader lock?????
> 
>         inferior node:
>                 shared
>                 uncached (or consistency protocol allows sharing)
>                 binary semaphore
                  may also be a counting semaphore (for multiple readers)

>   
>         SO, proc A can lock the superior node, set an "intention mode"
>                 to write some file, "A's file".

It is not a file locking mechanism.  It is a context synchronization
mechanism.  But the principles apply.

>         proc B comes along sees the superior node locked
>                 and bumps the lock count then sets a different
>                 "intention mode" which indicates it wants to write 
>                 "B's file" (use inode addresses or vnode addresses or
>                 some unique thing as the "intention mode" data?????
>                 or even the address of the inferior node.  yeah that's better)
>         then you only have to obtain an exclusive lock on the inferior node???

Yes.

> > And thus local to a single processor and image of cache coherency...
> > no additional bus arbitration need take place.
>   
>         because coherency is enforced at the superior node??

Because conflicts are commutative or associative in nature, and only
conflicts which are associative between CPUs need a conflict
resolution to take place (bus arbitration plus synchronization of a
shared object of some kind -- condition variable, semaphore, etc.).

Most of the intention modes can be permanently established at system
initialization time.

>         sounds interesting, but only for SOFTWARE MAINTAINED cache coherency
>         or systems that allow you to set the hardware coherency protocol on
>         a page by page basis.  ala MIPS 4000, you set the superior node to
>         uncached and the inferior to noncoherent??

No, you don't need to associate the locking with hardware enforcement
of the locking modes.  You have propagation and inheritance on your
side.  You inherit association down and you propagate intention up.
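One way to read "inherit association down, propagate intention up" as code is the following single-threaded C model. struct lnode, lock_node() and unlock_node() are hypothetical names. Locking a node first posts the matching intention on every ancestor, root first, and backs out if there is a conflict. The model reuses the assumed textbook-style compatibility table. A real implementation would update each node's counts atomically and would block rather than fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum lock_mode { LM_IR, LM_R, LM_IW, LM_W, LM_IX, LM_X, LM_NMODES };

/* ASSUMPTION: textbook multiple-granularity compatibility, IW ~ IX, W ~ X. */
static const bool compat[LM_NMODES][LM_NMODES] = {
    /* IR */ { true,  true,  true,  false, true,  false },
    /* R  */ { true,  true,  false, false, false, false },
    /* IW */ { true,  false, true,  false, true,  false },
    /* W  */ { false, false, false, false, false, false },
    /* IX */ { true,  false, true,  false, true,  false },
    /* X  */ { false, false, false, false, false, false },
};

#define MAX_DEPTH 8   /* ASSUMPTION: bound on hierarchy depth for the sketch */

struct lnode {
    struct lnode *parent;        /* NULL at the hierarchy root */
    unsigned held[LM_NMODES];    /* number of holders of each mode */
};

static enum lock_mode intention_for(enum lock_mode m)
{
    switch (m) {
    case LM_IR: case LM_R: return LM_IR;
    case LM_IW: case LM_W: return LM_IW;
    default:               return LM_IX;
    }
}

/* Grant 'm' on one node if it is compatible with every mode held there. */
static bool node_try(struct lnode *n, enum lock_mode m)
{
    for (int h = 0; h < LM_NMODES; h++)
        if (n->held[h] != 0 && !compat[h][m])
            return false;
    n->held[m]++;
    return true;
}

/* Propagate intention up: post intention_for(m) on each ancestor, root
 * first, then take 'm' on the node itself; back out on any conflict. */
static bool lock_node(struct lnode *n, enum lock_mode m)
{
    struct lnode *path[MAX_DEPTH];
    int depth = 0;
    enum lock_mode im = intention_for(m);

    for (struct lnode *p = n->parent; p != NULL; p = p->parent) {
        assert(depth < MAX_DEPTH);
        path[depth++] = p;                 /* path[depth - 1] is the root */
    }
    for (int i = depth - 1; i >= 0; i--) {
        if (!node_try(path[i], im)) {
            for (int j = i + 1; j < depth; j++)
                path[j]->held[im]--;       /* undo intentions already posted */
            return false;
        }
    }
    if (!node_try(n, m)) {
        for (int j = 0; j < depth; j++)
            path[j]->held[im]--;
        return false;
    }
    return true;
}

/* Release 'm' on the node and the intentions it posted on its ancestors. */
static void unlock_node(struct lnode *n, enum lock_mode m)
{
    enum lock_mode im = intention_for(m);
    n->held[m]--;
    for (struct lnode *p = n->parent; p != NULL; p = p->parent)
        p->held[im]--;
}
```

An X lock held on the root covers every node below it. This is the "inherit association down" half of the protocol: a later request on a child fails at the root's intention check.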

[ ... ]

> > In effect, you would establish a hierarchy of per-CPU "top level in 
> > the context of a single CPU" mutexes which could be used without  
> > forcing an update cycle to the top level ("system wide") mutex.
> 
>         you are talking software cache coherency, no?

No, I'm talking about shared region cache coherency, and defining what
needs to be shared and what doesn't, and the protocol necessary for
moving an object from one location in the hierarchy to another.  See
the memory pool example above.
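The memory pool example can be sketched as a per-CPU page cache that moves pages between levels of the hierarchy in batches. The sizes (GLOBAL_PAGES, BATCH) are hypothetical, and a C11 atomic_flag spinlock stands in for the system-wide mutex. Allocations and frees touch shared state only on refill() or drain(). Compare this with a design where every page request hits the global mutex.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPU         2    /* ASSUMPTION */
#define GLOBAL_PAGES 64   /* ASSUMPTION: size of the system-wide pool */
#define BATCH        8    /* ASSUMPTION: pages moved per trip to the global pool */
#define LOCAL_MAX    (2 * BATCH)

/* System-wide pool, guarded by a shared spinlock (the costly shared object). */
static atomic_flag global_lock = ATOMIC_FLAG_INIT;
static int global_free[GLOBAL_PAGES];            /* free page numbers */
static int global_count;
static unsigned long global_lock_acquisitions;   /* instrumentation only */

/* Per-CPU pool: ASSUMPTION each entry is touched only by its own CPU. */
struct cpu_pool { int pages[LOCAL_MAX]; int count; };
static struct cpu_pool pools[NCPU];

static void global_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&global_lock, memory_order_acquire))
        ;   /* spin: this is where bus arbitration happens */
    global_lock_acquisitions++;
}

static void global_release(void)
{
    atomic_flag_clear_explicit(&global_lock, memory_order_release);
}

static void pools_init(void)
{
    for (int i = 0; i < GLOBAL_PAGES; i++)
        global_free[i] = i;
    global_count = GLOBAL_PAGES;
}

/* Move up to BATCH pages from the global pool into this CPU's pool. */
static void refill(struct cpu_pool *p)
{
    global_acquire();
    while (p->count < BATCH && global_count > 0)
        p->pages[p->count++] = global_free[--global_count];
    global_release();
}

/* Move BATCH pages back to the global pool when the local pool is full. */
static void drain(struct cpu_pool *p)
{
    global_acquire();
    for (int i = 0; i < BATCH; i++)
        global_free[global_count++] = p->pages[--p->count];
    global_release();
}

/* Fast path touches only CPU-local state; returns -1 when out of pages. */
static int page_alloc(int cpu)
{
    struct cpu_pool *p = &pools[cpu];
    if (p->count == 0)
        refill(p);
    return p->count > 0 ? p->pages[--p->count] : -1;
}

static void page_free(int cpu, int page)
{
    struct cpu_pool *p = &pools[cpu];
    if (p->count == LOCAL_MAX)
        drain(p);
    p->pages[p->count++] = page;
}
```

The batch size sets the trade-off. Larger batches mean fewer trips to the shared lock, but more pages sit idle in per-CPU pools.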


It is more interesting to talk about how you can avoid the need for
sharing than it is to discuss how to make sharing more efficient.
It solves the same problem, but at a different level.

> 	given the processor speed that we are using and the bus
> 	speeds that we are limited to, memory and i/o bandwidth
> 	seems to be very scarce, something that we must husband,
> 	lest the system spin like a whirling dervish trying to
> 	obtain the required locks to perform some real work ;(

Yes.  Exactly.  You wish to be able to access memory without
invoking MEI/MESI synchronization phases.  If you are operating
in the L1 cache (where you will hopefully spend most of your time),
then you only bang on memory bandwidth for objects which you have
in common with other CPUs.
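This small C11 illustration keeps hot data out of shared cache lines. It assumes a 64-byte line and four CPUs; both constants are hypothetical. Each CPU increments its own padded counter, so an increment never invalidates a line another CPU is writing. Only the rare reader pays to pull in every line.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCPU       4    /* ASSUMPTION */
#define CACHE_LINE 64   /* ASSUMPTION: common L1 line size; check the target CPU */

/* One counter per CPU, each alone on its own cache line (no false sharing). */
struct percpu_counter {
    _Alignas(CACHE_LINE) _Atomic uint64_t value;
};
_Static_assert(sizeof(struct percpu_counter) == CACHE_LINE, "one line per CPU");

static struct percpu_counter counters[NCPU];

/* Hot path.  ASSUMPTION: only CPU 'cpu' writes this slot, so a relaxed load
 * and store (no locked read-modify-write) is enough, and the line stays in
 * this CPU's cache except when a reader samples it. */
static void counter_inc(int cpu)
{
    uint64_t v = atomic_load_explicit(&counters[cpu].value, memory_order_relaxed);
    atomic_store_explicit(&counters[cpu].value, v + 1, memory_order_relaxed);
}

/* Cold path: the reader pulls in every CPU's line.  The sum is a snapshot
 * that may lag increments still in progress. */
static uint64_t counter_read(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < NCPU; i++)
        sum += atomic_load_explicit(&counters[i].value, memory_order_relaxed);
    return sum;
}
```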

You can't avoid the overhead of loading your L1 cache from memory or
the L2 cache: the processor must have data on which to operate.

But the overhead of traditional SMP implementations is primarily
bus arbitration for inter-CPU synchronization, much of which is
unnecessary.  For instance, to get a page on SVR4, you must hit
a global mutex, and each processor must invalidate the cache data
(or update it) for the shared memory region where the mutex lives,
even if it never has to block waiting for that same mutex.

It's *this* overhead that I want to avoid.


Hopefully the picture makes things clearer.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199609222246.PAA01210>
