Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 13 Dec 1996 01:53:31 -0500
From:      Bakul Shah <bakul@plexuscom.com>
To:        Michael Hancock <michaelh@cet.co.jp>
Cc:        FreeBSD Hackers <Hackers@freebsd.org>
Subject:   Re: Implementing atomic DCAS on Intel for NBS 
Message-ID:  <199612130653.BAA17932@chai.plexuscom.com>
In-Reply-To: Your message of "Fri, 13 Dec 1996 11:44:25 %2B0900." <Pine.SV4.3.95.961213112940.12235A-100000@parkplace.cet.co.jp> 

next in thread | previous in thread | raw e-mail | index | archive | help
> It doesn't look like the P5 has DCAS which is important for NBS because it
> wants to atomically manipulate a structure and update it's version
> number.

P5/P6 have CMPXCHG.

    cmpxchg src, dst

is defined to be

    if (accumulator==dst)
	{ zeroflag = 1; dst = src; }
    else
	{ zeroflag = 0; accumulator = dst; }

When src is a register & dst is a memory addr, this instruction is
the same as compare-and-swap (CAS). The processor implements using
locked bus cycles so this is guaranteed to be atomic.

Here is some crude, untested C+asm goop for a cas() function:

/*
 * \begin{critical section}
 * if (*addr == old) { *addr = new; return 1; } else return 0;
 * \end{critical section}
 */
int
cas(int* addr, register int old, register int new)
{
	int r = 0;
	asm volatile("movl %0,%%eax": "=r"(old));
	asm volatile("cmpxchg %0,%1": "=r"(new): "r"(addr));
	asm volatile("je L1");
	r = 1;
	asm("L1:");
	return r;
}

Undoubtely this can be optmized and turned into an inline function
if we can get rid of that L1 label.

Anyway, you can implement DCAS in terms of CAS.

Doing so for an MP kernel is easy because all the kernel threads can
use the same global shared variable to guard the DCAS code.

volatile int global_dcas_lock = 0;
...
dcas(int *addr1, int* addr2, int old1, int old2, int new1, int new2)
{
	int r;
	while (cas(&global_dcas_lock, 0, p->p_pid) == 0)
		continue;
	r = *addr1 == old1 && *addr2 == old2;
	if (r) {
		*addr1 == new1; *addr2 == new2;
	}
	global_dcas_lock = 0;
	return r;
}

Note that effectively we have reduced the number of locks to 1 and
this lock is held for a very short duration.  This is why the
priority inversion problem does not even arise.

If you want a user level DCAS, you can do something similar.  But
the user process can get preempted in the middle.  So at the very
least you have to provide enough info to the kernel so that it can
roll you back out of a partial dcas().

If you don't want to use such a global variable, it gets quite a bit
hairy.  Proof left as an exercise.

Structures like singly linked list, trees and so forth can be
updated with just DCAS using the version numer idea.  But updating
a doubly linked list will require CAS3 (deletion will require
updating version number, which may be stored in a list head,
this->prev->next, this->next->prev).  In general you may need
vcas(), to atomically update N words.  The above implementation
easily extends to vcas().

Comments about NBS:

In general, when the number of words read exceeds the number of
written words (for shared data structures), NBS will be more
efficient than lockIng.  Deadlock simply can not happen because
atmost there is just one shared lock.

So in an MP kernel instead of a few (or one!) very coarse grained
lock)s( to guard shared data structures you can use NBS and increase
concurrency.  You still have to identify shared data structures and
figure out the boundaries of your transactions but intuitively this
seems easier than increasing kernel concurrency by replacing a few
coarse grain locks with many fine grained locks.  Unfortunately I
don't have time to test this theory :-(

-- bakul



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199612130653.BAA17932>