Date:      Wed, 5 Jul 2000 23:00:49 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        cp@bsdi.com (Chuck Paterson)
Cc:        grog@lemis.com (Greg Lehey), eischen@vigrid.com (Daniel Eischen), jasone@canonware.com (Jason Evans), luoqi@watermarkgroup.com (Luoqi Chen), smp@FreeBSD.ORG
Subject:   Re: SMP meeting summary
Message-ID:  <200007052300.QAA26840@usr05.primenet.com>
In-Reply-To: <200007040218.UAA01169@berserker.bsdi.com> from "Chuck Paterson" at Jul 03, 2000 08:18:00 PM

> }I'm not sure we're talking about the same thing, but if so I must be
> }missing something.  If I'm waiting on a mutex, I still need to
> }reacquire it on wakeup, don't I?  In that case, only the first process
> }to be scheduled will actually get the mutex, and the others will block
> }again.
> 
> 	Yes, you need to acquire the mutex on wakeup, but likely
> one process will run acquiring and releasing the mutex in an
> uncontested fashion before other processes run and do the same
> thing.

You can assume in SVR4, in the wake_one case, that you will be
the only process awake, and so your acquisition will not be
contested, and will not result in a sleep.

Logically, you can consider that there is one waiter and N-1
sleepers for every N processes trying to acquire a mutex.

This is normally handled [in the literature] by using a hybrid
lock in a hierarchy.

That is, you attempt a fast lock, and if that fails, you then
attempt a slow ("sleeping") lock.  You are guaranteed a wakeup
on release of a fast lock, and on release of a sleeping lock,
so it's sixes either way.

Of course, it's a lot easier to just use a critical section.
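
In userland pthread terms, the fast/slow split looks roughly
like the following (a minimal sketch; the spin budget and the
hybrid_acquire name are illustrative, not SVR4's):

#include <pthread.h>

/* Illustrative spin budget before giving up and sleeping. */
#define SPIN_TRIES	100

static void
hybrid_acquire(pthread_mutex_t *m)
{
	int i;

	/* Fast path: spin briefly, hoping the holder lets go soon. */
	for (i = 0; i < SPIN_TRIES; i++)
		if (pthread_mutex_trylock(m) == 0)
			return;

	/* Slow path: block ("sleep") until the lock is released. */
	pthread_mutex_lock(m);
}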


> }In my experience, I've seen mutexes used for long-term waits, and I
> }don't see any a priori reason not to do so.  Of course, if we make
> }design decisions based on the assumption that all waits will be short,
> }then we will have a reason, but it won't be a good one.
> }
> }Before you say that long-term waits are evil, note that we're probably
> }talking about different kinds of waits.  Obviously anything that
> }threatens to keep the system idle while it waits is bad, but a
> }replacement for tsleep(), say, can justifiably wait for a long time.
> 
> 	A replacement for tsleep is not a mutex, but in Solaris
> parlance a conditional variable. The uses are different, one is
> for locking a resource, the other is waiting on a synch event. A
> conditional variable, like the sleep queues has a mutex associated
> with it. This mutex is not held except while processing the event,
> both by the process waiting and the process doing the activation.
> I don't think it is a good idea to assume that the heuristics for
> waking up tsleep / conditional variables  is going to be
> anything like those seen with mutexs.

Effectively, manipulation of a condition variable is itself
placed in a critical section through the use of its associated
mutex.
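
In POSIX terms the pattern is roughly the following (a minimal
userland sketch; the names are illustrative).  The mutex is held
only while testing or posting the predicate, and is dropped
atomically while the waiter is actually asleep:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t	lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t	cv   = PTHREAD_COND_INITIALIZER;
static bool		event_posted = false;

/* Waiter: holds the mutex only while checking the predicate. */
void
wait_for_event(void)
{
	pthread_mutex_lock(&lock);
	while (!event_posted)
		pthread_cond_wait(&cv, &lock);	/* drops lock while asleep */
	event_posted = false;
	pthread_mutex_unlock(&lock);
}

/* Waker: holds the mutex only while posting the event. */
void
post_event(void)
{
	pthread_mutex_lock(&lock);
	event_posted = true;
	pthread_cond_signal(&cv);
	pthread_mutex_unlock(&lock);
}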

In practice, there are some ugly areas in the Solaris SMP
reentrant VFS code that necessitate treating the condition
variable as if it were a mutex on a larger structure.  This
reduces concurrency considerably.

The main problem with wake_one is the deadly embrace deadlock,
not the priority inversion deadlock; the latter can always be
"opted out of" by priority lending (or by making the wake_one
more choosy about who it wakes, above and beyond just the head
of the wait queue).

What makes a thundering herd expensive is less the herd than
the traversal of the list; think about it: if I have the cycles
to burn in the scheduler to pick someone to run, then I wasn't
doing important other work anyway, and I might as well burn
them on the herd as anywhere else.
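
To put numbers on it: a wake_one is O(1), and a wake_all is O(n)
in the length of the wait queue, where the cost is the traversal
itself.  A sketch using the queue(3) macros (make_runnable() is
a hypothetical stand-in for handing the process to the scheduler):

#include <sys/queue.h>

struct waiter {
	TAILQ_ENTRY(waiter)	w_link;
	/* ... per-process wait state ... */
};
TAILQ_HEAD(waitq, waiter);

void	make_runnable(struct waiter *);	/* hypothetical scheduler hook */

/* wake_one: constant cost; only the head of the queue runs. */
void
wake_one(struct waitq *wq)
{
	struct waiter *w;

	if ((w = TAILQ_FIRST(wq)) != NULL) {
		TAILQ_REMOVE(wq, w, w_link);
		make_runnable(w);
	}
}

/* wake_all: the expense is the traversal, linear in the herd. */
void
wake_all(struct waitq *wq)
{
	struct waiter *w;

	while ((w = TAILQ_FIRST(wq)) != NULL) {
		TAILQ_REMOVE(wq, w, w_link);
		make_runnable(w);
	}
}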

A spinlock fixes this by implementing back-off + retry, at
least for sets of two locks.  Sets of more locks are really
problematic.
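
For the two-lock case, back-off + retry is roughly this (a
minimal sketch, not any particular kernel's spin lock code):

#include <pthread.h>
#include <sched.h>

/*
 * Holding A, only try-lock B; on failure, drop A and start over,
 * so neither party ever waits for one lock while holding the
 * other.
 */
void
acquire_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	for (;;) {
		pthread_mutex_lock(a);
		if (pthread_mutex_trylock(b) == 0)
			return;			/* got both */
		pthread_mutex_unlock(a);	/* back off */
		sched_yield();			/* retry later */
	}
}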

A lot of work was done in SVR4 ES/MP to, effectively, resolve
the problem using Dijkstra's "Banker's Algorithm" (that is, all
the resources for sets of greater than two members, and in some
cases one member -- usually the parent directory in a descending
path lookup -- are allocated "up front", which is to say "at
the same stack depth / in the same function", to permit state
to be backed out easily when a deadlock is detected).
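
A rough sketch of the "up front" style, with trylock standing in
for real deadlock detection (the names are mine): all the locks
for an operation are taken in one place, so a failure anywhere
can be backed out by unwinding only what has already been taken,
and the caller retries from a clean state.

#include <pthread.h>
#include <stddef.h>

/* Returns 0 with all locks held, or -1 with none held. */
int
acquire_all(pthread_mutex_t **locks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (pthread_mutex_trylock(locks[i]) != 0) {
			/* Would-be deadlock: back out in reverse order. */
			while (i-- > 0)
				pthread_mutex_unlock(locks[i]);
			return (-1);
		}
	}
	return (0);
}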

This stuff is really unsatisfying from the point of view of
someone trying to write a reentrant ("kernel thread safe" or
"kernel preemption safe") VFS provider of some kind, since
it's really hard to know when the semantics applied by an upper
level function might result in a problem.  Other subsystems
have similar issues, but most of my experience was with VFS
providers, so I can't give you the PCMCIA device attach issues
in SVR4 (maybe we can track down Kurt Mahon, though).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

