Date:      Sun, 9 Apr 2017 19:16:26 -0700 (PDT)
From:      Chris Torek <torek@elf.torek.net>
To:        rysto32@gmail.com, vasanth.raonaik@gmail.com
Cc:        ablacktshirt@gmail.com, ed@nuxi.nl, freebsd-hackers@freebsd.org
Subject:   Re: Understanding the FreeBSD locking mechanism
Message-ID:  <201704100216.v3A2GQ2s032227@elf.torek.net>
In-Reply-To: <CAFMmRNzOypqsBam2BfaFm+pX7hSYoEvB2oFtec8OtH6D=s9yTw@mail.gmail.com>

Ryan Stone is of course correct here.  I have not kept up with the
latest terminology shifts, but I can describe a bit of the history
of all of this.  (I was somewhat involved with writing the
original MTX_DEF and MTX_SPIN mutex code way back when).

(None of the rest of this should be new to experienced kernel
hackers.)

In the old non-SMP days, BSD, like traditional V6 Unix, divided
the kernel into "top half" and "bottom half" sections.  The top
half was anything driven from something other than an interrupt,
such as initial bootstrap or any user-sourced system call.  Each
of these had just one (per-process) kernel stack, in the "u.
area", which was UPAGES * NBPG (number of bytes per page) bytes
long, but also had to contain "struct user".

(In other words, the stack space available was actually smaller
than that.  The "user" struct was *above* the kernel stack, so
that ksp would not grow down into the structure; there was also
signal trampoline code wedged in there, at least on the VAX and
some of the early other ports.  I desperately wanted to move the
trampoline code to libc for the sparc port.  It was *in theory*
easy to do this :-) ... practice was another matter.)

When an interrupt arrived, as long as it was not interrupting
another interrupt, the system would get on a separate "interrupt
stack" -- some hardware supports this directly, with a separate
interrupt stack register -- which meant we did not have to provide
enough interrupt-handling space in the per-process kernel stack,
nor take interrupts on some possibly dodgy user stack.
(Interrupts can occur at any time, so the system may be running
user code, not kernel code.)

It also meant that a simple:

    s = splfoo();

call in the top half would block any interrupts at priority foo or
lower, so that "top half" code could know that "bottom half" code
for foo would not run at this point.

With prioritized interrupts, *taking* an interrupt at level foo
automatically raised the CPU's priority to foo, so any "bottom
half" code for foo would know that no important "top half" code
was running at the time -- if that had been the case, the top half
would have done an splfoo() to block it -- and of course no other
"bottom half" code for foo could run now.

Meanwhile, no bottom-half code was *ever* allowed to block.  When
you took an interrupt, you were committed to finishing all of the
work for that interrupt before issuing a "return from interrupt"
instruction (which would, if the interrupt was not interrupting
another interrupt, switch back to whatever stack was in use when
the interrupt arrived).

A good way to describe this strategy:

    s = splfoo();  /* block all bottom half code at this priority */
    ...
    splx(s);       /* allow blocked bottom-half code to run again */

is that spl (set priority level) provides mutual exclusion to
*code paths*.  The top half blocks out the bottom half with an
spl, and the bottom half blocks out the top half by simply *being*
bottom-half code, handling interrupts.
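
To make the spl discipline concrete, here is a toy single-CPU model
in plain C.  All of the names here (splraise, curpri, the level
constants) are invented for illustration; the real kernels kept the
priority in hardware or in a per-CPU interrupt mask, and each level
had its own entry point (splnet(), splbio(), and so on):

```c
/* Toy model of the old spl scheme -- hypothetical names, not kernel code. */
enum { SPLNONE = 0, SPLNET = 4, SPLBIO = 5, SPLHIGH = 7 };

static int curpri = SPLNONE;      /* current CPU interrupt priority */
static int deferred_work = 0;     /* "bottom half" work held off    */

static int splraise(int pri) {    /* stands in for splnet() et al.  */
    int s = curpri;
    if (pri > curpri)
        curpri = pri;
    return s;                     /* caller saves this for splx()   */
}

static void splx(int s) {         /* restore the saved priority     */
    curpri = s;
}

/* An interrupt at level pri runs only if it exceeds curpri;
 * otherwise it is held pending until the priority drops. */
static int interrupt(int pri) {
    if (pri <= curpri) {
        deferred_work++;
        return 0;                 /* blocked for now                */
    }
    return 1;                     /* handler would run immediately  */
}
```

The point of the model: raising to SPLBIO blocks an SPLNET
interrupt, but a higher-priority SPLHIGH interrupt still gets
through, exactly the prioritized behavior described above.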

     -----

With SMP, this whole strategy is a non-starter.  We don't have
just one CPU running code; we cannot block *code* paths at all.
Instead, we switch to mutually exclusive access to *data*.

We then make several observations:

 * Most data structures are mostly uncontested.  (If not, we need to
   redesign them!)  "Get lock" should be fast in this usual case.

 * If we provide what used to be "bottom half" drivers with *their
   own* stacks / interrupt threads ("ithreads"), they can block if
   they need to: when the data structure *is* actually contested.

This means we mainly need just one kind of mutex.  For read-mostly
data structures, we would like to have shared-read locks and so
on, but we can build them on this base mutex.  (As it turns out,
this view is a little simplistic; we want to build them on a base
compare-and-swap, typically, rather than a base full-blown-mutex.
It would also be nice to have CAS-based multi-producer single-
consumer and single-producer multi-consumer queues.  These are
particularly useful on systems with hundreds or thousands of
cores.)
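
As a sketch of that CAS fast path, here is an uncontested lock
acquire built on C11 <stdatomic.h>.  The type and function names
are invented for illustration, not the real mtx(9) interface; a
real MTX_DEF mutex also records enough state to block and hand off
to the scheduler on contention, which this toy omits:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative CAS fast path -- not the kernel's mtx(9) code. */
typedef struct { _Atomic(unsigned long) owner; } toy_mtx;

#define TOY_UNOWNED 0UL

static toy_mtx demo_mtx = { TOY_UNOWNED };

/* Uncontested acquire: one compare-and-swap, no scheduler work.
 * On contention a real mutex would queue and block here. */
static bool toy_mtx_try_lock(toy_mtx *m, unsigned long tid) {
    unsigned long expected = TOY_UNOWNED;
    return atomic_compare_exchange_strong(&m->owner, &expected, tid);
}

static void toy_mtx_unlock(toy_mtx *m) {
    atomic_store(&m->owner, TOY_UNOWNED);
}
```

In the common case -- nobody holds the lock -- acquiring it is a
single atomic instruction, which is why the uncontested path can
be made fast.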

Of course, we also have to start dealing with issues like priority
inversion and lock order / possible deadlock, any time we lock
data instead of code paths.  But that's mostly a separate issue.

     -----

This is all fine for most code, including most device drivers, but
for manipulating the hardware itself, the lowest level interrupt
dispatching code, and also the system scheduler, still must block
interrupts from time to time.  We also have some special oddball
cases, such as profiling interrupts, where we want to know: "What
was running when the interrupt fired?"  For these cases we *don't*
want to switch to a separate interrupt thread:

 * In the hardware interrupt dispatcher, we may not *know which
   thread to switch to* yet.  We must find the right ithread,
   then schedule it to run.  (Then we have to manipulate the
   hardware based on whether the interrupt is edge or level
   triggered, and so on, but that's an in-the-weeds detail also
   mostly unrelated to this scheduling.)

 * For the profiling "what was running" case, we'd like to sample
   what was running, which we *can't* do from a separate thread:
   we need access to the current stack.  (Strictly speaking, we
   merely need said access ... but we also need that thread to
   remain paused while we sample it.)

 * And, for some low-cost paths such as gathering entropy, we may
   not want or need to *pay* the up-front cost of a separate ithread.

 * In the scheduler, we're either in the process of choosing
   threads and changing stacks, or setting up data structures to
   tell the chooser which threads to choose.  We need to block all
   scheduling events, including all interrupts, for some of these
   super-critical sections.

These use the MTX_SPIN type lock, which is similar to MTX_DEF, but:

 * does block interrupts, and 

 * never *invokes* the scheduler: never tries to put the current
   thread out of the running until the locked data are available.
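
A toy spin-mutex in that style might look like the following.
Everything here is an invented stand-in: interrupt disabling is
modeled with a plain counter, since the real cli/sti-equivalent
operations only make sense inside the kernel, and the real
MTX_SPIN code does considerably more bookkeeping:

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic bool spin_locked = false;
static int intr_disable_count = 0;   /* stand-in for interrupt masking */

static void toy_spin_lock(void) {
    intr_disable_count++;            /* block interrupts first         */
    bool expected = false;
    /* Spin until the CAS succeeds: never invoke the scheduler,
     * never put the current thread to sleep. */
    while (!atomic_compare_exchange_weak(&spin_locked, &expected, true))
        expected = false;            /* CAS clobbered it; reset        */
}

static void toy_spin_unlock(void) {
    atomic_store(&spin_locked, false);
    intr_disable_count--;            /* re-enable interrupts           */
}

static bool toy_spin_held(void) {
    return atomic_load(&spin_locked);
}
```

The two distinguishing properties are visible directly: interrupts
stay masked for the whole hold time, and the acquire path busy-waits
rather than ever blocking.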

Since then, we have added another special case:

 * In a "critical section", we wish to make sure that the current
   thread does not migrate from one CPU to another.  This does
   not, strictly speaking, require blocking interrupts entirely,
   but because the scheduler does its thing by blocking interrupts,
   we block interrupts for short durations here as well (actually
   when *leaving* the critical section, where we check to see if
   the scheduler would *like* us to migrate).

   This is not really a mutex at all, but it does interact with
   them, so it's worth mentioning.  Essentially, if you are in a
   critical section, you may not switch threads, so if you need
   a mutex, you must use a spin mutex.

   (This *is* well-documented in "man 9 critical_enter".)
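
The deferred-migration rule -- honor the scheduler's request only
when *leaving* the outermost critical section -- can be sketched
like this.  The names and fields are invented stand-ins; FreeBSD
tracks the nesting depth in the thread structure (see "man 9
critical_enter"):

```c
#include <stdbool.h>

static int critnest = 0;             /* critical-section nesting depth */
static bool migrate_pending = false; /* scheduler asked us to move     */
static int migrations = 0;

static void toy_critical_enter(void) {
    critnest++;                      /* pinned to this CPU from here   */
}

static void toy_request_migration(void) {
    migrate_pending = true;          /* deferred while critnest > 0    */
}

static void toy_critical_exit(void) {
    if (--critnest == 0 && migrate_pending) {
        /* Only on the outermost exit do we act on the request. */
        migrate_pending = false;
        migrations++;
    }
}
```

Note that nested exits do nothing special; only the final
toy_critical_exit() checks whether the scheduler wants us to move.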

One might argue that being in a critical section should turn an
MTX_DEF mutex into an MTX_SPIN mutex, but it's not that easy to
do, and if you're taking a possibly slow MTX_DEF lock *while* in a
critical section, "you're doing it wrong" (as with the heavily
contested datum problem, we should rewrite the code so that the
critical section happens *while* holding the MTX_DEF object, not
the other way around).

Anyway, that's how we got here, and why things are the way they
are.

Chris


