Date:      Thu, 13 Apr 2017 15:46:45 -0700 (PDT)
From:      Chris Torek <torek@elf.torek.net>
To:        ablacktshirt@gmail.com, imp@bsdimp.com
Cc:        ed@nuxi.nl, freebsd-hackers@freebsd.org, kostikbel@gmail.com, rysto32@gmail.com
Subject:   Re: Understanding the FreeBSD locking mechanism
Message-ID:  <201704132246.v3DMkj27027791@elf.torek.net>
In-Reply-To: <a51d29c2-4cab-0dfc-6fdc-81d7b2188d61@gmail.com>

(This is getting a bit far afield; let me know if we should
take this off-list.)

>why would [regular read, vs CAS] produce wrong result?

There are both hardware reasons (the architecture, and sometimes
the individual CPU implementation) and compiler reasons for this.

First, compilers may try to optimize load and store operations,
especially on register-rich architectures.  What's coded as:

    ... some code section A ...
    x = p->foo;
    y = p->bar;
    ... some code section B ...

might actually move the loads of p->foo and/or p->bar into either
the A or B sections.  The same goes for stores.  The compiler
makes the (somewhat reasonable for most programming) assumption
that only the instructions the compiler itself emits, actually
access the data -- not some instructions running on some other
CPU.

For any lock, this assumption is automatically wrong.

We can defeat part of this with the "volatile" keyword, but we
need to insert compiler-level memory barriers to make sure that
the operations proceed in a temporally-defined manner, i.e.,
so that time appears to be linear.
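
As a sketch, such a compiler barrier in GCC/Clang-style C is just
an empty asm with a "memory" clobber (FreeBSD's machine headers
carry a macro along these lines):

    /* emits no machine instruction, but the "memory" clobber tells
       the compiler all memory may have changed, so it cannot hoist
       or sink loads and stores across this point */
    #define compiler_barrier()      __asm __volatile("" ::: "memory")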

Second, the CPU itself may also have both temporal and non-
temporal loads and stores (with arbitrarily complicated rules
about using them).  In this case there may be special instructions
("sfence", "mfence", etc; "membar" on SPARC) for forcing order.
For more about non-temporal operations, see, e.g.:

 http://stackoverflow.com/q/37070/1256452
 http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/CJACGJJF.html
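
For a sketch of the software side in C11 terms (the kernel has its
own atomic(9) wrappers; the names here are invented), a release
fence keeps a payload store from drifting past the flag store that
publishes it.  On x86 the compiler may emit no fence instruction at
all, since ordinary stores are already ordered there, while weaker
architectures get a real barrier:

    #include <stdatomic.h>

    int data;                           /* payload being published */
    atomic_int ready;                   /* flag the other CPU polls */

    void
    publish(int v)
    {
        data = v;
        /* release fence: neither compiler nor CPU may reorder the
           store to data after the store to ready */
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }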

There are some lock algorithms that work without most of this, but
they tend to be a bit hard to set up.  Even then we usually depend
on an atomic compare-and-swap: see
http://wiki.c2.com/?LockFreeSynchronization for instance.

>I think what the inner loop wants to do is to perform some no-op
>for a while before it tries again to acquire the spinlock.

Yes - but the point is that it tries to "gently" read the actual
mutex lock value, and inspect the result to see whether to try the
more-savage (at the hardware level) CAS.
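
Here is a minimal sketch of that pattern in C11 atomics -- this is
*not* the actual mtx_lock() code, which also handles priorities,
adaptive spinning, and so on.  The inner loop only reads, which is
the gentle part; the CAS is the savage part:

    #include <stdatomic.h>

    static atomic_int lock;             /* 0 = free, 1 = held */

    static void
    spin_lock(atomic_int *l)
    {
        for (;;) {
            int expected = 0;

            /* the savage part: atomically flip 0 -> 1 */
            if (atomic_compare_exchange_weak(l, &expected, 1))
                return;                 /* we own the lock */

            /* the gentle part: read-only spin while it looks held */
            while (atomic_load_explicit(l, memory_order_relaxed) != 0)
                ;
        }
    }

(Usage would be spin_lock(&lock); the names are made up for
illustration.)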

Some of this gets deep into the weeds of hardware implementations.
I had this in my earlier reply (but ripped it out as too much
detail).  On Intel dual-socket systems, for instance, there is a
special bus that connects the two sockets called the QPI, and then
there are caches around each core within any one given socket.
These caches come in multiple levels (L1 and higher, details vary
and one should always expect the *next* CPU to do something
different) with some caches physically local to one core and
others shared between multiple cores in one socket.

These caches tend to coordinate using protocols called MESI or
MOESI.  The letters stand for cache line states: Modified, Owned,
Exclusive, Shared, or Invalid.  A Modified cache line has data not
yet written to the next level out (whether that's a higher level,
larger cache, or main memory).  An Exclusive line is in this cache
only and can therefore be written to.  (I'm ignoring "owned"; it is
roughly a Modified line that other caches also hold, a cross
between M and S.)  A Shared line is in this
cache *and some other cache* and therefore can *not* be written
to, but *can* be read from; and finally, an Invalid line has no
valid data at all.

As a rule, the closer a cache line is to the CPU, the faster its
access is.  (This rule is pretty reliable since otherwise there is
no point in having that cache.)  *Writing* to a cache line
requires exclusive access, though, so we must know if the line is
shared.  If it *is* shared, we must signal higher level caches
that we intend to write, and wait for them to give up their cached
copies of data.  In other words we fire a bullet at them: "I want
exclusive access, kill off your shared copy."  Then we must wait
for a reply, or a time delay (whichever is architecturally
appropriate), so that we know that this succeeded, or get a
failure indication ("you may not have that exclusively, not just
yet anyway").  This reply or delay can take as long as the
*worst-case*, slowest access time.  For dual-socket Intel that
means doing a QPI bus transaction to the other socket.

(This is true for any write operation, not just compare-and-swap.
For this reason, we often like our mutexes to fill out a complete
cache line, so that any data *protected by* the mutex is not shot
out of the cache every time we poke at the mutex itself.)
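
A sketch of that layout trick, assuming a 64-byte line (the real
size varies by CPU and lives in the machine headers):

    #include <stdatomic.h>

    #define CACHE_LINE 64               /* assumption: size varies */

    /* pad the lock word to a full line so the data it protects is
       not invalidated every time a spinner or the CAS touches the
       lock itself */
    struct padded_lock {
        atomic_int lock;
        char       pad[CACHE_LINE - sizeof(atomic_int)];
    } __attribute__((aligned(CACHE_LINE)));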

Note that when we *read* the object, however, we're doing a read,
not a write.  This does not need exclusive access to the cache line:
shared access suffices.  If we do not have the data in cache, we
send out a request: "I would like to read this."  Any CPU that has
the item cached, e.g., whoever actually locked the lock, must drop
it back to whatever level accomplishes sharing -- if it's dirty,
writing it out -- and take his core-side cache line status back
from M or E to S.  Any other CPU also spinning, waiting for the
lock, must go to this shared state.  Now all CPUs interested in
the lock -- the holder, and all waiters -- have it shared.  They
can all *read* the line and see whether it's still locked.  There
is no traffic over inter-cache or inter-socket busses at this
point.  These are the "gentle" spins, that only *read* the lock.

Eventually, whoever owns the lock unlocks it.  This requires a
write (or CAS) operation, which yanks the cache line away from all
the spinners (that part is unavoidable, and slow, and causes all
this bus traffic we were avoiding) and releases the lock.  The
spinners then take the shared cache lines back to shared state and
see that they *may* be able to get the lock.  At this point they
attempt the expensive operation, and *that* produces a reliable
answer -- which may be "someone else beat us to the lock so we
go back to the gentle spin code".
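
In the C11 sketch from earlier, the unlock side needs only a
release store, not a CAS -- though that store still invalidates
every waiter's Shared copy of the line:

    static void
    spin_unlock(atomic_int *l)
    {
        /* release store: all the critical section's writes become
           visible before the lock word reads as free */
        atomic_store_explicit(l, 0, memory_order_release);
    }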

Note that some architectures do not have an actual compare-and-
swap instruction.  For instance, PowerPC and MIPS use a different
technique: there is a load instruction that takes the cache line
to exclusive state *and* sets an internal CPU register to remember
this.  If the cache line drops back out of exclusive state, a
subsequent "store conditional" instruction fails the condition,
does *not* store, and lets you branch to a loop that repeats the
load if needed.  If it is still exclusive, the write succeeds (and
the cache line goes to M state, if the cache is write-back).  This
lets you build a compare-and-swap from the low level cache-line
synchronization operations that are what the hardware uses.
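
This is also why C11 offers a *weak* compare-and-swap: on LL/SC
machines the store-conditional can fail even though the value still
matched (the reservation was simply lost), so the weak form is
meant to live in a loop.  A sketch:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* on PowerPC/MIPS the compiler expands this weak CAS to a
       load-linked/store-conditional pair (lwarx/stwcx. on PowerPC);
       "weak" means it may fail spuriously, so callers retry */
    static bool
    try_lock(atomic_int *p)
    {
        int expected = 0;

        return (atomic_compare_exchange_weak(p, &expected, 1));
    }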

There is more on this at:

http://stackoverflow.com/q/151783/1256452

(These are not the only ways to implement synchronization.
Some more exotic architectures have special regions of physical
memory that can act transactionally, or that auto-increment
upon load so that each CPU can "take a ticket" and use a
version of Lamport's Bakery algorithm: see
https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
for details.  However, the BSD mtx_lock() is designed around
compare-and-swap plus optimizations for MESI cache implementations.)
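
For flavor, here is a software-only sketch of the "take a ticket"
idea, substituting an ordinary atomic fetch-and-add for the
auto-incrementing memory (all names invented):

    #include <stdatomic.h>

    struct ticket_lock {
        atomic_uint next;               /* next ticket to hand out */
        atomic_uint serving;            /* ticket currently served */
    };

    static void
    ticket_lock(struct ticket_lock *t)
    {
        /* take a ticket, then gently spin (reads only) until served */
        unsigned int me = atomic_fetch_add(&t->next, 1);

        while (atomic_load_explicit(&t->serving,
            memory_order_acquire) != me)
            ;
    }

    static void
    ticket_unlock(struct ticket_lock *t)
    {
        /* only the holder writes serving, so a plain increment
           published with release ordering suffices */
        atomic_store_explicit(&t->serving,
            atomic_load_explicit(&t->serving,
            memory_order_relaxed) + 1, memory_order_release);
    }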

Chris


