Date: Thu, 13 Apr 2017 15:46:45 -0700 (PDT)
From: Chris Torek
Message-Id: <201704132246.v3DMkj27027791@elf.torek.net>
To: ablacktshirt@gmail.com, imp@bsdimp.com
Cc: ed@nuxi.nl, freebsd-hackers@freebsd.org, kostikbel@gmail.com, rysto32@gmail.com
Subject: Re: Understanding the FreeBSD locking mechanism

(This is getting a bit far afield; let me know if we should take
this off-list.)

>why would [regular read, vs CAS] produce wrong result?
There are both hardware architecture (and sometimes individual CPU
implementation) and compiler reasons for this.

First, compilers may try to optimize load and store operations,
especially on register-rich architectures. What's coded as:

    ... some code section A ...
    x = p->foo;
    y = p->bar;
    ... some code section B ...

might actually move the loads of p->foo and/or p->bar into either
the A or B sections. The same goes for stores. The compiler makes
the (somewhat reasonable for most programming) assumption that only
the instructions the compiler itself emits actually access the data
-- not some instructions running on some other CPU. For any lock,
this assumption is automatically wrong. We can defeat part of this
with the "volatile" keyword, but we also need to insert
compiler-level memory barriers to make sure that the operations
proceed in a temporally-defined manner, i.e., so that time appears
to be linear.

Second, the CPU itself may also have both temporal and non-temporal
loads and stores (with arbitrarily complicated rules about using
them). In this case there may be special instructions ("sfence",
"mfence", etc.; "membar" on SPARC) for forcing order. For more
about non-temporal operations, see, e.g.:

    http://stackoverflow.com/q/37070/1256452
    http://infocenter.arm.com/help/topic/com.arm.doc.den0024a/CJACGJJF.html

There are some lock algorithms that work without most of this, but
they tend to be a bit hard to set up. Even then we usually depend
on an atomic compare-and-swap: see

    http://wiki.c2.com/?LockFreeSynchronization

for instance.

>I think what the inner loop wants to do is to perform some no-op
>for a while before it tries again to acquire the spinlock.

Yes -- but the point is that it tries to "gently" read the actual
mutex lock value, and inspect the result to see whether to try the
more-savage (at the hardware level) CAS.

Some of this gets deep into the weeds of hardware implementations.
I had this in my earlier reply (but ripped it out as too much
detail).
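To make that concrete, here is a rough sketch of the "gentle read,
then savage CAS" pattern in C11 atomics. This is my own toy code,
not the actual FreeBSD mtx implementation; the atomic operations
serve double duty here, acting as the compiler barriers *and*
emitting whatever hardware ordering the target needs:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy test-and-test-and-set spin lock; NOT the real mtx code. */
static atomic_bool lockword;

static void
spin_lock(atomic_bool *l)
{
	for (;;) {
		/*
		 * "Savage" attempt: an atomic read-modify-write,
		 * which demands the cache line exclusively.
		 * atomic_exchange returns the previous value, so
		 * "false" means the lock was free and is now ours.
		 */
		if (!atomic_exchange_explicit(l, true,
		    memory_order_acquire))
			return;
		/*
		 * "Gentle" inner loop: plain loads only, which can
		 * be satisfied from a Shared cache line, so the
		 * spinners generate no inter-cache bus traffic.
		 * (A real lock would insert a cpu-pause here.)
		 */
		while (atomic_load_explicit(l, memory_order_relaxed))
			;
	}
}

static void
spin_unlock(atomic_bool *l)
{
	atomic_store_explicit(l, false, memory_order_release);
}
```

Because these are _Atomic accesses, the compiler cannot hoist the
inner-loop load out of the loop the way it could with an ordinary
variable -- which is exactly the reordering hazard described above.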
On Intel dual-socket systems, for instance, there is a special bus
that connects the two sockets, called the QPI, and then there are
caches around each core within any one given socket. These caches
come in multiple levels (L1 and higher; details vary, and one
should always expect the *next* CPU to do something different),
with some caches physically local to one core and others shared
between multiple cores in one socket.

These caches tend to coordinate using protocols called MESI or
MOESI. The letters stand for cache line states: Modified, Owned,
Exclusive, Shared, or Invalid. A Modified cache line has data not
yet written to the next level out (whether that's a higher level,
larger cache, or main memory). An Exclusive line is in this cache
only and can therefore be written to. (I'm ignoring "Owned"; it is
kind of a tweak between M and E.) A Shared line is in this cache
*and some other cache* and therefore can *not* be written to, but
*can* be read from; and finally, an Invalid line has no valid data
at all.

As a rule, the closer a cache line is to the CPU, the faster its
access is. (This rule is pretty reliable, since otherwise there is
no point in having that cache.) *Writing* to a cache line requires
exclusive access, though, so we must know if the line is shared.
If it *is* shared, we must signal higher level caches that we
intend to write, and wait for them to give up their cached copies
of the data. In other words, we fire a bullet at them: "I want
exclusive access, kill off your shared copy." Then we must wait
for a reply, or a time delay (whichever is architecturally
appropriate), so that we know this succeeded, or get a failure
indication ("you may not have that exclusively, not just yet
anyway"). This reply or delay can take as long as the *worst
case*, slowest access time. For dual-socket Intel that means doing
a QPI bus transaction to the other socket.

(This is true for any write operation, not just compare-and-swap.
For this reason, we often like our mutexes to fill out a complete
cache line, so that any data *protected by* the mutex is not shot
out of the cache every time we poke at the mutex itself.)

Note that when we *read* the object, however, we're doing a read,
not a write. This does not need exclusive access to the cache
line: shared access suffices. If we do not have the data in cache,
we send out a request: "I would like to read this." Any CPU that
has the item cached -- e.g., whoever actually locked the lock --
must drop it back to whatever level accomplishes sharing (if it's
dirty, writing it out) and take its core-side cache line status
back from M or E to S. Any other CPU also spinning, waiting for
the lock, must go to this shared state. Now all CPUs interested in
the lock -- the holder, and all the waiters -- have it shared.
They can all *read* the line and see whether it's still locked.
There is no traffic over inter-cache or inter-socket busses at
this point. These are the "gentle" spins, that only *read* the
lock.

Eventually, whoever owns the lock unlocks it. This requires a
write (or CAS) operation, which yanks the cache line away from all
the spinners (that part is unavoidable, and slow, and causes all
the bus traffic we were avoiding) and releases the lock. The
spinners then take the cache line back to shared state and see
that they *may* be able to get the lock. At this point they
attempt the expensive operation, and *that* produces a reliable
answer -- which may be "someone else beat us to the lock, so we go
back to the gentle spin code".

Note that some architectures do not have an actual
compare-and-swap instruction. For instance, PowerPC and MIPS use a
different technique: there is a load instruction ("load linked" on
MIPS, "load and reserve" on PowerPC) that takes the cache line to
exclusive state *and* sets an internal CPU register to remember
this.
If the cache line drops back out of exclusive state, a subsequent
"store conditional" instruction fails the condition, does *not*
store, and lets you branch to a loop that repeats the load if
needed. If the line is still exclusive, the write succeeds (and
the cache line goes to M state, if the cache is write-back). This
lets you build a compare-and-swap from the low-level cache-line
synchronization operations that are what the hardware uses anyway.
There is more on this, including x86 (64 bit Intel) and ARM
specifics, at:

    http://stackoverflow.com/q/151783/1256452

(These are not the only ways to implement synchronization. Some
more exotic architectures have special regions of physical memory
that can act transactionally, or that auto-increment upon load so
that each CPU can "take a ticket" and use a version of Lamport's
bakery algorithm; see

    https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

for details. However, the BSD mtx_lock() is designed around
compare-and-swap, plus optimizations for MESI cache
implementations.)

Chris
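P.S. For the curious, here is a quick sketch of the "take a
ticket" idea as a ticket lock -- a simple cousin of the bakery
algorithm -- in C11 atomics. Again, this is illustrative toy code
of mine, not anything from the FreeBSD tree:

```c
#include <stdatomic.h>

/*
 * Toy ticket lock. Each locker atomically takes the next ticket;
 * the lock belongs to whoever's ticket matches now_serving, so
 * waiters are served in FIFO order and spin with loads only.
 */
struct ticket_lock {
	atomic_uint next_ticket;	/* next ticket to hand out */
	atomic_uint now_serving;	/* ticket currently allowed in */
};

static void
ticket_lock(struct ticket_lock *t)
{
	/* The fetch_add is the one contended read-modify-write. */
	unsigned int me = atomic_fetch_add_explicit(&t->next_ticket,
	    1, memory_order_relaxed);

	/* After that, only gentle reads until our number comes up. */
	while (atomic_load_explicit(&t->now_serving,
	    memory_order_acquire) != me)
		;			/* spin */
}

static void
ticket_unlock(struct ticket_lock *t)
{
	/* Only the holder writes now_serving, so load+store is safe. */
	atomic_store_explicit(&t->now_serving,
	    atomic_load_explicit(&t->now_serving,
	    memory_order_relaxed) + 1, memory_order_release);
}
```

Note that each locker does exactly one read-modify-write, no
matter how contended the lock is; everything else is reads, which
the MESI machinery can satisfy from Shared lines as described
above.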