Date: Sun, 9 Apr 2017 19:16:26 -0700 (PDT)
From: Chris Torek <torek@elf.torek.net>
To: rysto32@gmail.com, vasanth.raonaik@gmail.com
Cc: ablacktshirt@gmail.com, ed@nuxi.nl, freebsd-hackers@freebsd.org
Subject: Re: Understanding the FreeBSD locking mechanism
Message-ID: <201704100216.v3A2GQ2s032227@elf.torek.net>
In-Reply-To: <CAFMmRNzOypqsBam2BfaFm+pX7hSYoEvB2oFtec8OtH6D=s9yTw@mail.gmail.com>
Ryan Stone is of course correct here.  I have not kept up with the
latest terminology shifts, but I can describe a bit of the history of
all of this.  (I was somewhat involved with writing the original
MTX_DEF and MTX_SPIN mutex code way back when.)  (None of the rest of
this should be new to experienced kernel hackers.)

In the old non-SMP days, BSD, like traditional V6 Unix, divided the
kernel into "top half" and "bottom half" sections.  The top half was
anything driven from something other than an interrupt, such as
initial bootstrap or any user-sourced system call.  Each of these had
just one (per-process) kernel stack, in the "u. area", which was
UPAGES * NBPG (number of bytes per page) bytes long, but also had to
contain "struct user".  (In other words, the stack space available was
actually smaller than that.  The "user" struct was *above* the kernel
stack, so that ksp would not grow down into the structure; there was
also signal trampoline code wedged in there, at least on the VAX and
some of the other early ports.  I desperately wanted to move the
trampoline code to libc for the sparc port.  It was *in theory* easy
to do this :-) ... practice was another matter.)

When an interrupt arrived, as long as it was not interrupting another
interrupt, the system would get on a separate "interrupt stack" --
some hardware supports this directly, with a separate interrupt stack
register -- which meant we did not have to provide enough
interrupt-handling space in the per-process kernel stack, nor take
interrupts on some possibly dodgy user stack.  (Interrupts can occur
at any time, so the system may be running user code, not kernel code.)
It also meant that a simple:

	s = splfoo();

call in the top half would block any interrupts at priority foo or
lower, so that "top half" code could know that "bottom half" code for
foo would not run at this point.
With prioritized interrupts, *taking* an interrupt at level foo
automatically raised the CPU's priority to foo, so any "bottom half"
code for foo would know that no important "top half" code was running
at the time -- if that had been the case, the top half would have done
an splfoo() to block it -- and of course no other "bottom half" code
for foo could run now.

Meanwhile, no bottom-half code was *ever* allowed to block.  When you
took an interrupt, you were committed to finishing all of the work for
that interrupt before issuing a "return from interrupt" instruction
(which would, if the interrupt was not interrupting another interrupt,
switch back to the appropriate non-interrupt stack).

A good way to describe this strategy:

	s = splfoo();	/* block all bottom half code at this priority */
	...
	splx(s);	/* resume ability of blocked bottom half code to run */

is that spl (set priority level) provides mutual exclusion to *code
paths*.  The top half blocks out the bottom half with an spl, and the
bottom half blocks out the top half by simply *being* bottom-half
code, handling interrupts.

-----

With SMP, this whole strategy is a non-starter.  We don't have just
one CPU running code; we cannot block *code* paths at all.  Instead,
we switch to mutually exclusive access to *data*.  We then make
several observations:

 * Most data structures are mostly uncontested.  (If not, we need to
   redesign them!)  "Get lock" should be fast in this usual case.

 * If we provide what used to be "bottom half" drivers with *their
   own* stacks / interrupt threads ("ithreads"), they can block if
   they need to: when the data structure *is* actually contested.

This means we mainly need just one kind of mutex.  For read-mostly
data structures, we would like to have shared-read locks and so on,
but we can build them on this base mutex.  (As it turns out, this
view is a little simplistic; we want to build them on a base
compare-and-swap, typically, rather than a base full-blown mutex.
It would also be nice to have CAS-based multi-producer
single-consumer and single-producer multi-consumer queues.  These are
particularly useful on systems with hundreds or thousands of cores.)

Of course, we also have to start dealing with issues like priority
inversion and lock order / possible deadlock, any time we lock data
instead of code paths.  But that's mostly a separate issue.

-----

This is all fine for most code, including most device drivers, but
the code that manipulates the hardware itself, the lowest-level
interrupt dispatching code, and the system scheduler still must block
interrupts from time to time.  We also have some special oddball
cases, such as profiling interrupts, where we want to know: "What was
running when the interrupt fired?"  For these cases we *don't* want
to switch to a separate interrupt thread:

 * In the hardware interrupt dispatcher, we may not *know which
   thread to switch to* yet.  We must find the right ithread, then
   schedule it to run.  (Then we have to manipulate the hardware
   based on whether the interrupt is edge or level triggered, and so
   on, but that's an in-the-weeds detail also mostly unrelated to
   this scheduling.)

   For the profiling "what was running" case, we'd like to sample
   what was running, which we *can't* do from a separate thread: we
   need access to the current stack.  (Strictly speaking, we merely
   need said access ... but we also need that thread to remain paused
   while we sample it.)  And, for some low-cost paths such as
   gathering entropy, we may not want or need to *pay* the up-front
   cost of a separate ithread.

 * In the scheduler, we're either in the process of choosing threads
   and changing stacks, or setting up data structures to tell the
   chooser which threads to choose.  We need to block all scheduling
   events, including all interrupts, for some of these super-critical
   sections.
These use the MTX_SPIN type lock, which is similar to MTX_DEF, but:

 * does block interrupts, and

 * never *invokes* the scheduler: never tries to put the current
   thread out of the running until the locked data are available.

Since then, we have added another special case:

 * In a "critical section", we wish to make sure that the current
   thread does not migrate from one CPU to another.  This does not,
   strictly speaking, require blocking interrupts entirely, but
   because the scheduler does its thing by blocking interrupts, we
   block interrupts for short durations here as well (actually when
   *leaving* the critical section, where we check to see if the
   scheduler would *like* us to migrate).

This is not really a mutex at all, but it does interact with them, so
it's worth mentioning.  Essentially, if you are in a critical
section, you may not switch threads, so if you need a mutex, you must
use a spin mutex.  (This *is* well-documented in "man 9
critical_enter".)  One might argue that being in a critical section
should turn an MTX_DEF mutex into an MTX_SPIN mutex, but it's not
that easy to do, and if you're taking a possibly slow MTX_DEF lock
*while* in a critical section, "you're doing it wrong" (as with the
heavily contested datum problem, we should rewrite the code so that
the critical section happens *while* holding the MTX_DEF object, not
the other way around).

Anyway, that's how we got here, and why things are the way they are.

Chris