From: Chris Torek <torek@elf.torek.net>
Date: Tue, 11 Apr 2017 16:11:04 -0700 (PDT)
To: ablacktshirt@gmail.com, imp@bsdimp.com
Cc: ed@nuxi.nl, freebsd-hackers@freebsd.org, rysto32@gmail.com
Subject: Re: Understanding the FreeBSD locking mechanism
Message-Id: <201704112311.v3BNB4fc094085@elf.torek.net>
In-Reply-To: <4768e26a-cdec-6f40-1463-ece9847ca34d@gmail.com>
List-Id: Technical Discussions relating to FreeBSD

>The difference between the "ithread" and "interrupt filter" things
>is that ithread has its own thread context, while interrupt handling
>through
>interrupt filter shares the same kernel stack.

Right -- though rather than "the same" I would just say "shares a
stack", i.e., we're not concerned with *whose* stack and/or thread
we're borrowing, just that we have one borrowed.

>So, for ithread, we should use the MTX_DEF, which don't disable
>interrupt, and for "interrupt filter", we should use the MTX_SPIN,
>which disable interrupt.

Right.

>What really confuses me is that I don't really see how owning an
>"independent" thread context(i.e ithread) makes a thread run in the
>"top-half" and how sharing the same kernel stack makes a thread run
>in the "bottom-half".

It's not that it *makes* it run that way, it's that it *allows* it to
run that way -- and then the scheduler *does* run it that way.

>I did read your long explanation in the previous mail. For the
>non-SMP case, the "top-half/bottom-half" model goes well and I
>understand how the *code* path/*data* path things go. But I cannot
>still fully understand the model for the SMP case.

It's fundamentally fairly tricky, but we start with that same first
notion:

 * If you have your own state (i.e., stack), you can be suspended
   (stopped in the scheduler, giving the CPU to other threads):
   *your* (private) state is preserved on *your* (private) stack.

 * If you have borrowed someone else's state, anything that suspends
   you, suspends them too.  Since this may deadlock, you are not
   allowed to do it at all.

Once we block interrupts locally (as for MTX_SPIN, or automatically
inside a filter style or "bottom half" interrupt), we are in a
special state: we may not take *any* MTX_DEF locks at all (the kernel
should panic if we do).  This in turn means that data structures are
protected *either* by a spin mutex *or* by a default (non-spin)
mutex, never both.

So if you need to touch a spin-mutex data structure from thread-y
("top half") code, you obtain the spin mutex, and now no interrupts
will occur *on this CPU*, and as a key side effect, you won't move
*off* this CPU either.
If an interrupt occurs on another CPU and it goes to take the spin
lock that protects that data, it loops at that point, not switching
tasks, waiting for the MTX_SPIN mutex to be released:

          CPU 1               |          CPU 2
  ----------------------------|-----------------------------
  func() {                    | ... code not involving mtx
    mtx_lock_spin(&mtx);      |
    ... do some work          | mtx_lock_spin(&mtx); /* loops */
    .                         | [stuck]
    .                         | [stuck]
    .                         | [stuck]
    mtx_unlock_spin(&mtx);    | [unstuck]
    ...                       | do some work

If an interrupt occurs on CPU 2, and that interrupt-handling code
wants to touch the data protected by the spin lock, that code obtains
the spin lock as usual.  Meanwhile the interrupt *cannot* occur on
CPU 1, as holding the spin lock has blocked interrupts.  So the code
path on CPU 2 blocks -- looping in mtx_lock_spin(), not giving CPU 2
over to the scheduler -- for as long as CPU 1 holds the spin lock.
The corresponding code path is already blocked on CPU 1, the same way
it was back in the non-SMP, single-CPU days.

This means it is unwise to hold spin locks for long periods.  In
fact, if CPU 2 waits too long in that [stuck] section, it will panic,
on the assumption that CPU 1 has done something terrible and the
system is now hung.  This is also what gives rise to the constraint
that you must take MTX_SPIN locks "inside" any outer MTX_DEF locks.

Chris