Date: Mon, 10 Apr 2017 10:11:37 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Chris Torek
Cc: rysto32@gmail.com, vasanth.raonaik@gmail.com, freebsd-hackers@freebsd.org, ed@nuxi.nl, ablacktshirt@gmail.com
Subject: Re: Understanding the FreeBSD locking mechanism
Message-ID: <20170410071137.GH1788@kib.kiev.ua>
In-Reply-To: <201704100216.v3A2GQ2s032227@elf.torek.net>
On Sun, Apr 09, 2017 at 07:16:26PM -0700, Chris Torek wrote:
> In the old non-SMP days, BSD, like traditional V6 Unix, divided
> the kernel into "top half" and "bottom half" sections.  The top
> half was anything driven from something other than an interrupt,
> such as initial bootstrap or any user-sourced system call.  Each
> of these had just one (per-process) kernel stack, in the "u.
> area", which was UPAGES * NBPG (number of bytes per page) bytes
> long, but also had to contain "struct user".
>
> (In other words, the stack space available was actually smaller
> than that.  The "user" struct was *above* the kernel stack, so
> that ksp would not grow down into the structure; there was also
> signal trampoline code wedged in there, at least on the VAX and
> some of the early other ports.  I desperately wanted to move the
> trampoline code to libc for the sparc port.  It was *in theory*
> easy to do this :-) ... practice was another matter.)

Signal trampolines were never placed on the kernel stack, simply
because the uarea/kstack is not accessible from user space.  They
lived at the top of the user-mode stack of the main thread.
Currently on x86/powerpc/arm, signal trampolines are mapped from
the 'shared page', which was done to allow marking the user stack
as non-executable.

The kstack still contains the remnants of the uarea, renamed to the
(per-thread) pcb.  There is not much sense in the split between
struct thread and struct pcb, but it has survived historically up
to this moment, and cleaning things up would require too much MD
work.  My opinion is that the pcb on the kstack indeed only eats
space and would better be put into td_md.
Yet another thing shared with the kstack is the usermode FPU save
area on x86 and arm64.  At least on x86, the save area is
dynamically sized at boot to support extensions like
AVX/AVX256/AVX512 etc., and chomping part of the kstack saves one
more contiguous KVA allocation and allows reuse of the kstack
cache.  Again historically, pre-AVX kernels put the XMM save area
into the pcb, on the kstack.

> When an interrupt arrived, as long as it was not interrupting
> another interrupt, the system would get on a separate "interrupt
> stack" -- some hardware supports this directly, with a separate
> interrupt stack register -- which meant we did not have to provide
> enough interrupt-handling space in the per-process kernel stack,
> nor take interrupts on some possibly dodgy user stack.
> (Interrupts can occur at any time, so the system may be running
> user code, not kernel code.)

No, this is not the case, at least on x86.  There, 'normal'
interrupts and exceptions reuse the current thread's kstack, thus
participating in the common stack overflow business.  On i386, only
the NMI and double fault exceptions are routed through task gates
in the IDT and are provided with a separate stack (a double fault
almost always indicates a stack overflow).  On amd64, TSS switching
is impossible, but IDT descriptors may be marked with a non-zero
IST, which basically references some static stack besides the
kstack.  Only the NMI uses an IST.
A critical section prevents de-scheduling of the current thread,
disabling any context switch on the current CPU.  It works by
incrementing the current thread's td_critnest counter.  Note that
interrupts are still enabled while a critical section is held, so
the flow of control can still be 'preempted' by an interrupt, but
after return from the interrupt the current thread continues to
execute.  If any higher-priority thread needs to be scheduled due
to the interrupt, the scheduling and context switch are done after
td_critnest returns to zero.

> This is not really a mutex at all, but it does interact with
> them, so it's worth mentioning.  Essentially, if you are in a
> critical section, you may not switch threads, so if you need
> a mutex, you must use a spin mutex.

You probably mixed up critical_enter() and spinlock_enter() there.
The latter indeed disables interrupts and is intended to be used as
part of the spinlock (spin mutex) implementation.

> (This *is* well-documented in "man 9 critical_enter".)

The explanation in critical_enter(9) is somewhat misleading.  The
consequences of a spinlock_enter() call include most of the
side effects of critical_enter(), because interrupts are disabled
by the latter and thus a context switch cannot occur at all.
Spinlocks do not technically enter a critical section, i.e.
td_critnest is not incremented.