Date: Mon, 10 Apr 2017 10:11:37 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Chris Torek
Cc: rysto32@gmail.com, vasanth.raonaik@gmail.com, freebsd-hackers@freebsd.org, ed@nuxi.nl, ablacktshirt@gmail.com
Subject: Re: Understanding the FreeBSD locking mechanism
Message-ID: <20170410071137.GH1788@kib.kiev.ua>
In-Reply-To: <201704100216.v3A2GQ2s032227@elf.torek.net>
On Sun, Apr 09, 2017 at 07:16:26PM -0700, Chris Torek wrote:
> In the old non-SMP days, BSD, like traditional V6 Unix, divided
> the kernel into "top half" and "bottom half" sections.  The top
> half was anything driven from something other than an interrupt,
> such as initial bootstrap or any user-sourced system call.  Each
> of these had just one (per-process) kernel stack, in the "u.
> area", which was UPAGES * NBPG (number of bytes per page) bytes
> long, but also had to contain "struct user".
>
> (In other words, the stack space available was actually smaller
> than that.  The "user" struct was *above* the kernel stack, so
> that ksp would not grow down into the structure; there was also
> signal trampoline code wedged in there, at least on the VAX and
> some of the early other ports.  I desperately wanted to move the
> trampoline code to libc for the sparc port.  It was *in theory*
> easy to do this :-) ... practice was another matter.)

Signal trampolines were never placed on the kernel stack, simply
because the uarea/kstack is not accessible from user space.  They
lived at the top of the user-mode stack of the main thread.
Currently on x86/powerpc/arm, signal trampolines are mapped from
the 'shared page', which was done to allow marking the user stack
as non-executable.

The kstack still contains the remnants of the uarea, renamed to the
(per-thread) pcb.  There is not much sense in the split between
struct thread and struct pcb, but it has survived historically up
to this moment, and cleaning things up would require too much MD
work.  My opinion is that the pcb on the kstack indeed only eats
space and would better be put into td_md.
Yet another thing shared with the kstack is the usermode FPU save
area on x86 and arm64.  At least on x86, the save area is
dynamically sized at boot to support extensions like
AVX/AVX256/AVX512 etc., and chomping part of the kstack saves one
more contiguous KVA allocation and allows reuse of the kstack
cache.  Again historically, pre-AVX kernels put the XMM save area
into the pcb, on the kstack.

> When an interrupt arrived, as long as it was not interrupting
> another interrupt, the system would get on a separate "interrupt
> stack" -- some hardware supports this directly, with a separate
> interrupt stack register -- which meant we did not have to provide
> enough interrupt-handling space in the per-process kernel stack,
> nor take interrupts on some possibly dodgy user stack.
> (Interrupts can occur at any time, so the system may be running
> user code, not kernel code.)

No, this is not the case, at least on x86.  There, 'normal'
interrupts and exceptions reuse the current thread's kstack, thus
participating in the common stack overflow business.  On i386, only
the NMI and double fault exceptions are routed through task gates
in the IDT and are provided with a separate stack (a double fault
almost always indicates a stack overflow).  On amd64, TSS switching
is impossible, but IDT descriptors may be marked with a non-zero
IST, which basically references some static stack besides the
kstack.  Only the NMI uses an IST.
A critical section prevents de-scheduling of the current thread,
disabling any context switch on the current CPU.  It works by
incrementing the current thread's td_critnest counter.  Note that
interrupts are still enabled while a critical section is held, so
the flow of control can still be 'preempted' by an interrupt, but
after return from the interrupt the current thread continues to
execute.  If any higher-priority thread needs to be scheduled due
to the interrupt, the scheduling and context switch are done after
td_critnest returns to zero.

> This is not really a mutex at all, but it does interact with
> them, so it's worth mentioning.  Essentially, if you are in a
> critical section, you may not switch threads, so if you need
> a mutex, you must use a spin mutex.

You probably mixed up critical_enter() and spinlock_enter() there.
The latter indeed disables interrupts and is intended to be used as
part of the spinlock (spin mutex) implementation.

> (This *is* well-documented in "man 9 critical_enter".)

The explanation in critical_enter(9) is somewhat misleading.  The
consequences of a spinlock_enter() call include most of the
side effects of critical_enter(), because interrupts are disabled
by the latter and thus a context switch cannot occur at all.
Spinlocks do not technically enter a critical section, i.e.
td_critnest is not incremented.