From: John Baldwin
To: arch@freebsd.org
Subject: Starting APs earlier during boot
Date: Tue, 16 Feb 2016 12:50:22 -0800
Message-ID: <1730061.8Ii36ORVKt@ralph.baldwin.cx>

Currently the kernel bootstraps the non-boot processors (APs) fairly early in the SI_SUB_CPU SYSINIT. The APs then spin waiting to be "released". We currently release the APs as one of the last steps at SI_SUB_SMP. On the one hand this removes much of the need for synchronization while SYSINITs are running, since SYSINITs basically assume they are single-threaded. However, it also enforces some odd quirks. Several places that deal with per-CPU resources have to split initialization up so that the BSP init happens in one SYSINIT and the initialization of the APs happens in a second SYSINIT at SI_SUB_SMP.

Another issue that is becoming more prominent on x86 (and probably will also affect other platforms if it doesn't already) is that, to support working interrupts for interrupt config hooks, we bind all interrupts to the BSP during boot and only distribute them among the other CPUs near the end at SI_SUB_SMP. This is especially problematic with drivers for modern hardware allocating num(CPUs) interrupts (hoping to use one per CPU). On x86 we have about 190 IDT vectors available for device interrupts on each CPU, so in theory we should be able to tolerate a lot of drivers doing this (e.g. 60 drivers could allocate 3 interrupts for every CPU and we should still be fine). However, if you have, say, 32 cores in a system, then each such driver needs 32 vectors on CPU 0 during boot, so you can only handle about 5 drivers doing this before you run out of vectors on CPU 0.

Longer term we would also like to have most drivers attach in the same environment during boot as during post-boot. Right now post-boot is quite different as all CPUs are running, interrupts work, etc. One of the goals of multipass support for new-bus is to help us get there by probing enough hardware to get timers working and starting the scheduler before probing the rest of the devices.
That goal isn't quite realized yet. However, we can run a slightly simpler version of our scheduler before timers are working. In fact, sleep/wakeup work just fine fairly early (we allocate the necessary structures at SI_SUB_KMEM, which is before the APs are even started). Once idle threads are created and ready we could in theory let the APs start up and run other threads. You just don't have working timeouts. OTOH, you can sort of simulate timeouts if you modify the scheduler to yield the CPU instead of blocking the thread for a sleep with a timeout. The effect would be for threads that do sleeps with a timeout to fall back to polling before timers are working. In practice, all of the early kernel threads use sleeps without timeouts when idle, so this doesn't really matter.

I've implemented these changes and tested them for x86. For x86 at least, AP startup needed some bits of the interrupt infrastructure in place, so I moved SI_SUB_SMP up to after SI_SUB_INTR but before SI_SUB_SOFTINTR. I modified the *sleep() and cv_*wait*() routines to not always bail if cold is true. Instead, sleeps without a timeout are permitted to sleep "normally". Sleeps with a timeout drop their interlock and yield the CPU (but remain runnable).

Since the APs are now fully running, interrupts are routed to all CPUs from the get-go, removing the need for the post-boot shuffle. This also resolves the issue of running out of IDT vectors on the boot CPU.

I believe that adapting other platforms to this change should be relatively simple, but we should do that before committing the full patch. I do think that some parts of the patch (such as the changes to the sleep routines, and using SI_SUB_LAST instead of SI_SUB_SMP as a catch-all SYSINIT) can be committed now without breaking anything. However, I'd like feedback on the general idea, and if it is acceptable I'd like to coordinate testing with other platforms so this can go into the tree.
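
As a rough illustration of the sleep policy described above, here is a small userland model in C. It is only a sketch and is not taken from the ap_startup branch: the enum and the function names (sleep_action, early_sleep_action, sleep_action_name) are hypothetical, and the real *sleep() and cv_*wait*() changes are of course more involved. The only inputs it models are whether cold is still true and whether the caller asked for a timeout.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userland model of the early-boot sleep policy.
 * None of these names exist in the kernel or in the patch.
 */
enum sleep_action {
	SLEEP_BLOCK,	/* block normally until a wakeup arrives */
	SLEEP_YIELD,	/* drop the interlock, yield the CPU, stay runnable */
	SLEEP_TIMED	/* block with a real timeout (timers are working) */
};

/*
 * Decide what a sleep request should do.
 * cold:        true while the system is still in early boot.
 * has_timeout: true if the caller requested a sleep with a timeout.
 */
enum sleep_action
early_sleep_action(bool cold, bool has_timeout)
{
	/* Sleeps without a timeout never need timers, so they may block. */
	if (!has_timeout)
		return (SLEEP_BLOCK);
	/* Timers are not working yet: fall back to polling by yielding. */
	if (cold)
		return (SLEEP_YIELD);
	/* Post-boot: timeouts work, so sleep as usual. */
	return (SLEEP_TIMED);
}

/* Human-readable description of an action, for logging or debugging. */
const char *
sleep_action_name(enum sleep_action action)
{
	switch (action) {
	case SLEEP_BLOCK:
		return ("block until wakeup");
	case SLEEP_YIELD:
		return ("drop interlock and yield");
	case SLEEP_TIMED:
		return ("block with timeout");
	}
	assert(0 && "unknown sleep_action");
	return ("unknown");
}
```

In this model a caller that passes cold = true and has_timeout = true gets SLEEP_YIELD, which corresponds to the polling fallback; every request without a timeout gets SLEEP_BLOCK regardless of cold.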

The current changes are in the 'ap_startup' branch at github/bsdjhb/freebsd. You can view them here:

https://github.com/bsdjhb/freebsd/compare/master...bsdjhb:ap_startup

-- 
John Baldwin