From owner-freebsd-hackers@freebsd.org Sun Oct 1 18:09:50 2017
Date: Sun, 1 Oct 2017 21:09:43 +0300
From: Konstantin Belousov
To: Andreas Longwitz
Cc: freebsd-hackers@freebsd.org
Subject: Re: double fault on 10.3-Stable i386 during installworld
Message-ID: <20171001180943.GO95911@kib.kiev.ua>
References: <59D11664.1060206@incore.de>
In-Reply-To: <59D11664.1060206@incore.de>

On Sun, Oct 01, 2017 at 06:23:00PM +0200, Andreas Longwitz wrote:
> Hello hackers,
>
> On a server running 10.3-STABLE #2 r317936 i386 I got a double fault
> exception after a fresh boot in single user mode during installworld. I
> try to understand the magic of double fault and need help with this.
>
> The server did run with 8-Stable for many years without any problems and
> now has run fine also with 10.3 for a week until the double fault
> occurred. The server has 4GB, one disk (amr) and I use UFS with SU. In
> the kernel for 10.3 I have NKPT=36 (default is 30) because of the
> problem I have described in Bug 216606.
>
> As far as I understand the debug output from console and the kerneldump,
> the CPU 2 gets the DOUBLE_FAULT trap (0x17). Therefore CPU 2 sends the
> ipi message T_NMI to the other CPU's, so they run the cpustop handler.
>
> I would like to know the reason for the DOUBLE_FAULT.
>
> What was the first exception ?
>
> My understanding is that during handling of the first exception there
> was another exception, what was this second exception ?
The CPU does not report this information.

It seems that there was a page fault, because ddb shows the interrupted
frame as executing the first instruction at the Xpage label. Note that
the page fault was not the reason for the double fault (this fault was
_not_ the 'first exception' in your terms). The first instruction of any
trap handler is a pushl. Executing this instruction in the page fault
handler caused a trap, and the attempt to report that trap caused
another trap.

The initial trap and the secondary trap can only be determined by
looking at the dumped CPU state.

>
> The second exception was "changed" (by hardware ?) to DOUBLE_FAULT, is
> that correct ?
No, see above.
A double fault is not related to nested faults; it happens atomically at
the ISA (instruction set architecture) level. For instance, destroyed
page tables or a destroyed IDT may result in it.

A common reason for a double fault is stack overflow, since we allocate
a guard page at the bottom of the kernel stack. The attempt to handle
a trap by pushing %eflags/%cs/%eip then results in a page fault, which
is the definition of a double fault. But this is most likely not your
case, because the stack depth is relatively small.

>
> Output serial console:
>
> Fatal double fault:
> eip = 0xc0bacac8
> esp = 0xe437f000
> ebp = 0xe437fafc
> cpuid = 2; apic id = 06
> panic: double fault
> cpuid = 2
> KDB: stack backtrace:
> db_trace_self_wrapper(c0cc3575,c0745e1b,10000000,fc,c0d2b220,...) at
> db_trace_self_wrapper+0x2d/frame 0xc0e62c50
> kdb_backtrace(c0cf2f2c,2,c0cf3c11,c0e62d0c,2,...) at
> kdb_backtrace+0x30/frame 0xc0e62cb8
> vpanic(c0cf3c11,c0e62d0c,c0e62d0c,c0e62d24,c0bc2bab,...) at
> vpanic+0x11b/frame 0xc0e62cec
> panic(c0cf3c11,6,6,2,e437fafc,...) at panic+0x1b/frame 0xc0e62d00
> dblfault_handler() at dblfault_handler+0xab/frame 0xc0e62d00
> --- trap 0x17, eip = 0xc0bacac8, esp = 0xe437f000, ebp = 0xe437fafc ---
> Xpage(c7ebd000,0,608,c08d12a5,c7ebd000,...) at Xpage/frame 0xe437fafc
> mi_switch(608,0,c0cc08f4,e2,c7ebd000,...) at mi_switch+0x145/frame
> 0xe437fb34
> critical_exit(c7ebd000,0,2) at critical_exit+0x89/frame 0xe437fb50
> ipi_bitmap_handler(8,28,28,c7f83200,0,...) at
> ipi_bitmap_handler+0x6b/frame 0xe437fb70
> Xipi_intr_bitmap_handler() at Xipi_intr_bitmap_handler+0x3d/frame 0xe437fb70
> --- interrupt, eip = 0xc0ba87e5, esp = 0xe437fbb8, ebp = 0xe437fbb8 ---
> acpi_cpu_c1(16600000,16,a65c2a4c,0,c90786c0,...) at
> acpi_cpu_c1+0x5/frame 0xe437fbb8
> acpi_cpu_idle(ffffffff,ffffffff,ffffffff,e437fc28,c0bb0e3a,...) at
> acpi_cpu_idle+0x15a/frame 0xe437fbf8
> cpu_idle_acpi(ffffffff,ffffffff,c0e43484,c0e43488,c0e43494,...) at
> cpu_idle_acpi+0x3f/frame 0xe437fc0c
> cpu_idle(1,e437fc78,c0cc1cf3,a4c,0,...) at cpu_idle+0x9a/frame 0xe437fc28
> sched_idletd(0,e437fce8,0,0,0,...) at sched_idletd+0x1dd/frame 0xe437fca4
> fork_exit(c08f67e0,0,e437fce8) at fork_exit+0xa3/frame 0xe437fcd4
> fork_trampoline() at fork_trampoline+0x8/frame 0xe437fcd4
> --- trap 0, eip = 0, esp = 0xe437fd20, ebp = 0 ---
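
To put a rough number on the stack depth, the addresses in the quoted
console output can simply be subtracted. The following is only a sketch
in Python, not something taken from the kernel or from ddb. It assumes
that the esp on the final "trap 0" line is close to the top of the idle
thread's kernel stack and that pages are 4 KiB; the variable names are
made up for illustration.

```python
# Addresses copied from the console output quoted above.
FAULT_EBP = 0xE437FAFC         # ebp at the double fault (the Xpage frame)
FAULT_ESP = 0xE437F000         # esp reported by the double fault handler
THREAD_ENTRY_ESP = 0xE437FD20  # esp on the final "trap 0" line

# Assumption: i386 with 4 KiB pages.
PAGE_SIZE = 4096

# Assumption: THREAD_ENTRY_ESP approximates the top of the thread's stack,
# so the distance down to the Xpage frame approximates the stack depth.
depth_bytes = THREAD_ENTRY_ESP - FAULT_EBP

# Plain arithmetic on the reported esp; no interpretation is attached here.
esp_page_aligned = FAULT_ESP % PAGE_SIZE == 0

print(f"approximate stack depth: {depth_bytes} bytes ({depth_bytes:#x})")
print(f"reported esp is page-aligned: {esp_page_aligned}")
```

With the values above the depth comes out as 548 bytes (0x224), well
under one page, which is consistent with the stack depth being small.
The script only does the arithmetic; it does not explain why the
reported esp lies on a page boundary.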