Date:      Sun, 1 Oct 2017 21:09:43 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Andreas Longwitz <longwitz@incore.de>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: double fault on 10.3-Stable i386 during installworld
Message-ID:  <20171001180943.GO95911@kib.kiev.ua>
In-Reply-To: <59D11664.1060206@incore.de>
References:  <59D11664.1060206@incore.de>

On Sun, Oct 01, 2017 at 06:23:00PM +0200, Andreas Longwitz wrote:
> Hello hackers,
> 
> On a server running 10.3-STABLE #2 r317936 i386 I got a double fault
> exception after a fresh boot in single user mode during installworld. I
> try to understand the magic of double fault and need help with this.
> 
> The server did run with 8-Stable for many years without any problems and
> now has run fine also with 10.3 for a week until the double fault
> occurred. The server has 4GB, one disk (amr) and I use UFS with SU. In
> the kernel for 10.3 I have NKPT=36 (default is 30) because of the
> problem I have described in Bug 216606.
> 
> As far as I understand the debug output from console and the kerneldump,
> the CPU 2 gets the DOUBLE_FAULT trap (0x17). Therefore CPU 2 sends the
> ipi message T_NMI to the other CPUs, so they run the cpustop handler.
> 
> I would like to know the reason for the DOUBLE_FAULT.
> 
> What was the first exception ?
> 
> My understanding is that during handling of the first exception there
> was another exception, what was this second exception ?
The CPU does not report this information.

It seems that there was a page fault, because ddb shows the interrupted
frame as executing the first instruction at the Xpage label. Note that
the page fault was not the reason for the double fault (this fault was
_not_ the 'first exception' in your terms).

The first instruction of any trap handler is pushl <trap code>.
Executing this instruction in the page fault handler caused a trap, and
the attempt to report that trap caused another trap. The initial trap
and the secondary trap can only be inferred by looking at the dumped
CPU state.

> 
> The second exception was "changed" (by hardware ?) to DOUBLE_FAULT, is
> that correct ?
No, see above.  A double fault is not related to nested faults; it
happens atomically at the ISA (instruction set architecture) level.  For
instance, destroyed page tables or a destroyed IDT may result in it.
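
To make the ISA-level rule a bit more concrete, here is a rough sketch
of the exception classes as I understand the Intel architecture manuals
to describe them: the CPU raises a double fault when delivering one
exception itself raises a second one of a conflicting class, before the
first handler has done any work. The names exc_class, classify() and
escalates_to_double_fault() are made up for illustration and are not
kernel code; the vector list is limited to the classic i386 vectors.

```c
#include <assert.h>

/* Illustrative classes; names are hypothetical, not from any kernel. */
enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGE_FAULT };

/* Classify a classic i386 exception vector (0..31). */
static enum exc_class
classify(int vector)
{
	assert(vector >= 0 && vector <= 31);
	switch (vector) {
	case 0:		/* divide error */
	case 10:	/* invalid TSS */
	case 11:	/* segment not present */
	case 12:	/* stack fault */
	case 13:	/* general protection */
		return (EXC_CONTRIBUTORY);
	case 14:	/* page fault */
		return (EXC_PAGE_FAULT);
	default:
		return (EXC_BENIGN);
	}
}

/*
 * Return 1 if an exception 'second', raised while the CPU is still
 * delivering 'first', is escalated to a double fault; 0 if the two
 * are handled one after the other.  A fault during delivery of the
 * double fault itself (triple fault) is not modelled here.
 */
static int
escalates_to_double_fault(int first, int second)
{
	enum exc_class a = classify(first);
	enum exc_class b = classify(second);

	if (a == EXC_CONTRIBUTORY)
		return (b == EXC_CONTRIBUTORY);
	if (a == EXC_PAGE_FAULT)
		return (b != EXC_BENIGN);
	return (0);
}
```

For example, escalates_to_double_fault(14, 14) is 1: a page fault
raised while delivering a page fault is reported as a double fault.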

A common reason for a double fault is stack overflow, since we allocate
a guard page at the bottom of the kernel stack.  The attempt to handle
a trap by pushing %eflags/%cs/%eip then results in a page fault, which
is the definition of a double fault.  But this is most likely not your
case, because the stack depth is relatively small.
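
As a back-of-the-envelope check of the stack depth, the sketch below
subtracts the Xpage frame address from the esp printed on the last
line of your backtrace. Treating that last esp as (roughly) the top of
the thread's kernel stack is my assumption, and the helper names and
the guard_end parameter are hypothetical, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes pushed for a trap as listed above: %eflags, %cs, %eip. */
#define	HW_FRAME_BYTES	(3u * (uint32_t)sizeof(uint32_t))

/* The stack grows down, so depth is top minus the deepest frame. */
static uint32_t
stack_depth(uint32_t top, uint32_t frame)
{
	assert(top >= frame);
	return (top - frame);
}

/*
 * Return 1 if pushing the hardware frame at 'esp' would write below
 * 'guard_end', the first usable byte above the guard page
 * (a hypothetical parameter; the real value is not in the dump output).
 */
static int
push_hits_guard(uint32_t esp, uint32_t guard_end)
{
	return ((uint64_t)esp < (uint64_t)guard_end + HW_FRAME_BYTES);
}
```

With the addresses from the backtrace, stack_depth(0xe437fd20,
0xe437fafc) gives 0x224, i.e. 548 bytes between the Xpage frame and the
esp of the final "trap 0" line.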

> 
> Output serial console:
> 
> Fatal double fault:
> eip = 0xc0bacac8
> esp = 0xe437f000
> ebp = 0xe437fafc
> cpuid = 2; apic id = 06
> panic: double fault
> cpuid = 2
> KDB: stack backtrace:
> db_trace_self_wrapper(c0cc3575,c0745e1b,10000000,fc,c0d2b220,...) at
> db_trace_self_wrapper+0x2d/frame 0xc0e62c50
> kdb_backtrace(c0cf2f2c,2,c0cf3c11,c0e62d0c,2,...) at
> kdb_backtrace+0x30/frame 0xc0e62cb8
> vpanic(c0cf3c11,c0e62d0c,c0e62d0c,c0e62d24,c0bc2bab,...) at
> vpanic+0x11b/frame 0xc0e62cec
> panic(c0cf3c11,6,6,2,e437fafc,...) at panic+0x1b/frame 0xc0e62d00
> dblfault_handler() at dblfault_handler+0xab/frame 0xc0e62d00
> --- trap 0x17, eip = 0xc0bacac8, esp = 0xe437f000, ebp = 0xe437fafc ---
> Xpage(c7ebd000,0,608,c08d12a5,c7ebd000,...) at Xpage/frame 0xe437fafc
> mi_switch(608,0,c0cc08f4,e2,c7ebd000,...) at mi_switch+0x145/frame
> 0xe437fb34
> critical_exit(c7ebd000,0,2) at critical_exit+0x89/frame 0xe437fb50
> ipi_bitmap_handler(8,28,28,c7f83200,0,...) at
> ipi_bitmap_handler+0x6b/frame 0xe437fb70
> Xipi_intr_bitmap_handler() at Xipi_intr_bitmap_handler+0x3d/frame 0xe437fb70
> --- interrupt, eip = 0xc0ba87e5, esp = 0xe437fbb8, ebp = 0xe437fbb8 ---
> acpi_cpu_c1(16600000,16,a65c2a4c,0,c90786c0,...) at
> acpi_cpu_c1+0x5/frame 0xe437fbb8
> acpi_cpu_idle(ffffffff,ffffffff,ffffffff,e437fc28,c0bb0e3a,...) at
> acpi_cpu_idle+0x15a/frame 0xe437fbf8
> cpu_idle_acpi(ffffffff,ffffffff,c0e43484,c0e43488,c0e43494,...) at
> cpu_idle_acpi+0x3f/frame 0xe437fc0c
> cpu_idle(1,e437fc78,c0cc1cf3,a4c,0,...) at cpu_idle+0x9a/frame 0xe437fc28
> sched_idletd(0,e437fce8,0,0,0,...) at sched_idletd+0x1dd/frame 0xe437fca4
> fork_exit(c08f67e0,0,e437fce8) at fork_exit+0xa3/frame 0xe437fcd4
> fork_trampoline() at fork_trampoline+0x8/frame 0xe437fcd4
> --- trap 0, eip = 0, esp = 0xe437fd20, ebp = 0 ---



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20171001180943.GO95911>