From: Andreas Longwitz
Date: Sun, 05 Nov 2017 17:24:14 +0100
To: Konstantin Belousov
CC: freebsd-hackers@freebsd.org
Subject: Re: double fault on 10.3-Stable i386 during installworld
Message-ID: <59FF3B2E.5010603@incore.de>
In-Reply-To: <20171101092619.GJ2566@kib.kiev.ua>
References: <59D11664.1060206@incore.de> <20171001180943.GO95911@kib.kiev.ua> <59F910C5.8020709@incore.de> <20171101092619.GJ2566@kib.kiev.ua>

Thanks for the answer. I am now sure that the reason for the double fault
is not a FreeBSD problem, it is a CPU problem.

>> On the stack we have
>>
>> 0xe437faa0: 0x00000000 R7:0xc0bc051c 0x00000020 0x00010007
>>
>> so there is an exception on the instruction "movl PCB_CR3(%edx),%eax"
>> in function cpu_switch(). The next stack entries indicates a lot of page
>> faults, but the "double fault" happens not until the page boundary at
>> 0xe437f000 is reached. I do not really understand this, but it seems to
>> me that the thread
>
> Can you try to recover the %ecx, %edx values for the faulted frame ?
> Note that %ecx is loaded from the on-stack argument.

From source swtch.s

        /* Save is done.  Now fire up new thread. Leave old vmspace. */
        movl    4(%esp),%edi
        movl    8(%esp),%ecx            /* New thread */
        movl    12(%esp),%esi           /* New lock */
#ifdef INVARIANTS
        testl   %ecx,%ecx               /* no thread? */
        jz      badsw3                  /* no, panic */
#endif
        movl    TD_PCB(%ecx),%edx

        /* switch address space */
        movl    PCB_CR3(%edx),%eax

it can be seen by inspection of the stack that %ecx is loaded with the
address of newtd (0xc8029a20) and %edx is loaded with the address of
newpcb (0xf0a3ad40). So we see an exception during the execution of a
correct machine instruction.

At the moment of the double fault I see the same values in the saved TSS:

(kgdb) p/x __pcpu[2]->pc_common_tss
$16 = {tss_link = 0x0, tss_esp0 = 0xe437fd30, tss_ss0 = 0x28,
  tss_esp1 = 0x0, tss_ss1 = 0x0, tss_esp2 = 0x0, tss_ss2 = 0x0,
  tss_cr3 = 0x0, tss_eip = 0xc0bacac8, tss_eflags = 0x10007,
  tss_eax = 0xc08f492f, tss_ecx = 0xc8029a20, tss_edx = 0xf0a3ad40,
  tss_ebx = 0xd3cf, tss_esp = 0xe437f000, tss_ebp = 0xe437fafc,
  tss_esi = 0xc0e43400, tss_edi = 0xc7ebd000, tss_es = 0x28,
  tss_cs = 0x20, tss_ss = 0x28, tss_ds = 0x28, tss_fs = 0x8,
  tss_gs = 0x3b, tss_ldt = 0x0, tss_ioopt = 0x680000}.
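
As a cross-check of the values quoted above, the four words at 0xe437faa0
can be read as a same-privilege i386 exception frame (error code, %eip,
%cs, %eflags, in ascending address order). The sketch below is only
illustrative: the helper names are made up for this example, and treating
the first word as an error code assumes the exception was one that pushes
an error code, such as a page fault.

```python
PAGE_SIZE = 4096  # i386 page size

# Architectural EFLAGS bit positions (bit 1 is reserved and always set).
EFLAGS_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF", 8: "TF",
    9: "IF", 10: "DF", 11: "OF", 14: "NT", 16: "RF", 17: "VM", 18: "AC",
}


def decode_eflags(value):
    """Return the names of the flags set in an EFLAGS value, lowest bit first."""
    return [name for bit, name in sorted(EFLAGS_BITS.items())
            if value & (1 << bit)]


def decode_frame(words):
    """Decode [error code, eip, cs, eflags] as found on the kernel stack.

    Assumption: no privilege change, so %esp and %ss are not part of the frame.
    """
    err, eip, cs, eflags = words
    return {"err": err, "eip": eip, "cs": cs, "eflags": decode_eflags(eflags)}


def offset_in_page(addr):
    """Distance in bytes from the start of the page containing addr."""
    return addr & (PAGE_SIZE - 1)


# Words taken from the stack dump at 0xe437faa0 quoted above.
frame = decode_frame([0x00000000, 0xc0bc051c, 0x00000020, 0x00010007])
print(hex(frame["eip"]), hex(frame["cs"]), frame["eflags"])
print(hex(offset_in_page(0xe437faa0)))   # room left below the first frame
print(offset_in_page(0xe437f000))        # tss_esp from the saved TSS
```

With these inputs %cs decodes to 0x20, which matches tss_cs, the IF flag
is clear, and the first frame sits 0xaa0 bytes above the page start at
0xe437f000, the value found in tss_esp.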
Also, tss_eax = 0xc08f492f is the return address, so the movl for
"switch address space" was not executed.

> Do you have latest CPU microcode loaded ? Your machine is very old,
> I believe this is P4 class processor, am I right ?

I have to correct one detail: the output

(kgdb) p/x cpu_id
$4 = 0xf29

for the CPUID was correct, but the corresponding output from dmesg was
not from the crashing server, so here is the correct one:

CPU: Intel(R) Xeon(TM) CPU 2.80GHz (2791.05-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0xf29  Family=0xf  Model=0x2  Stepping=9
  Features=0xbfebfbff
  Features2=0x4400

kenv gives:

smbios.bios.reldate="01/25/2005"
smbios.bios.vendor="Intel Corporation"
smbios.bios.version="SWV25.86B.0250.P39.0501252032"
smbios.chassis.maker="Intel Corporation"
smbios.memory.enabled="4194304"
smbios.planar.maker="Intel "
smbios.planar.product="SE7501WV2S"
smbios.planar.serial="000E0C5C4ADE374"
smbios.planar.version="A99386-112"
smbios.socket.enabled="2"
smbios.socket.populated="2"
smbios.system.maker="MAXDATA"
smbios.system.product="PLATINUM 2210R" (OEM, Intel SR2300)
smbios.system.serial=" "
smbios.system.uuid="d69da6f3-015e-11d9-b9dc-00108365a7e7"
smbios.version="2.3"

From the manual "Intel Xeon Processor (Document Number 249679-056)" I
found that my CPU is a Xeon 2.8B "Prestonia" (CPUID 0F29H, Core Stepping
D1) released 8.11.2002. I have the latest microcode revision m02f292d,
but my BIOS version P39 was not the latest. In the meantime I have
upgraded to BIOS version P43.

> Sure if pcb access faults, the system is in very broken state and
> since an attempt to handle the fault causes a new fault for pcb access,
> it recurses and dies due to the stack overflow.

Agreed.

-- 
Andreas Longwitz