Date: Sun, 05 Nov 2017 17:24:14 +0100
From: Andreas Longwitz <longwitz@incore.de>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: freebsd-hackers@freebsd.org
Subject: Re: double fault on 10.3-Stable i386 during installworld
Message-ID: <59FF3B2E.5010603@incore.de>
In-Reply-To: <20171101092619.GJ2566@kib.kiev.ua>
References: <59D11664.1060206@incore.de> <20171001180943.GO95911@kib.kiev.ua> <59F910C5.8020709@incore.de> <20171101092619.GJ2566@kib.kiev.ua>
Thanks for the answer. I am now sure that the reason for the double fault
is not a FreeBSD problem; it is a CPU problem.

>> On the stack we have
>>
>> 0xe437faa0: 0x00000000 0xc0bc051c 0x00000020 0x00010007
>>
>> so there is an exception on the instruction "movl PCB_CR3(%edx),%eax"
>> in function cpu_switch(). The next stack entries indicates a lot of page
>> faults, but the "double fault" happens not until the page boundary at
>> 0xe437f000 is reached. I do not really understand this, but it seems to
>> me that the thread
>
> Can you try to recover the %ecx, %edx values for the faulted frame ?
> Note that %ecx is loaded from the on-stack argument.

From source swtch.s

        /* Save is done.  Now fire up new thread. Leave old vmspace. */
        movl    4(%esp),%edi
        movl    8(%esp),%ecx            /* New thread */
        movl    12(%esp),%esi           /* New lock */
#ifdef INVARIANTS
        testl   %ecx,%ecx               /* no thread? */
        jz      badsw3                  /* no, panic */
#endif
        movl    TD_PCB(%ecx),%edx

        /* switch address space */
        movl    PCB_CR3(%edx),%eax

it can be seen by inspection of the stack that %ecx is loaded with the
address of newtd (0xc8029a20) and %edx is loaded with the address of
newpcb (0xf0a3ad40). So we see an exception during the execution of a
correct machine instruction. At the moment of the double fault I see the
same values in the saved TSS:

(kgdb) p/x __pcpu[2]->pc_common_tss
$16 = {tss_link = 0x0, tss_esp0 = 0xe437fd30, tss_ss0 = 0x28,
  tss_esp1 = 0x0, tss_ss1 = 0x0, tss_esp2 = 0x0, tss_ss2 = 0x0,
  tss_cr3 = 0x0, tss_eip = 0xc0bacac8, tss_eflags = 0x10007,
  tss_eax = 0xc08f492f, tss_ecx = 0xc8029a20, tss_edx = 0xf0a3ad40,
  tss_ebx = 0xd3cf, tss_esp = 0xe437f000, tss_ebp = 0xe437fafc,
  tss_esi = 0xc0e43400, tss_edi = 0xc7ebd000, tss_es = 0x28,
  tss_cs = 0x20, tss_ss = 0x28, tss_ds = 0x28, tss_fs = 0x8,
  tss_gs = 0x3b, tss_ldt = 0x0, tss_ioopt = 0x680000}

Also we have tss_eax = 0xc08f492f = return address, so the movl for
"switch address space" was not executed.

> Do you have latest CPU microcode loaded ? Your machine is very old,
> I believe this is P4 class processor, am I right ?

I have to correct one detail: the output

(kgdb) p/x cpu_id
$4 = 0xf29

for the CPUID was correct, but the corresponding output from dmesg was
not from the crashing server, so here is the correct one:

CPU: Intel(R) Xeon(TM) CPU 2.80GHz (2791.05-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0xf29  Family=0xf  Model=0x2  Stepping=9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4400<CNXT-ID,xTPR>

kenv gives:

smbios.bios.reldate="01/25/2005"
smbios.bios.vendor="Intel Corporation"
smbios.bios.version="SWV25.86B.0250.P39.0501252032"
smbios.chassis.maker="Intel Corporation"
smbios.memory.enabled="4194304"
smbios.planar.maker="Intel "
smbios.planar.product="SE7501WV2S"
smbios.planar.serial="000E0C5C4ADE374"
smbios.planar.version="A99386-112"
smbios.socket.enabled="2"
smbios.socket.populated="2"
smbios.system.maker="MAXDATA"
smbios.system.product="PLATINUM 2210R" (OEM, Intel SR2300)
smbios.system.serial=" "
smbios.system.uuid="d69da6f3-015e-11d9-b9dc-00108365a7e7"
smbios.version="2.3"

From the manual "Intel Xeon Processor (Document Number 249679-056)" I found
that my CPU is a Xeon 2.8B "Prestonia" (CPUID 0F29H, Core Stepping D1)
released 8.11.2002. I have the latest microcode revision m02f292d, but my
BIOS version P39 was not the latest. In the meantime I have upgraded to
BIOS version P43.

> Sure if pcb access faults, the system is in very broken state and
> since an attempt to handle the fault causes a new fault for pcb access,
> it recurses and dies due to the stack overflow.

Agree.

--
Andreas Longwitz