From owner-freebsd-hackers@freebsd.org  Sun Nov  5 16:24:26 2017
Message-ID: <59FF3B2E.5010603@incore.de>
Date: Sun, 05 Nov 2017 17:24:14 +0100
From: Andreas Longwitz <longwitz@incore.de>
User-Agent: Thunderbird 2.0.0.19 (X11/20090113)
MIME-Version: 1.0
To: Konstantin Belousov <kostikbel@gmail.com>
CC: freebsd-hackers@freebsd.org
Subject: Re: double fault on 10.3-Stable i386 during installworld
References: <59D11664.1060206@incore.de> <20171001180943.GO95911@kib.kiev.ua>
 <59F910C5.8020709@incore.de> <20171101092619.GJ2566@kib.kiev.ua>
In-Reply-To: <20171101092619.GJ2566@kib.kiev.ua>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Thanks for the answer. I am now sure the reason for the double fault is
not a FreeBSD problem but a CPU problem.

>> On the stack we have
>>
>> 0xe437faa0:    0x00000000  R7:0xc0bc051c     0x00000020     0x00010007
>>
>> so there is an exception on the instruction "movl  PCB_CR3(%edx),%eax"
>> in function cpu_switch(). The next stack entries indicates a lot of page
>> faults, but the "double fault" happens not until the page boundary at
>> 0xe437f000 is reached. I do not really understand this, but it seems to
>> me that the thread
> 
> Can you try to recover the %ecx, %edx values for the faulted frame ?
> Note that %ecx is loaded from the on-stack argument.

From source swtch.s:

        /* Save is done.  Now fire up new thread. Leave old vmspace. */
        movl    4(%esp),%edi
        movl    8(%esp),%ecx                    /* New thread */
        movl    12(%esp),%esi                   /* New lock */
#ifdef INVARIANTS
        testl   %ecx,%ecx                       /* no thread? */
        jz      badsw3                          /* no, panic */
#endif
        movl    TD_PCB(%ecx),%edx

        /* switch address space */
        movl    PCB_CR3(%edx),%eax

Inspection of the stack shows that %ecx is loaded with the address of
newtd (0xc8029a20) and %edx with the address of newpcb (0xf0a3ad40). So
the exception occurs during the execution of a correct machine
instruction. At the moment of the double fault I see the same values in
the saved TSS:

(kgdb) p/x __pcpu[2]->pc_common_tss
$16 = {tss_link = 0x0, tss_esp0 = 0xe437fd30, tss_ss0 = 0x28,
  tss_esp1 = 0x0, tss_ss1 = 0x0, tss_esp2 = 0x0, tss_ss2 = 0x0,
  tss_cr3 = 0x0, tss_eip = 0xc0bacac8, tss_eflags = 0x10007,
  tss_eax = 0xc08f492f, tss_ecx = 0xc8029a20, tss_edx = 0xf0a3ad40,
  tss_ebx = 0xd3cf, tss_esp = 0xe437f000, tss_ebp = 0xe437fafc,
  tss_esi = 0xc0e43400, tss_edi = 0xc7ebd000, tss_es = 0x28,
  tss_cs = 0x20, tss_ss = 0x28, tss_ds = 0x28, tss_fs = 0x8,
  tss_gs = 0x3b, tss_ldt = 0x0, tss_ioopt = 0x680000}

In addition, tss_eax = 0xc08f492f is still the return address, so the
movl for "switch address space", which would have overwritten %eax, was
not executed.

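Two fields of the saved TSS can be cross-checked mechanically. The
following is only a minimal sketch, not part of the original analysis:
it decodes tss_eflags using the standard IA-32 EFLAGS bit positions and
tests whether tss_esp sits exactly on a 4 KB page boundary. The helper
names decode_eflags and on_page_boundary are hypothetical.

```python
# Illustrative sketch: decode two fields of the saved TSS quoted above.
# Bit positions follow the IA-32 EFLAGS layout; helper names are hypothetical.

EFLAGS_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF", 8: "TF", 9: "IF",
    10: "DF", 11: "OF", 14: "NT", 16: "RF", 17: "VM", 18: "AC",
}

PAGE_SIZE = 4096  # i386 base page size


def decode_eflags(value):
    """Return the names of the documented flag bits set in value."""
    return [name for bit, name in sorted(EFLAGS_BITS.items())
            if value & (1 << bit)]


def on_page_boundary(addr):
    """True if addr is aligned to the start of a page."""
    return addr % PAGE_SIZE == 0


tss_eflags = 0x10007
tss_esp = 0xe437f000

print(decode_eflags(tss_eflags))   # IF is absent: interrupts were disabled
print(on_page_boundary(tss_esp))   # stack pointer reached the page boundary
```

With the values above, the flag list contains CF, PF and RF but not IF,
and the stack pointer is page aligned, which fits the stack having been
used up down to 0xe437f000.
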
> Do you have latest CPU microcode loaded ?  Your machine is very old,
> I believe this is P4 class processor, am I right ?

I have to correct one detail: The output

(kgdb) p/x cpu_id
$4 = 0xf29

for the CPUID was correct, but the corresponding output from dmesg was
not from the crashing server, so here is the correct one:

CPU: Intel(R) Xeon(TM) CPU 2.80GHz (2791.05-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0xf29  Family=0xf  Model=0x2  Stepping=9
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4400<CNXT-ID,xTPR>
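
The Id value and the Family/Model/Stepping fields in the dmesg line are
the same number seen two ways. As a minimal sketch (the function name
decode_cpu_signature is hypothetical), the standard CPUID leaf 1 EAX
layout can be unpacked like this:

```python
# Illustrative sketch: split a CPUID signature (leaf 1, EAX) into
# family, model and stepping. The function name is hypothetical.

def decode_cpu_signature(eax):
    stepping = eax & 0xF
    model = (eax >> 4) & 0xF
    family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended model field only applies to base family 0x6 or 0xF.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    # The extended family field is added only when base family is 0xF.
    if family == 0xF:
        family += ext_family
    return family, model, stepping


family, model, stepping = decode_cpu_signature(0xf29)
print(hex(family), hex(model), stepping)
```

For 0xf29 this gives family 0xf, model 0x2, stepping 9, matching the
dmesg line.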

kenv gives:
smbios.bios.reldate="01/25/2005"
smbios.bios.vendor="Intel Corporation"
smbios.bios.version="SWV25.86B.0250.P39.0501252032"
smbios.chassis.maker="Intel Corporation"
smbios.memory.enabled="4194304"
smbios.planar.maker="Intel     "
smbios.planar.product="SE7501WV2S"
smbios.planar.serial="000E0C5C4ADE374"
smbios.planar.version="A99386-112"
smbios.socket.enabled="2"
smbios.socket.populated="2"
smbios.system.maker="MAXDATA"
smbios.system.product="PLATINUM 2210R" (OEM, Intel SR2300)
smbios.system.serial="               "
smbios.system.uuid="d69da6f3-015e-11d9-b9dc-00108365a7e7"
smbios.version="2.3"

From the manual "Intel Xeon Processor" (Document Number 249679-056) I
found that my CPU is a Xeon 2.8B "Prestonia" (CPUID 0F29H, Core Stepping
D1), released 8.11.2002. I have the latest microcode revision m02f292d,
but my BIOS version P39 was not the latest. In the meantime I have
upgraded to BIOS version P43.

> Sure if pcb access faults, the system is in very broken state and
> since an attempt to handle the fault causes a new fault for pcb access,
> it recurses and dies due to the stack overflow.

Agreed.

-- 
Andreas Longwitz