Date:      Thu, 28 Nov 2013 00:56:37 -0800 (PST)
From:      Don Lewis <truckman@FreeBSD.org>
To:        kostikbel@gmail.com
Cc:        pho@FreeBSD.org, freebsd-current@FreeBSD.org
Subject:   Re: panic: double fault with 11.0-CURRENT r258504
Message-ID:  <201311280856.rAS8ubLR044563@gw.catspoiler.org>
In-Reply-To: <20131128075610.GJ59496@kib.kiev.ua>

On 28 Nov, Konstantin Belousov wrote:
> On Wed, Nov 27, 2013 at 01:11:35PM -0800, Don Lewis wrote:
>> On 27 Nov, Konstantin Belousov wrote:
>> > On Wed, Nov 27, 2013 at 11:35:19AM -0800, Don Lewis wrote:
>> >> On 27 Nov, Konstantin Belousov wrote:
>> >> > On Wed, Nov 27, 2013 at 11:02:57AM -0800, Don Lewis wrote:
>> >> >> On 27 Nov, Konstantin Belousov wrote:
>> >> >> > On Wed, Nov 27, 2013 at 10:33:30AM -0800, Don Lewis wrote:
>> >> >> >> On 27 Nov, Konstantin Belousov wrote:
>> >> >> >> > On Wed, Nov 27, 2013 at 09:41:36AM -0800, Don Lewis wrote:
>> >> >> >> >> On 27 Nov, Konstantin Belousov wrote:
>> >> >> >> >> > On Wed, Nov 27, 2013 at 02:49:12AM -0800, Don Lewis wrote:
>> >> >> >> >> >> <http://people.freebsd.org/~truckman/doublefault2.JPG>
>> >> >> >> >> > 
>> >> >> >> >> > What is the instruction at cpu_switch+0x9b ?
>> >> >> >> >> 
>> >> >> >> >> movl 0x8(%edx),%eax
>> >> >> >> > So it is line 176 in swtch.s. Is machine still in ddb, or did you
>> >> >> >> > obtained the core ? If yes, please print out the content of words at
>> >> >> >> > 0xe4f62bb0 + 4, +8 (*), +16. Please print the content of the word at
>> >> >> >> > address (*) + 8.
>> >> >> >> 
>> >> >> >> It is still in ddb.
>> >> >> >> 
>> >> >> >> <http://people.freebsd.org/~truckman/doublefault3.JPG>, though not in
>> >> >> >> the above order.
>> >> >> > Uhm, sorry, I mistyped the last part of the instructions.
>> >> >> > 
>> >> >> > The new thread pointer is 0xd2f4e000, there is nothing incriminating.
>> >> >> > Please print the word at 0xd2f4e000+0x254 == 0xd2f4e254, which would be
>> >> >> > the address of the new thread pcb. It is load from the pcb + 8 which
>> >> >> > faults.
>> >> >> 
>> >> >> 0xf3d44d60
>> >> > Again, the pointer looks fine, and its tail is 0xd60, which is correct for
>> >> > the pcb offset in the last page of the thread stack.
>> >> > 
>> >> > Please do 'show thread 0xd2f4e000' before trying below instructions.
>> >> 
>> >> Ok, see below:
>> >>  
>> >> > What happens if you try to read word at 0xf3d44d68 ?
>> >> 
>> >> Nothing bad ...
>> >> 
>> >> <http://people.freebsd.org/~truckman/doublefault4.JPG>
>> >> 
>> > So the thread structure looks sane, the stack region is in place where
>> > it is supposed to be, all the gathered data looks self-consistent. And,
>> > the access to the faulted address from ddb does not fault.
>> > 
>> > Thread stacks can only be invalidated when the process is swapped out and
>> > kernel stack is written to swap.  Your thread flags indicate that it is
>> > in memory, and TDF_CANSWAP is not set.  I do not believe that our swapout
>> > code would invalidate stack mapping in such situation, otherwise we would
>> > have too many complaints already.
>> > 
>> > Just in case, do you use swap on this box ?
>> 
>> I do.
>> 
>> > And, as the last resort, I do understand that this sounds as giving up,
>> > do you monitor the temperature of the CPUs ? BTW, which CPUs are that,
>> > please show the cpu identification lines from the boot dmesg.
>> 
>> I don't monitor the temperature, but I do hear the CPU fan speed ramping
>> up and down when I'm building ports like this.  Even though I'm pretty
>> much keeping one core busy the whole time, the temperature must drop
>> enough at times to let the fan speed drop.
>> 
>> I can run math/mprime on this machine for a while to see if anything
>> shows up.  I also have a very similar machine (same motherboard but
>> different CPU) that I can move the drive over to and test.
>> 
>> Here's the full dmesg.boot:
>> 
>> Copyright (c) 1992-2013 The FreeBSD Project.
>> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>> 	The Regents of the University of California. All rights reserved.
>> FreeBSD is a registered trademark of The FreeBSD Foundation.
>> FreeBSD 11.0-CURRENT #63 r258614M: Tue Nov 26 00:29:01 PST 2013
>>     dl@scratch.catspoiler.org:/usr/obj/usr/src/sys/GENERICSMB i386
>> FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610
>> WARNING: WITNESS option enabled, expect reduced performance.
>> CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (2500.06-MHz 686-class CPU)
>>   Origin = "AuthenticAMD"  Id = 0x60fb1  Family = 0xf  Model = 0x6b  Stepping = 1
>>   Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
>>   Features2=0x2001<SSE3,CX16>
>>   AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
>>   AMD Features2=0x11f<LAHF,CMP,SVM,ExtAPIC,CR8,Prefetch>
> 
> The errata list for the Athlon 64 X2 is quite long.  Do you have latest
> BIOS ?  I am not sure if AMD provides standalone firmware update blocks
> for their CPUs.  If any Linux distribution ships updates for AMD CPUs,
> it might be useful to load the update with cpucontrol(8).  Even if we
> do not hit a CPU bug, it would provide me with more certainty that we
> are not chasing ghost.

I haven't figured out how to find the currently installed BIOS version.
The motherboard is Abit, which is no more, but I found an archive of all
of their downloads.  I'll also check into updates from the Linux world.

> Another things to try, in vain, is to compile kernel with gcc or disable
> SMP.

It has survived 10 hours running two copies of mprime.  I just moved the
boot drive over to another machine with the same type of
motherboard, but a different model AMD X2 CPU.

CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (2200.05-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x40fb2  Family = 0xf  Model = 0x4b  Stepping = 2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x2001<SSE3,CX16>
  AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x1f<LAHF,CMP,SVM,ExtAPIC,CR8>
real memory  = 2147483648 (2048 MB)
avail memory = 1940611072 (1850 MB)
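
For comparing the two CPUs, here is a rough sketch of how the Id values
in the dmesg lines break down into family, model and stepping, and which
AMD Features2 bit differs between them.  It assumes the usual AMD CPUID
function 1 layout, where the extended family and model fields apply when
the base family is 0xf.  The function and variable names are made up for
illustration only.

```python
def decode_cpuid_signature(sig):
    """Split a CPUID function 1 EAX signature into (family, model, stepping).

    Assumes the AMD convention: the extended family/model fields are only
    folded in when the base family field is 0xf.
    """
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    if base_family == 0xF:
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model
    return family, model, stepping


# Id values copied from the two dmesg outputs in this thread.
cpu_4800 = decode_cpuid_signature(0x60FB1)
cpu_4200 = decode_cpuid_signature(0x40FB2)

# AMD Features2 words from the same outputs; XOR leaves only the differing bits.
feature_diff = 0x11F ^ 0x1F

for name, (family, model, stepping) in (("4800+", cpu_4800), ("4200+", cpu_4200)):
    print(f"{name}: family={family:#x} model={model:#x} stepping={stepping}")
print(f"AMD Features2 difference: {feature_diff:#x}")
```

Both Ids decode to the Family/Model/Stepping values that dmesg prints,
and the only AMD Features2 difference is bit 8, which is the Prefetch
flag listed for the 4800+ but not for the 4200+.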

I also have a fairly new quad core AMD box I can test on, as well as an
old dual P III machine.

This machine gets updated every month or so and I've never had stability
problems with it until just recently.  It's definitely been using clang
for quite a while without any problems other than the ports mess.

> Peter, could you, please, try to reproduce the issue ?  It does not look
> like a random hardware failure, since in all cases, it is curthread access
> which is faulting.  The issue is only reported by Don, and so far only
> for i386 SMP.

The workload that is triggering this is
	portupgrade -fr lang/perl5.16

I've got 1000+ ports installed and this causes 400+ to be rebuilt.  That
seems to cause it to panic about half the time.  The last time it made
it through 268 ports before it croaked.
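
For reference, the address arithmetic from the ddb session quoted above
can be double-checked with a few lines like these.  This is only a
sketch: the constants are the values quoted in the thread, the 4096-byte
page size is the normal i386 base page size, and the variable names are
invented for illustration.

```python
PAGE_SIZE = 4096            # i386 base page size (assumption stated above)

td = 0xD2F4E000             # new thread pointer, as quoted in the thread
TD_PCB_OFFSET = 0x254       # offset of the pcb pointer in the thread, as quoted
pcb = 0xF3D44D60            # word read from td + 0x254
LOAD_DISPLACEMENT = 0x8     # from the faulting "movl 0x8(%edx),%eax"

# Address that holds the pointer to the new thread's pcb.
pcb_pointer_address = td + TD_PCB_OFFSET

# Address the faulting load would read, assuming %edx held the pcb pointer.
faulting_address = pcb + LOAD_DISPLACEMENT

# Offset of the pcb within its page (the "tail" of the pointer).
pcb_page_offset = pcb % PAGE_SIZE

print(f"pcb pointer lives at {pcb_pointer_address:#x}")
print(f"faulting load reads  {faulting_address:#x}")
print(f"pcb page offset      {pcb_page_offset:#x}")
```

The results match what was typed into ddb: 0xd2f4e254 for the pcb
pointer, 0xf3d44d68 for the word that reads back fine from ddb, and a
page offset of 0xd60.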



