Date: Sat, 30 Nov 2013 13:02:16 +0100
From: Peter Holm <peter@holm.cc>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Don Lewis <truckman@freebsd.org>, freebsd-current@freebsd.org
Subject: Re: panic: double fault with 11.0-CURRENT r258504
Message-ID: <20131130120216.GA48738@x2.osted.lan>
In-Reply-To: <20131128075610.GJ59496@kib.kiev.ua>
References: <20131127200050.GE59496@kib.kiev.ua> <201311272111.rARLBZk9042868@gw.catspoiler.org> <20131128075610.GJ59496@kib.kiev.ua>

On Thu, Nov 28, 2013 at 09:56:10AM +0200, Konstantin Belousov wrote:
> On Wed, Nov 27, 2013 at 01:11:35PM -0800, Don Lewis wrote:
> > On 27 Nov, Konstantin Belousov wrote:
> > > On Wed, Nov 27, 2013 at 11:35:19AM -0800, Don Lewis wrote:
> > >> On 27 Nov, Konstantin Belousov wrote:
> > >> > On Wed, Nov 27, 2013 at 11:02:57AM -0800, Don Lewis wrote:
> > >> >> On 27 Nov, Konstantin Belousov wrote:
> > >> >> > On Wed, Nov 27, 2013 at 10:33:30AM -0800, Don Lewis wrote:
> > >> >> >> On 27 Nov, Konstantin Belousov wrote:
> > >> >> >> > On Wed, Nov 27, 2013 at 09:41:36AM -0800, Don Lewis wrote:
> > >> >> >> >> On 27 Nov, Konstantin Belousov wrote:
> > >> >> >> >> > On Wed, Nov 27, 2013 at 02:49:12AM -0800, Don Lewis wrote:
> > >> >> >> >> >> <http://people.freebsd.org/~truckman/doublefault2.JPG>
> > >> >> >> >> >
> > >> >> >> >> > What is the instruction at cpu_switch+0x9b ?
> > >> >> >> >>
> > >> >> >> >> movl 0x8(%edx),%eax
> > >> >> >> > So it is line 176 in swtch.s. Is machine still in ddb, or did you
> > >> >> >> > obtained the core ? If yes, please print out the content of words at
> > >> >> >> > 0xe4f62bb0 + 4, +8 (*), +16. Please print the content of the word at
> > >> >> >> > address (*) + 8.
> > >> >> >>
> > >> >> >> It is still in ddb.
> > >> >> >>
> > >> >> >> <http://people.freebsd.org/~truckman/doublefault3.JPG>, though not in
> > >> >> >> the above order.
> > >> >> > Uhm, sorry, I mistyped the last part of the instructions.
> > >> >> >
> > >> >> > The new thread pointer is 0xd2f4e000, there is nothing incriminating.
> > >> >> > Please print the word at 0xd2f4e000+0x254 == 0xd2f4e254, which would be
> > >> >> > the address of the new thread pcb. It is load from the pcb + 8 which
> > >> >> > faults.
> > >> >>
> > >> >> 0xf3d44d60
> > >> > Again, the pointer looks fine, and its tail is 0xd60, which is correct for
> > >> > the pcb offset in the last page of the thread stack.
> > >> >
> > >> > Please do 'show thread 0xd2f4e000' before trying below instructions.
> > >>
> > >> Ok, see below:
> > >>
> > >> > What happens if you try to read word at 0xf3d44d68 ?
> > >>
> > >> Nothing bad ...
> > >>
> > >> <http://people.freebsd.org/~truckman/doublefault4.JPG>
> > >>
> > > So the thread structure looks sane, the stack region is in place where
> > > it is supposed to be, all the gathered data looks self-consistent. And,
> > > the access to the faulted address from ddb does not fault.
> > >
> > > Thread stacks can only be invalidated when the process is swapped out and
> > > kernel stack is written to swap. Your thread flags indicate that it is
> > > in memory, and TDF_CANSWAP is not set. I do not believe that our swapout
> > > code would invalidate stack mapping in such situation, otherwise we would
> > > have too many complaints already.
> > >
> > > Just in case, do you use swap on this box ?
> >
> > I do.
> >
> > > And, as the last resort, I do understand that this sounds as giving up,
> > > do you monitor the temperature of the CPUs ? BTW, which CPUs are that,
> > > please show the cpu identification lines from the boot dmesg.
> >
> > I don't monitor the temperature, but I do hear the CPU fan speed ramping
> > up and down when I'm building ports like this. Even though I'm pretty
> > much keeping one core busy the whole time, the temperature must drop
> > enough at times to let the fan speed drop.
> >
> > I can run math/mprime on this machine for a while to see if anything
> > shows up. I also have a very similar machine (same motherboard but
> > different CPU) that I can move the drive over to and test.
> >
> > Here's the full dmesg.boot:
> >
> > Copyright (c) 1992-2013 The FreeBSD Project.
> > Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
> > The Regents of the University of California. All rights reserved.
> > FreeBSD is a registered trademark of The FreeBSD Foundation.
> > FreeBSD 11.0-CURRENT #63 r258614M: Tue Nov 26 00:29:01 PST 2013
> > dl@scratch.catspoiler.org:/usr/obj/usr/src/sys/GENERICSMB i386
> > FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610
> > WARNING: WITNESS option enabled, expect reduced performance.
> > CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (2500.06-MHz 686-class CPU)
> > Origin = "AuthenticAMD" Id = 0x60fb1 Family = 0xf Model = 0x6b Stepping = 1
> > Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
> > Features2=0x2001<SSE3,CX16>
> > AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
> > AMD Features2=0x11f<LAHF,CMP,SVM,ExtAPIC,CR8,Prefetch>
> The errata list for the Athlon 64 X2 is quite long. Do you have latest
> BIOS ? I am not sure if AMD provides standalone firmware update blocks
> for their CPUs. If any Linux distribution ships updates for AMD CPUs,
> it might be useful to load the update with cpucontrol(8). Even if we
> do not hit a CPU bug, it would provide me with more certainity that we
> are not chasing ghost.
>
> Another things to try, in vain, is to compile kernel with gcc or disable
> SMP.
>
> Peter, could you, please, try to reproduce the issue ? It does not look
> like a random hardware failure, since in all cases, it is curthread access
> which is faulting. The issue is only reported by Don, and so far only
> for i386 SMP.

I'm not seeing this issue on my AMD Phenom(tm) 9150e Quad-Core Processor
with i386/r258703.

- Peter