From owner-freebsd-current@FreeBSD.ORG Thu Nov 28 08:56:52 2013
Message-Id: <201311280856.rAS8ubLR044563@gw.catspoiler.org>
Date: Thu, 28 Nov 2013 00:56:37 -0800 (PST)
From: Don Lewis
Subject: Re: panic: double fault with 11.0-CURRENT r258504
To: kostikbel@gmail.com
In-Reply-To: <20131128075610.GJ59496@kib.kiev.ua>
Cc: pho@FreeBSD.org, freebsd-current@FreeBSD.org
List-Id: Discussions about the use of FreeBSD-current

On 28 Nov, Konstantin Belousov wrote:
> On Wed, Nov 27, 2013 at 01:11:35PM -0800, Don Lewis wrote:
>> On 27 Nov, Konstantin Belousov wrote:
>> > On Wed, Nov 27, 2013 at 11:35:19AM -0800, Don Lewis wrote:
>> >> On 27 Nov, Konstantin Belousov wrote:
>> >> > On Wed, Nov 27, 2013 at 11:02:57AM -0800, Don Lewis wrote:
>> >> >> On 27 Nov, Konstantin Belousov wrote:
>> >> >> > On Wed, Nov 27, 2013 at 10:33:30AM -0800, Don Lewis wrote:
>> >> >> >> On 27 Nov, Konstantin Belousov wrote:
>> >> >> >> > On Wed, Nov 27, 2013 at
>> >> >> >> > 09:41:36AM -0800, Don Lewis wrote:
>> >> >> >> >> On 27 Nov, Konstantin Belousov wrote:
>> >> >> >> >> > On Wed, Nov 27, 2013 at 02:49:12AM -0800, Don Lewis wrote:
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> > What is the instruction at cpu_switch+0x9b ?
>> >> >> >> >>
>> >> >> >> >> movl 0x8(%edx),%eax
>> >> >> >> > So it is line 176 in swtch.s. Is machine still in ddb, or did you
>> >> >> >> > obtained the core ? If yes, please print out the content of words at
>> >> >> >> > 0xe4f62bb0 + 4, +8 (*), +16. Please print the content of the word at
>> >> >> >> > address (*) + 8.
>> >> >> >>
>> >> >> >> It is still in ddb.
>> >> >> >>
>> >> >> >> , though not in
>> >> >> >> the above order.
>> >> >> > Uhm, sorry, I mistyped the last part of the instructions.
>> >> >> >
>> >> >> > The new thread pointer is 0xd2f4e000, there is nothing incriminating.
>> >> >> > Please print the word at 0xd2f4e000+0x254 == 0xd2f4e254, which would be
>> >> >> > the address of the new thread pcb. It is load from the pcb + 8 which
>> >> >> > faults.
>> >> >>
>> >> >> 0xf3d44d60
>> >> > Again, the pointer looks fine, and its tail is 0xd60, which is correct for
>> >> > the pcb offset in the last page of the thread stack.
>> >> >
>> >> > Please do 'show thread 0xd2f4e000' before trying below instructions.
>> >>
>> >> Ok, see below:
>> >>
>> >> > What happens if you try to read word at 0xf3d44d68 ?
>> >>
>> >> Nothing bad ...
>> >>
>> >>
>> >>
>> > So the thread structure looks sane, the stack region is in place where
>> > it is supposed to be, all the gathered data looks self-consistent. And,
>> > the access to the faulted address from ddb does not fault.
>> >
>> > Thread stacks can only be invalidated when the process is swapped out and
>> > kernel stack is written to swap. Your thread flags indicate that it is
>> > in memory, and TDF_CANSWAP is not set.
>> > I do not believe that our swapout
>> > code would invalidate stack mapping in such situation, otherwise we would
>> > have too many complaints already.
>> >
>> > Just in case, do you use swap on this box ?
>>
>> I do.
>>
>> > And, as the last resort, I do understand that this sounds as giving up,
>> > do you monitor the temperature of the CPUs ? BTW, which CPUs are that,
>> > please show the cpu identification lines from the boot dmesg.
>>
>> I don't monitor the temperature, but I do hear the CPU fan speed ramping
>> up and down when I'm building ports like this. Even though I'm pretty
>> much keeping one core busy the whole time, the temperature must drop
>> enough at times to let the fan speed drop.
>>
>> I can run math/mprime on this machine for a while to see if anything
>> shows up. I also have a very similar machine (same motherboard but
>> different CPU) that I can move the drive over to and test.
>>
>> Here's the full dmesg.boot:
>>
>> Copyright (c) 1992-2013 The FreeBSD Project.
>> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>> The Regents of the University of California. All rights reserved.
>> FreeBSD is a registered trademark of The FreeBSD Foundation.
>> FreeBSD 11.0-CURRENT #63 r258614M: Tue Nov 26 00:29:01 PST 2013
>> dl@scratch.catspoiler.org:/usr/obj/usr/src/sys/GENERICSMB i386
>> FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610
>> WARNING: WITNESS option enabled, expect reduced performance.
>> CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (2500.06-MHz 686-class CPU)
>> Origin = "AuthenticAMD" Id = 0x60fb1 Family = 0xf Model = 0x6b Stepping = 1
>> Features=0x178bfbff
>> Features2=0x2001
>> AMD Features=0xea500800
>> AMD Features2=0x11f
>
> The errata list for the Athlon 64 X2 is quite long. Do you have latest
> BIOS ? I am not sure if AMD provides standalone firmware update blocks
> for their CPUs.
> If any Linux distribution ships updates for AMD CPUs,
> it might be useful to load the update with cpucontrol(8). Even if we
> do not hit a CPU bug, it would provide me with more certainity that we
> are not chasing ghost.

I haven't figured out how to find the currently installed BIOS version.
The motherboard is Abit, which is no more, but I found an archive of all
of their downloads. I'll also check into updates from the Linux world.

> Another things to try, in vain, is to compile kernel with gcc or disable
> SMP.

It has survived 10 hours running two copies of mprime.

I just moved the boot drive over to another machine with the same type
of motherboard, but a different model AMD X2 CPU.

CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (2200.05-MHz 686-class CPU)
Origin = "AuthenticAMD" Id = 0x40fb2 Family = 0xf Model = 0x4b Stepping = 2
Features=0x178bfbff
Features2=0x2001
AMD Features=0xea500800
AMD Features2=0x1f
real memory = 2147483648 (2048 MB)
avail memory = 1940611072 (1850 MB)

I also have a fairly new quad core AMD box I can test on, as well as an
old dual P III machine.

This machine gets updated every month or so and I've never had stability
problems with it until just recently. It's definitely been using clang
for quite a while without any problems other than the ports mess.

> Peter, could you, please, try to reproduce the issue ? It does not look
> like a random hardware failure, since in all cases, it is curthread access
> which is faulting. The issue is only reported by Don, and so far only
> for i386 SMP.

The workload that is triggering this is
	portupgrade -fr lang/perl5.16
I've got 1000+ ports installed and this causes 400+ to be rebuilt. That
seems to cause it to panic about half the time. The last time it made
it through 268 ports before it croaked.
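
For anyone following the address arithmetic in the quoted exchange, here
is a minimal sketch that re-derives the three values mentioned there. It
only uses numbers from the thread. The 4 KB page size is an assumption
(the usual i386 page size), and the variable names are made up for
illustration; they are not kernel identifiers.

```python
# Re-derive the addresses quoted in the thread (i386 kernel addresses).
# All variable names are illustrative only, not kernel symbols.

PAGE_SIZE = 0x1000        # assumption: 4 KB pages on i386

thread_ptr = 0xd2f4e000   # new thread pointer, from the thread
pcb_ptr_offset = 0x254    # offset of the pcb pointer in the thread structure, from the thread
pcb_ptr = 0xf3d44d60      # word read at thread_ptr + pcb_ptr_offset, from the thread
load_offset = 0x8         # the faulting instruction is movl 0x8(%edx),%eax

# Address that holds the pointer to the new thread's pcb.
pcb_field_addr = thread_ptr + pcb_ptr_offset

# Address the faulting load reads from (pcb + 8).
fault_addr = pcb_ptr + load_offset

# "Tail" of the pcb pointer, i.e. its offset within its page.
pcb_page_offset = pcb_ptr % PAGE_SIZE

print(hex(pcb_field_addr))   # 0xd2f4e254
print(hex(fault_addr))       # 0xf3d44d68
print(hex(pcb_page_offset))  # 0xd60
```

The results match the addresses read from ddb above: 0xd2f4e254 for the
pcb pointer field, 0xf3d44d68 for the word that could be read without a
fault, and 0xd60 for the tail of the pcb pointer.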