From: John Baldwin
To: Stefan Esser
Cc: Attilio Rao, freebsd-current@freebsd.org
Date: Thu, 17 Nov 2011 11:33:34 -0500
Subject: Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
Message-Id: <201111171133.34108.jhb@freebsd.org>
In-Reply-To: <4EC4CCFF.8040704@freebsd.org>
References: <4EBB885E.9060908@freebsd.org> <201111161116.24855.jhb@freebsd.org> <4EC4CCFF.8040704@freebsd.org>

On Thursday, November 17, 2011 3:59:43 am Stefan Esser wrote:
> On 16.11.2011 17:16, John Baldwin wrote:
> > On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
> >> ...
> >> WARNING: WITNESS option enabled, expect reduced performance.
> >> Table 'FACP' at 0xba918a58
> >> Table 'APIC' at 0xba918b50
> >> Table 'SSDT' at 0xba918be8
> >> Table 'MCFG' at 0xba918dc0
> >> Table 'HPET' at 0xba918e00
> >> ACPI: No SRAT table found
> >> Preloaded elf kernel "/boot/kernel/kernel" at 0xffffffff81109000
> >> Preloaded elf obj module "/boot/kernel/zfs.ko" at 0xffffffff81109370 <--
> >> kldload: unexpected relocation type 67108875
> >> kernel trap 12 with interrupts disabled
> >>
> >> The irritating detail is the load address of "zfs.ko", which is just
> >> 0x370 bytes above the kernel load address ...
> >
> > That isn't unusual.  Those are the addresses of the metadata provided by the
> > loader, not the base address of the kernel or zfs.ko object themselves.  The
> > unexpected relocation type is interesting however.  That value in hex is
> > 0x400000b.  0xb is the R_X86_64_32S relocation type which is normal for the
> > kernel.  I think you just have a single-bit memory error due to a failing
> > DIMM.
>
> Thanks for the information about the load address semantics. The other
> unexpected relocation type I observed was 268435457 == 0x10000001, which
> also hints at a single bit error. But today the system failed with a
> different error:
>
> ath0: ...
> ioapic0: routing interrupt 18 to ...
> panic: vm_page_insert: page already inserted
>
> This could of course also be caused by a single bit error ...

Yes, very likely.

> Hmmm, perhaps there is a problem with components at room temperature
> and the system is still significantly warmer after 3 hours?

Yes, I strongly suspect it is a thermal effect that the RAM "works" once it is
warmed up.  If you have data you care about on the machine, I would just go
ahead and replace the RAM now before waiting for the RAM's failure to become
worse.

-- 
John Baldwin
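
For reference, the single-bit reasoning applied to the two relocation type
values in this thread can be checked mechanically. The following is only a
sketch: the function name and the small lookup table are illustrative
assumptions, not FreeBSD code. The table holds a few relocation type numbers
from the x86-64 ELF ABI (R_X86_64_64 = 1, R_X86_64_PC32 = 2, R_X86_64_32 = 10,
R_X86_64_32S = 11). It flips each bit of the observed value in turn and
reports any flip that yields a known type.

```python
# Subset of x86-64 ELF relocation type numbers, for illustration only.
KNOWN_RELOC_TYPES = {
    1: "R_X86_64_64",
    2: "R_X86_64_PC32",
    10: "R_X86_64_32",
    11: "R_X86_64_32S",
}


def single_bit_flip_candidates(observed, width=32):
    """Return (bit, original_value, name) for every known relocation type
    that differs from `observed` in exactly one bit position."""
    candidates = []
    for bit in range(width):
        original = observed ^ (1 << bit)  # undo a hypothetical flip of this bit
        if original in KNOWN_RELOC_TYPES:
            candidates.append((bit, original, KNOWN_RELOC_TYPES[original]))
    return candidates


if __name__ == "__main__":
    # The two "unexpected relocation type" values reported in the thread.
    for value in (67108875, 268435457):
        print(hex(value), single_bit_flip_candidates(value))
```

Under these assumptions, 67108875 (0x400000b) is one flipped bit (bit 26) away
from R_X86_64_32S, and 268435457 (0x10000001) is one flipped bit (bit 28) away
from R_X86_64_64, which is consistent with the failing-DIMM explanation.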