From owner-freebsd-current@FreeBSD.ORG Wed Nov 16 16:23:29 2011
From: John Baldwin
To: freebsd-current@freebsd.org
Cc: Attilio Rao
Date: Wed, 16 Nov 2011 11:16:24 -0500
Subject: Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
Message-Id: <201111161116.24855.jhb@freebsd.org>
In-Reply-To: <4EC004BC.6060406@freebsd.org>
References: <4EBB885E.9060908@freebsd.org> <4EC004BC.6060406@freebsd.org>

On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
> On 11.11.2011 13:15, Attilio Rao wrote:
> > Can you try rebuilding your kernel and modules from scratch and see if
> > it fixes your problem?
>
> Sorry for the delay, but my system seems to need to be turned off (S5)
> for many hours (whole night) to reproduce the problem ...
>
> I had already rebuilt my kernel multiple times in the last weeks. But
> just to be sure, I removed the build directories for kernel and world
> and built a new kernel after building and installing world from scratch.
> The next reboot (with boot blocks from the freshly built world) failed
> again ...
>
> But the first lines of boot messages look strange:
>
> ...
> WARNING: WITNESS option enabled, expect reduced performance.
> Table 'FACP' at 0xba918a58
> Table 'APIC' at 0xba918b50
> Table 'SSDT' at 0xba918be8
> Table 'MCFG' at 0xba918dc0
> Table 'HPET' at 0xba918e00
> ACPI: No SRAT table found
> Preloaded elf kernel "/boot/kernel/kernel" at 0xffffffff81109000
> Preloaded elf obj module "/boot/kernel/zfs.ko" at 0xffffffff81109370 <--
> kldload: unexpected relocation type 67108875
> kernel trap 12 with interrupts disabled
>
> The irritating detail is the load address of "zfs.ko", which is just
> 0x370 bytes above the kernel load address ...

That isn't unusual. Those are the addresses of the metadata provided by
the loader, not the base addresses of the kernel or zfs.ko objects
themselves.

The unexpected relocation type is interesting, however. That value in hex
is 0x400000b. 0xb is the R_X86_64_32S relocation type, which is normal for
the kernel, so the reported value is a valid type with one extra bit
(bit 26) set. I think you just have a single-bit memory error due to a
failing DIMM.

> A verbose boot scrolls these lines off the screen too fast (and is too
> long to be preserved in dmesg.boot from the start), so I do not have any
> idea whether other values are reported in case of a successful boot.
>
> I had already assumed that memory was corrupted during early start-up,
> but now I think that gptzfsboot writes the zfs kernel module over the
> start of the loaded kernel. I'll try some more tests later today.

Nah, if zfs.ko were loaded over the beginning of the kernel you wouldn't
even get to the point of the first kernel printf.

-- 
John Baldwin
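
For anyone who wants to check the arithmetic behind the single-bit
diagnosis, here is a minimal Python sketch. It is only an illustration and
not code from the loader or the kernel: the helper name flipped_bits is
made up for this example, and it assumes the relocation that was actually
intended was R_X86_64_32S (type 11), which is a guess based on the low
byte of the reported value.

```python
# Illustrative sketch only; not code from the FreeBSD loader or kernel.
# Assumption: the intended relocation type was R_X86_64_32S (11).
R_X86_64_32S = 11


def flipped_bits(observed, expected):
    """Return the positions of the bits that differ between two values."""
    diff = observed ^ expected
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# Value printed by "kldload: unexpected relocation type".
observed = 67108875

print(hex(observed))                         # 0x400000b
print(hex(observed & 0xff))                  # 0xb
print(flipped_bits(observed, R_X86_64_32S))  # [26]
```

Exactly one differing bit is what a single flipped memory cell would
produce; corruption from an overwritten image would normally disturb many
bits at once.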