Date: Wed, 16 Nov 2011 11:16:24 -0500
From: John Baldwin <jhb@freebsd.org>
To: freebsd-current@freebsd.org
Cc: Attilio Rao <attilio@freebsd.org>
Subject: Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
Message-ID: <201111161116.24855.jhb@freebsd.org>
In-Reply-To: <4EC004BC.6060406@freebsd.org>
References: <4EBB885E.9060908@freebsd.org> <CAJ-FndANGDEhiMm99Sx2__CNg3fxi8xtaU1GLugB3e-EOrf5Sg@mail.gmail.com> <4EC004BC.6060406@freebsd.org>
On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
> On 11.11.2011 13:15, Attilio Rao wrote:
> > Can you try rebuilding your kernel and modules from scratch and see if
> > it fixes your problem?
>
> Sorry for the delay, but my system seems to need to be turned off (S5)
> for many hours (whole night) to reproduce the problem ...
>
> I had already rebuilt my kernel multiple times in the last weeks. But
> just to be sure, I removed the build directories for kernel and world
> and built a new kernel after building and installing world from scratch.
> The next reboot (with boot blocks from the freshly built world) failed
> again ...
>
> But the first lines of boot messages look strange:
>
> ...
> WARNING: WITNESS option enabled, expect reduced performance.
> Table 'FACP' at 0xba918a58
> Table 'APIC' at 0xba918b50
> Table 'SSDT' at 0xba918be8
> Table 'MCFG' at 0xba918dc0
> Table 'HPET' at 0xba918e00
> ACPI: No SRAT table found
> Preloaded elf kernel "/boot/kernel/kernel" at 0xffffffff81109000
> Preloaded elf obj module "/boot/kernel/zfs.ko" at 0xffffffff81109370 <--
> kldload: unexpected relocation type 67108875
> kernel trap 12 with interrupts disabled
>
> The irritating detail is the load address of "zfs.ko", which is just
> 0x370 bytes above the kernel load address ...

That isn't unusual. Those are the addresses of the metadata provided by
the loader, not the base addresses of the kernel or zfs.ko objects
themselves. The unexpected relocation type is interesting, however. That
value in hex is 0x400000b. 0xb is the R_X86_64_32S relocation type, which
is normal for the kernel. I think you just have a single-bit memory error
due to a failing DIMM.

> A verbose boot scrolls these lines off the screen too fast (and is too
> long to be preserved in dmesg.boot from the start), so I do not have any
> idea whether other values are reported in case of a successful boot.
>
> I had already assumed that memory was corrupted during early start-up,
> but now I think that gptzfsboot writes the zfs kernel module over the
> start of the loaded kernel. I'll try some more tests later today.

Nah, if zfs.ko were loaded over the beginning of the kernel you wouldn't
even get to the point of the first kernel printf.

-- 
John Baldwin