Message-ID: <4EC4CCFF.8040704@freebsd.org>
Date: Thu, 17 Nov 2011 09:59:43 +0100
From: Stefan Esser
To: John Baldwin
Cc: Attilio Rao, freebsd-current@freebsd.org
References: <4EBB885E.9060908@freebsd.org> <4EC004BC.6060406@freebsd.org> <201111161116.24855.jhb@freebsd.org>
In-Reply-To: <201111161116.24855.jhb@freebsd.org>
Subject: Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

On 16.11.2011 17:16, John Baldwin wrote:
> On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
>> ...
>> WARNING: WITNESS option enabled, expect reduced performance.
>> Table 'FACP' at 0xba918a58
>> Table 'APIC' at 0xba918b50
>> Table 'SSDT' at 0xba918be8
>> Table 'MCFG' at 0xba918dc0
>> Table 'HPET' at 0xba918e00
>> ACPI: No SRAT table found
>> Preloaded elf kernel "/boot/kernel/kernel" at 0xffffffff81109000
>> Preloaded elf obj module "/boot/kernel/zfs.ko" at 0xffffffff81109370 <--
>> kldload: unexpected relocation type 67108875
>> kernel trap 12 with interrupts disabled
>>
>> The irritating detail is the load address of "zfs.ko", which is just
>> 0x370 bytes above the kernel load address ...
>
> That isn't unusual. Those are the addresses of the metadata provided by the
> loader, not the base address of the kernel or zfs.ko object themselves. The
> unexpected relocation type is interesting however. That value in hex is
> 0x400000b.
> 0xb is the R_X86_64_32S relocation type which is normal for the
> kernel. I think you just have a single-bit memory error due to a failing
> DIMM.

Thanks for the information about the load address semantics.

The other unexpected relocation type I observed was 268435457 == 0x10000001,
which also hints at a single-bit error.

But today the system failed with a different error:

ath0: ...
ioapic0: routing interrupt 18 to ...
panic: vm_page_insert: page already inserted

This could of course also be caused by a single-bit error ...

But the strange thing is that the system runs perfectly stable under load
(e.g. "make -j8 world") and that the ZFS ARC grows to some 6GB (of 8GB RAM
installed). I'd expect checksum errors to occur if there were a bad DIMM.

Anyway, I'll check with memtest86+ (or whatever best supports my system
with 8GB RAM) overnight.

The system boots reliably when switched off for less than a few hours (I
haven't determined the exact limit, but 3 hours are not sufficient to
reproduce the boot failure, while 10 hours cause the first boot attempt
to fail with 90% likelihood; the second one always succeeds).

I'm wondering whether the system RAM is not correctly initialized after
being powered off for 10 hours (but I do not understand why 3 hours
should not lead to the exact same initial state).

BTW: It suffices to have the system at power state S5 for 10 hours to
cause the boot failure, while less than 3 hours (without any power or
at S5) let the boot succeed on the first attempt.

>> I had already assumed that memory was corrupted during early start-up,
>> but now I think that gptzfsboot writes the zfs kernel module over the
>> start of the loaded kernel. I'll try some more tests later today.
>
> Nah, if zfs.ko were loaded over the beginning of the kernel you wouldn't even
> get to the point of the first kernel printf.

Yes, I see that in that case the failure would be less random than what I
observe (3 different kinds of panic and different warning messages before
the panic occurs).
But I still do not understand how the symptoms can be interpreted:

1) The system booted reliably for many months
2) It boots reliably when powered off for only a few hours
3) It fails on the first boot attempt after 10 hours or more
4) It never shows signs of instability after a successful boot

Hmmm, perhaps there is a problem with components at room temperature and
the system is still significantly warmer after 3 hours? I'll have to
check for such a thermal effect too ...

Best regards, STefan
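
The single-bit reading of the two relocation values quoted in this thread
(67108875 == 0x400000b and 268435457 == 0x10000001) can be checked
mechanically. The following is only a minimal sketch in Python: the helper
name explain_single_bit_flip and the KNOWN_TYPES table are hypothetical and
serve as illustration, the table lists just a few of the standard x86-64 ELF
relocation type numbers, and the 32-bit width is an assumption about the
size of the type field.

```python
# Subset of x86-64 ELF relocation type numbers (illustrative, not complete).
KNOWN_TYPES = {
    1: "R_X86_64_64",
    2: "R_X86_64_PC32",
    10: "R_X86_64_32",
    11: "R_X86_64_32S",
}


def explain_single_bit_flip(observed, width=32):
    """List (bit, original_type, name) for every single-bit toggle of
    `observed` that yields a relocation type present in KNOWN_TYPES."""
    candidates = []
    for bit in range(width):
        original = observed ^ (1 << bit)  # toggle exactly one bit
        if original in KNOWN_TYPES:
            candidates.append((bit, original, KNOWN_TYPES[original]))
    return candidates


if __name__ == "__main__":
    # The two "unexpected relocation type" values reported in the thread.
    for value in (67108875, 268435457):
        print(hex(value), explain_single_bit_flip(value))
```

For both reported values exactly one toggle leads back to a type in the
table: bit 26 for 0x400000b (giving 0xb, R_X86_64_32S) and bit 28 for
0x10000001 (giving 0x1, R_X86_64_64). That is consistent with a single-bit
error, but it does not say where the bit was flipped.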