Date: Wed, 30 Nov 2011 11:16:25 +0100
From: Stefan Esser <se@freebsd.org>
To: John Baldwin <jhb@freebsd.org>
Cc: Attilio Rao <attilio@freebsd.org>, freebsd-current@freebsd.org
Subject: [SOLVED]: HW defect (was: Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT)
Message-ID: <4ED60279.10901@freebsd.org>
In-Reply-To: <201111171133.34108.jhb@freebsd.org>
References: <4EBB885E.9060908@freebsd.org> <201111161116.24855.jhb@freebsd.org> <4EC4CCFF.8040704@freebsd.org> <201111171133.34108.jhb@freebsd.org>
On 17.11.2011 17:33, John Baldwin wrote:
> On Thursday, November 17, 2011 3:59:43 am Stefan Esser wrote:
>> On 16.11.2011 17:16, John Baldwin wrote:
[...]
>>> That isn't unusual. Those are the addresses of the metadata provided by the
>>> loader, not the base address of the kernel or zfs.ko object themselves. The
>>> unexpected relocation type is interesting however. That value in hex is
>>> 0x400000b. 0xb is the R_X86_64_32S relocation type which is normal for the
>>> kernel. I think you just have a single-bit memory error due to a failing
>>> DIMM.
>>
>> Thanks for the information about the load address semantics. The other
>> unexpected relocation type I observed was 268435457 == 0x10000001, which
>> also hints at a single bit error. But today the system failed with a
>> different error:
>>
>> ath0: ...
>> ioapic0: routing interrupt 18 to ...
>> panic: vm_page_insert: page already inserted
>>
>> This could of course also be caused by a single bit error ...
>
> Yes, very likely.
>
>> Hmmm, perhaps there is a problem with components at room temperature
>> and the system is still significantly warmer after 3 hours?
>
> Yes, I strongly suspect it is a thermal effect that the RAM "works" once it
> is warmed up. If you have data you care about on the machine, I would just
> go ahead and replace the RAM now before waiting for the RAM's failure to
> become worse.

Thanks a lot, John!

I should have checked the hardware earlier, but since the system was
perfectly stable once it was up and running, I suspected an
initialization bug rather than defective RAM.

In fact, one of the 4GB DIMMs in the system returns bogus data
(0x10000000 or 0x04000000 instead of 0) for some 40 to 50 seconds after
power-on. Once warmed up, memtest86+ runs for days without a single
further data error (I wanted an estimate of how likely it is that the
defect has damaged data in files on disk).
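The single-bit reasoning in this thread can be checked with a little arithmetic. The following is a minimal sketch, not something posted in the thread: `flipped_bits` is a hypothetical helper, and the "expected" values are assumptions. 0xb (R_X86_64_32S) is the type John names; 0x1 is only assumed to be the intended value behind 0x10000001, since the thread does not say which relocation type was expected there.

```python
def flipped_bits(observed: int, expected: int) -> list[int]:
    """Return the bit positions at which observed differs from expected."""
    diff = observed ^ expected  # XOR leaves a 1 wherever the values disagree
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# (observed, assumed expected) pairs taken from the values quoted in the thread.
cases = [
    (0x400000B, 0xB),   # bad relocation type vs. R_X86_64_32S
    (268435457, 0x1),   # 0x10000001; expected value 0x1 is an assumption
    (0x10000000, 0x0),  # bogus DIMM read-back instead of 0
    (0x04000000, 0x0),  # bogus DIMM read-back instead of 0
]

for observed, expected in cases:
    bits = flipped_bits(observed, expected)
    kind = "single-bit" if len(bits) == 1 else "multi-bit"
    print(f"{observed:#011x} vs {expected:#x}: {kind} difference at bit(s) {bits}")
```

Under these assumptions each pair differs in exactly one bit, and the two relocation errors involve the same bit positions (26 and 28) as the two bogus values read back from the DIMM.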

When I was still doing hardware work, I always had a can of freezer
spray on my desk, which let me quickly cool down a DUT (device under
test) by a few tens of degrees. Without such a tool, I had to wait for
the components to cool down overnight between tests.

Best regards, STefan